When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece - be it for an academic interest in emulating human perception, or for practical applications - we thus cannot directly replicate the steps taken by a human. We can, however, exploit the fact that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations - referred to as deep neural networks - and promises to obtain better results or to require less domain knowledge than more traditional techniques. In particular, I employ fully-connected neural networks for music and speech detection and for accelerating music similarity measures, and convolutional neural networks for detecting note onsets, musical segment boundaries and singing voice. In doing so, I evaluate both how well and in what way the networks solve the respective tasks. Using the example of singing voice detection, I additionally develop data augmentation methods to learn from only a few annotated music pieces, and a recipe to obtain temporally accurate predictions from inaccurate training examples.
The results of my work surpass the previous state of the art in all the tasks considered. The learned solutions are similar to existing hand-designed approaches, but are more extensively optimized than would be possible by hand. Both findings indicate that the same methods could also yield substantial improvements for other machine listening problems. The self-contained description of my work - including a thorough introduction to all relevant deep learning and signal processing techniques - and my contributions to several open-source software projects are intended to help other researchers and practitioners accomplish exactly that. In conclusion, this thesis both advances the state of the art in five concrete applications and, on a higher level, participates in the ongoing democratization of deep learning.