In this thesis, we investigate building an audio event detection system that learns from a large number of audio clips tagged with inaccurate labels, in order to accurately predict and localize sound events in audio signals. We adapt recent deep learning techniques to tackle this problem and, to demonstrate the capabilities of our method, compare our results with state-of-the-art systems by evaluating our models on publicly available datasets. We examine various deep neural network architectures as well as different training strategies, guided by recent work in this field and by our own experiments and intuitions. Our empirical results show that stacking recurrent neural network layers on top of convolutional layers achieves the best performance when trained with our proposed "log-scaled-share mini-batch building" strategy. With this setup, we achieve better performance than the state-of-the-art.
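The convolutional-recurrent stacking described above can be sketched as follows. This is a minimal illustration only, assuming PyTorch, a log-mel spectrogram input, and hypothetical layer sizes; it is not the exact model or training strategy proposed in this thesis:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative CRNN: convolutional feature extractor followed by a
    recurrent layer over time, producing per-frame event probabilities."""

    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        # Convolutional front end: pool only along frequency so the
        # time resolution needed for event localization is preserved.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        # Recurrent layer stacked on top of the convolutional features.
        self.rnn = nn.GRU(64 * (n_mels // 16), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, 1, time, n_mels)
        h = self.conv(x)                     # (batch, 64, time, n_mels // 16)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.rnn(h)                   # (batch, time, 2 * hidden)
        return torch.sigmoid(self.head(h))   # per-frame class probabilities
```

Because the frequency axis is pooled while the time axis is untouched, the model emits one prediction per input frame, which is what allows events to be both detected and localized in time despite training on clip-level labels.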