In the past decade, deep learning has achieved state-of-the-art performance on various machine learning tasks. The wide applicability of deep learning mainly arises from the ability of deep neural networks to learn useful features by themselves. By combining multiple layers in a neural network, hierarchical representations can be created with an increasing level of abstraction. However, there is one fundamental difficulty in training deep neural networks, which is known as the vanishing gradient problem. Today, this issue has been alleviated by new activation functions and better ways to initialise weights. Although less severe, internal covariate shift also slows down learning in deep networks. Several techniques, such as batch normalisation, have been proposed to counter this issue by normalising the data in each layer. An alternative approach is to construct the layers in a network so that their activations have the same mean and variance as the input data. Networks that are capable of doing so are called self-normalising neural networks (SNNs) and can be constructed by enforcing certain characteristics on the mappings that each layer induces on the moments of its input data.
In this thesis, we extend the analysis of the moments in SNNs to the backward pass, investigating the backward dynamics both theoretically and empirically. Further, we extend SNNs to networks with bias units and show that similar dynamics hold for the biases as for the weights. Additionally, we compare the performance and learning behaviour of SNNs across different data normalisation techniques, initial weight distributions and optimisers on several machine learning benchmark data sets. We find that (1) the variance of the weights steadily increases in very small steps in networks with random errors, (2) the architecture affects how the error signal propagates back through the network and (3) an error signal with reduced variance in lower layers can be advantageous for learning. Furthermore, our analysis reveals that SNNs perform best with whitened data, that the choice of the initial weight distribution has no significant effect on the learning behaviour and that most adaptive learning rate schedules do help, although they break the conditions for self-normalisation.