It is well known that although overparametrized neural networks generalize well, the minimum number of weights required for good generalization performance is much smaller than the model sizes typically used by practitioners [1,2,3]. If we want to run neural networks in real time on our mobile phones, we need sparser networks to save battery power and reduce memory usage. In this blogpost I want to discuss the most important approaches to obtain sparsity in neural networks: pruning [4,5,1], dropout [6,7], and knowledge distillation [8,9]. Other approaches not discussed here include sparsity-inducing priors [10] and regularization-based methods, such as L0 regularization [11].

Pruning
Let's start with the simplest method: pruning [4,5]. There are 4 steps (a minimal code sketch of this loop follows the list):

  1. Train a large network until it generalizes well.
  2. Remove a bunch of weights, and all connections to/from these weights.
     * To select which weights to remove, one can use e.g., a magnitude-based measure [12] or information about the loss surface [4].
  3. Train the new, sparser network until it generalizes well.
  4. Repeat steps 2-3 until the desired sparsity level has been achieved.
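To make the procedure concrete, here is a minimal sketch of iterative magnitude-based pruning, assuming PyTorch and fully-connected layers; the model, training loop, and sparsity schedule are placeholders (in practice `torch.nn.utils.prune` offers similar functionality):

```python
import torch
import torch.nn as nn

def magnitude_prune(model, fraction):
    """Zero out the `fraction` smallest-magnitude weights of every Linear layer
    and return masks so that the pruned connections stay removed."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(fraction * w.numel())
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).float()  # keep only the larger weights
            module.weight.data *= mask            # step 2: remove connections
            masks[name] = mask
    return masks

def apply_masks(model, masks):
    """Re-apply the masks (e.g., after every optimizer step in step 3)
    so that pruned weights remain exactly zero during retraining."""
    for name, module in model.named_modules():
        if name in masks:
            module.weight.data *= masks[name]
```

Between pruning rounds, `apply_masks` would be called after each optimizer step so the removed weights stay at zero while the surviving weights are retrained; repeating train/prune/retrain implements the loop in steps 2-3.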

Pruning can result in networks with over 90% sparsity without losing a noticeable amount of test accuracy compared to the original dense network. Pruning was originally conceived as a method to prevent overfitting and thus improve the generalization capacity of the network. In fact, pruning has indeed been found to increase accuracy in some cases [13]. On the other hand, contrary to this intuition, large dense networks appear to generalize remarkably well (see my other blogpost on Generalization Error for a further discussion of this topic), so why would pruning have this effect?

One explanation could arise from the minimum description length principle. Pruning makes the learned neural network parameters less sensitive to noise, which lowers the description length required and thus improves the generalization capacity [14].

A natural question would be: if sparser networks can get good generalization performance, why bother starting with a dense network and then doing the whole pruning procedure? Can’t you just start off with a sparser network and train this?

Until recently, this was thought not to be possible: the redundancy of large, dense neural networks was considered vital for obtaining good test accuracies. However, in 2018 Frankle and Carbin [15] claimed that they could actually train sparse networks from scratch and obtain good test accuracies. They find these sparse networks by pruning large, dense networks, but then reset the weights of the sparse network to their values before training, so that the sparse network is trained in isolation from the denser network. They argue that using the exact same initialization for the sparse network as was used for the larger network is important for obtaining good generalization performance. In follow-up work, Liu et al. [16] argue that random initialization of the sparse network also leads to good test accuracies, while Zhou et al. [17] find that the magnitudes of the original initialization can be ignored, but that the signs of the initial weights play a key role.
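A minimal sketch of the rewinding step described above, reusing the hypothetical `magnitude_prune`/`apply_masks` helpers and a placeholder `train` loop from the pruning sketch; this only illustrates the procedure, not the paper's exact setup:

```python
import copy

# 1. Save the weights at initialization.
init_state = copy.deepcopy(model.state_dict())

# 2. Train the dense network and prune it to obtain a sparsity mask.
train(model)                                  # hypothetical training loop
masks = magnitude_prune(model, fraction=0.9)

# 3. Rewind the surviving weights to their initial values ("winning ticket")
#    and train the sparse subnetwork in isolation.
model.load_state_dict(init_state)
apply_masks(model, masks)
train(model)  # re-apply the masks after each optimizer step during training
```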

All these papers share the same take-away message:

Pruning effectively acts as an architecture search.

Pruning allows us to build a skeleton for the neural network, where the exact values of the trained parameters are less important [19].

Dropout
In dropout [6,7] we only use part of the network at a time. At each training iteration we randomly select units that are temporarily ignored ("dropped out"), together with their incoming and outgoing connections, where each unit is dropped with probability p. Dropped units do not contribute to the forward pass and their weights are not updated, but they can be re-activated at the next iteration. The key idea of dropout is:

Dropout combines information from an ensemble of sparse subnetworks, among which many weights are shared. Randomly dropping out units at each step mitigates overfitting by ensuring that the neural network doesn't memorise the full training set.
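As a minimal sketch, (inverted) dropout can be implemented with a random mask over the activations, assuming PyTorch; in practice one would simply use `torch.nn.Dropout`:

```python
import torch

def dropout_forward(x, p=0.5, training=True):
    """Inverted dropout: zero each activation with probability p during training
    and rescale the survivors so the expected activation stays the same."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()   # each unit kept with probability 1 - p
    return x * mask / (1.0 - p)
```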

Dropout can also be used to obtain sparse networks. For example, targeted dropout [20] uses dropout to reduce the network's dependence on unimportant weights (e.g., weights with a small magnitude), which can subsequently be pruned away without a loss in accuracy. Other examples include sparse variational dropout [21] and Ising-dropout [22].
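A rough sketch of the targeted-dropout idea (not the authors' implementation from [20]): restrict dropout to the lowest-magnitude weights of each output unit, so the network learns not to rely on them before they are pruned. Assumes PyTorch; `drop_fraction` and `p` are illustrative hyperparameters:

```python
import torch

def targeted_weight_dropout(weight, drop_fraction=0.5, p=0.5, training=True):
    """Only the `drop_fraction` smallest-magnitude weights of each output unit
    are dropout candidates; each candidate is dropped with probability p."""
    if not training:
        return weight
    k = int(drop_fraction * weight.shape[1])
    if k == 0:
        return weight
    # per-row magnitude threshold separating candidates from important weights
    thresh = weight.abs().kthvalue(k, dim=1, keepdim=True).values
    candidates = weight.abs() <= thresh
    drop = candidates & (torch.rand_like(weight) < p)
    return weight * (~drop).float()
```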

Knowledge Distillation
The main idea of knowledge distillation [8,9] is to distill knowledge from a deep, dense "teacher" network into a small "student" network without losing much test accuracy. Hidden in the teacher's softmax outputs are confidence estimates over the classes, which can be exposed by raising the temperature of the softmax. This "dark knowledge" allows the teacher network to convey more information to its student network than the hard labels alone.
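A minimal sketch of a distillation loss along these lines, assuming PyTorch; `T` (the temperature) and `alpha` (the weight on the soft targets) are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Weighted sum of (i) KL divergence between the temperature-softened teacher
    and student distributions and (ii) cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-target gradients keep a comparable magnitude
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```

The student is trained on a mixture of the teacher's temperature-softened outputs and the usual hard labels; raising the temperature is what exposes the relative probabilities the teacher assigns to the incorrect classes.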

I do want to note at this point that one should not directly equate softmax outputs with model confidence, as shown by Gal and Ghahramani [23]. DNNs are typically overconfident in their predictions and can classify unrecognizable images with high confidence [24]. Furthermore, small adversarial perturbations to an input image can completely change the softmax outputs and lead to confident misclassifications [25].

Discussion
In the introduction I mentioned that sparsity in neural networks is desirable because it allows us to run neural networks in real time in resource-constrained computing environments. In practice, however, not all forms of sparsity are equally effective at lowering the computational cost. Most compression methods lead to networks with non-structured sparsity, which limits the amount of speed-up obtained [26]. This is something to consider when developing new compression methods.

References
[1] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. NIPS, 28:1135–1143, 2015.

[2] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. ICLR, 2017.

[3] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv:1804.08838, 2018.

[4] Y. LeCun, J.S. Denker, and S.A. Solla. Optimal brain damage. NIPS, pages 598–605, 1990.

[5] B. Hassibi and D.G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. NIPS, 1993.

[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

[7] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[8] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop, 2015.

[10] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. arXiv:1705.08665, 2017.

[11] C. Louizos, M. Welling, and D.P. Kingma. Learning sparse neural networks through L0 regularization. CoRR, 2017.

[12] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. ICML, 2018.

[13] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv:1710.01878, 2017.

[14] B. R. Bartoldson, A. S. Morcos, A. Barbu, and G. Erlebacher. The generalization-stability tradeoff in neural network pruning. arXiv:1906.03728, 2019.

[15] J. Frankle and M. Carbin. The lottery ticket hypothesis: Training pruned neural networks. arXiv:1803.03635, 2018.

[16] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. CoRR, 2018.

[17] H. Zhou, J. Lan, R. Liu, and J. Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv:1905.01067, 2019.

[18] T. Gale, E. Elsen, and S. Hooker. The state of sparsity in deep neural networks. arXiv:1902.09574, 2019.

[19] E. J. Crowley, J. Turner, A. Storkey, and M. O’Boyle. Pruning neural networks: is it time to nip it in the bud? arXiv:1810.04622, 2018.

[20] A. N. Gomez, I. Zhang, S. R. Kamalakara, D. Madaan, K. Swersky, Y. Gal, and G. E. Hinton. Learning sparse networks using targeted dropout. arXiv:1905.13678, 2019.

[21] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. ICML, 2017.

[22] H. Salehinejad and S. Valaee. Ising-dropout: A regularization method for training and compression of deep neural networks. arXiv:1902.08673, 2019.

[23] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142, 2015.

[24] A.M. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR, 2014.

[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2014.

[26] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. NIPS, pages 2074–2082, 2016.