The Lottery Ticket Hypothesis - essential papers

The Lottery Ticket Hypothesis has recently been a hot topic in the deep learning community. In this article, I review 4 essential papers related to the Lottery Ticket Hypothesis. Let us take a closer look, step by step, at what the Lottery Ticket Hypothesis is, how it was developed, and why it is important.

Pruning – historical context

The Lottery Ticket Hypothesis builds on a 30-year-old concept from training NNs (neural networks) – pruning (LeCun et al. 1990 [3]). Pruning is applied at the end of training by setting a given percentage (typically 90% or even 99%) of the lowest-magnitude weights in the NN to 0. The assumption is that the smallest weights play little role in the network.

After pruning, the NN (which now consists of sparse matrices) can still be used for inference, just like the original trained NN with all its weights. Typically, pruning can reduce the number of parameters by orders of magnitude without compromising accuracy too much.
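To make this concrete, here is a minimal sketch of magnitude pruning in PyTorch. The framework choice and the helper name magnitude_prune are my own illustration, not taken from the papers: the lowest-magnitude weights are zeroed out via a binary mask.

```python
import torch

def magnitude_prune(weights: torch.Tensor, fraction: float) -> torch.Tensor:
    """Return a binary mask that removes the `fraction` lowest-magnitude weights."""
    k = max(1, int(fraction * weights.numel()))              # how many weights to drop
    threshold = weights.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return (weights.abs() > threshold).float()               # 1 = keep, 0 = pruned

# Example: prune 90% of a trained layer's weights
trained = torch.randn(256, 128)       # stand-in for a trained weight matrix
mask = magnitude_prune(trained, 0.9)
pruned = trained * mask               # sparse weights, still usable for inference
```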

Pruning has long been considered a standard approach for NN compression after training. Apart from some refinements – for example, variations of structured pruning – there have been no substantial changes in the key concepts related to pruning until now. The picture below shows the process of training and pruning a NN.

The process of training and pruning neural networks (NNs). Image by Aleksandra Chrabrowa.

Introduction of the Lottery Ticket Hypothesis

It has long been believed that pruning is only a post-processing step and that pruned networks cannot be trained from scratch. Recently, a new idea emerged that not only takes pruning one step further but also changes our understanding of the most fundamental aspects of NN training: the Lottery Ticket Hypothesis (Frankle, Carbin 2019 [1]).

It turns out that pruned networks can be used to create so-called winning tickets. Simply put, we take the pruned network and change the non-zero weights back to their original random initializations from the beginning of the training process. The resulting network is called a winning ticket. Such winning tickets can later be trained from scratch and achieve the same or better performance than the original network. The only condition is that, after pruning, the remaining weights are changed back to their original initialization. The combination of the pruning mask and the lucky initial weights is essential to the success of a winning ticket: "The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective." (Frankle, Carbin 2019 [1])
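As an illustration, here is a minimal sketch (in PyTorch, with assumed helper and variable names) of how a winning ticket and a random ticket could be assembled from a pruning mask; the papers' actual implementations differ in details.

```python
import copy
import torch
import torch.nn as nn

def make_winning_ticket(initial_state: dict, masks: dict) -> dict:
    """Winning ticket: the pruning mask combined with the ORIGINAL initialization."""
    ticket = copy.deepcopy(initial_state)
    for name, mask in masks.items():
        ticket[name] = initial_state[name] * mask   # keep the 'lucky' initial weights
    return ticket

def make_random_ticket(masks: dict, fresh_model: nn.Module) -> dict:
    """Random ticket: the same pruning mask applied to a FRESH random initialization."""
    fresh = {n: p.detach().clone() for n, p in fresh_model.named_parameters()}
    return {n: fresh[n] * masks[n] if n in masks else fresh[n] for n in fresh}

# Typical flow: save initial_state = copy.deepcopy(model.state_dict()) before training,
# train, derive `masks` by magnitude pruning, then reload the winning ticket and retrain
# (re-applying the masks after every optimizer step so pruned weights stay at zero).
```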

Take a look at the picture below to see how winning and random lottery tickets are created from pruned NNs. The definition of random lottery tickets will be introduced later.

The process of creating winning and random lottery tickets from pruned neural networks (NNs). The definition of random lottery tickets will be introduced later. Image by Aleksandra Chrabrowa.

Essential papers

There are 4 papers that describe the essential work related to the lottery ticket hypothesis, all released in a short time frame. The first 2 papers are from MIT, the other 2 are from Facebook.

  1. Frankle, Carbin, The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (2019) [1]. This paper introduces the lottery ticket hypothesis. Winning lottery tickets are found efficiently only for smaller networks. Experiments are on image datasets only.
  2. Frankle, Dziugaite, et al., Stabilizing the Lottery Ticket Hypothesis (2019) [2]. This paper adds some refinements to the procedure of obtaining winning tickets for larger networks. Again, only image datasets are used.
  3. Morcos, Yu, et al., One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers (2019) [4]. This paper explores the idea of transferring winning tickets obtained with one training configuration (optimizer and dataset) to another. Again, only image datasets are used.
  4. Yu, Edunov, et al., Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP (2020) [5]. This paper describes the first winning tickets in the RL and NLP domains. It also reviews various techniques for finding winning tickets.

These 4 papers are an interesting example of how a new research idea develops in a series of follow-up papers. The rest of the article describes the relationship between the ideas in these 4 papers.

The Lottery Ticket Hypothesis tricks of the trade – a glossary

The Lottery Ticket Hypothesis is still a work in progress. The substantial computational resources needed for experiments slow down the work, and the optimal strategies for generating winning tickets have been worked out across multiple papers. Below is a glossary of key concepts, ordered as they were introduced.

Winning ticket vs random ticket. A winning ticket is a pruned subnetwork whose surviving weights are reset to their original initialization; a random ticket uses the same pruning mask but freshly sampled random weights. Usage: All 4 papers examine winning and random tickets. The authors agree that winning tickets outperform random tickets. However, winning tickets are sometimes hard to find. Winning and random tickets are shown in the picture above.

Iterative pruning vs one-shot pruning. In one-shot pruning, the network is trained once and pruned to the target sparsity in a single step; in iterative pruning, the network is trained, pruned by a small fraction, and rewound repeatedly over several rounds. Usage: Iterative pruning was first used by Frankle, Carbin (2019) [1]: "... iterative pruning finds winning tickets that match the accuracy of the original network at smaller sizes than does one-shot pruning." All 4 papers use iterative pruning and confirm its advantage over one-shot pruning.
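Below is a minimal sketch of iterative magnitude pruning with rewinding, assuming a user-supplied train function and a saved initial_state; the helper names and the per-round pruning rate are illustrative, not taken from the papers.

```python
import torch

def iterative_pruning(model, initial_state, train, rounds=5, prune_per_round=0.2):
    """Prune a small fraction per round and rewind, instead of one-shot pruning."""
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if "weight" in n}
    for _ in range(rounds):
        train(model, masks)                                  # train with the current mask
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            alive = p.detach()[masks[n].bool()]              # weights that survive so far
            k = max(1, int(prune_per_round * alive.numel()))
            threshold = alive.abs().kthvalue(k).values       # prune lowest among survivors
            masks[n] = masks[n] * (p.detach().abs() > threshold).float()
        model.load_state_dict(initial_state)                 # rewind to the original init
    return masks
```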

Global vs local pruning. In local (layer-wise) pruning, a fixed fraction of the lowest-magnitude weights is removed from each layer separately; in global pruning, the lowest-magnitude weights are removed across the whole network at once, so the resulting sparsity can differ between layers. Usage: Global pruning is used by Frankle, Carbin (2019) [1] for larger NNs, whereas local pruning is used for smaller NNs. Global pruning is essential for Morcos, Yu, et al. (2019) [4]. It is also used by Yu, Edunov, et al. (2020) [5].
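The difference between the two variants can be sketched as follows (PyTorch, illustrative helper names): global pruning uses one magnitude threshold across all layers, local pruning a separate threshold per layer.

```python
import torch

def global_prune_masks(params: dict, fraction: float) -> dict:
    """Global pruning: one threshold over ALL layers; sparsity varies per layer."""
    all_weights = torch.cat([p.abs().flatten() for p in params.values()])
    k = max(1, int(fraction * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values
    return {n: (p.abs() > threshold).float() for n, p in params.items()}

def local_prune_masks(params: dict, fraction: float) -> dict:
    """Local pruning: the same fraction is removed from every layer separately."""
    masks = {}
    for n, p in params.items():
        k = max(1, int(fraction * p.numel()))
        threshold = p.abs().flatten().kthvalue(k).values
        masks[n] = (p.abs() > threshold).float()
    return masks
```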

Learning rate warmup. The learning rate is increased gradually from a small value to its target value over the first training iterations. Usage: This was necessary for Frankle, Carbin (2019) [1] for large NNs (ResNet-18 and VGG-19) but was not deemed satisfactory due to tricky hyperparameter tweaking. It was later replaced by late rewinding.
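What warmup looks like in practice can be sketched in a few lines; this is a generic linear schedule, not necessarily the exact schedule used in the paper.

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from ~0 to `target_lr` over `warmup_steps` updates."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```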

Late rewinding (also called late resetting). Instead of rewinding the surviving weights to their values at initialization, they are rewound to their values from an early point in training (for example, after a small number of iterations). Usage: It was first examined by Frankle, Dziugaite, et al. (2019) [2]. It enabled finding winning tickets for large NNs (for example ResNet-50), eliminated the need for learning rate warmup, and allowed pruning more weights. The importance of late rewinding was confirmed by Morcos, Yu, et al. (2019) [4]. Later, Yu, Edunov, et al. (2020) [5] stated that late rewinding improved performance but was not essential.
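A minimal sketch of late rewinding, assuming an early checkpoint of the weights was saved during training (the checkpoint iteration and helper names are illustrative):

```python
import copy

def late_rewind_ticket(early_checkpoint: dict, masks: dict) -> dict:
    """Rewind surviving weights to an EARLY-TRAINING checkpoint, not to iteration 0."""
    ticket = copy.deepcopy(early_checkpoint)
    for name, mask in masks.items():
        ticket[name] = early_checkpoint[name] * mask
    return ticket

# Typical flow: save early_checkpoint = copy.deepcopy(model.state_dict()) after a small
# number of training iterations (instead of before training), then proceed as before.
```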

To sum up, best practices for generating winning tickets are an active area of research. Iterative pruning and rigorously examining all pruning parameters are the major computational bottlenecks. However, it seems that the basic methodology for generating winning tickets has emerged over time.

One plot explains it all

Papers often begin with one plot that presents the most important results of the research – the result the authors wish to be the main take-home message of the paper. Let us look at the first plot in the first paper about the Lottery Ticket Hypothesis – Frankle, Carbin (2019) [1].

Figure 1 from Frankle, Carbin (2019) [1]. The iteration at which early-stopping would occur (left) and the test accuracy at that iteration (right) of the Lenet architecture for MNIST and the Conv-2, Conv-4, and Conv-6 architectures for CIFAR10 (see Figure 2) when trained starting at various sizes. Dashed lines are randomly sampled sparse networks (average of ten trials). Solid lines are winning tickets (average of five trials). Take-home message: winning tickets match the test accuracy of the original network after training for at most the same number of iterations.[1]

Figure 1 from Frankle, Carbin (2019) [1] compares the performance of winning and random tickets as a function of the percentage of weights remaining after pruning. For each experiment, both curves have a common starting point at 100% of weights remaining (left side of the plots), because 100% of weights remaining means no pruning, and without pruning winning and random tickets are equivalent. However, as the pruning rate increases, the behaviour of winning and random tickets diverges. In particular, training time (proportional to the early-stopping iteration, plotted on the first two charts) remains almost constant until the percentage of weights remaining drops to about 5-10%, with a flat minimum somewhere in the middle. As training time remains constant or even improves slightly, the accuracy does not suffer much – the solid curves on the last two charts stay almost flat up to quite high pruning rates.

We can generate winning tickets that have only 5-10% of the weights and almost the same accuracy!

Transferring winning tickets

Winning tickets are costly to find. Discovering them gave us a better understanding of fundamental concepts related to NNs, but are there any practical implications? Pruning has a simple practical application – NN compression. What is the advantage of finding a small subnetwork that can be successfully trained from scratch if, in order to do so, the full network must be trained beforehand?

Morcos, Yu, et al. (2019) [4] seem to have an answer for that. They introduce winning ticket transfer – finding the winning ticket for one training configuration (optimizer, dataset) and using it to train a NN with the same architecture but a different configuration. The results are good, in particular when transferring from larger datasets to smaller ones. This opens new opportunities for transfer learning – transferring optimal (pruned) initializations of large NNs from larger datasets to smaller ones.
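In code, the transfer itself is simple: reuse the mask and rewound weights found on the source configuration as the starting point for the target configuration. A sketch with assumed helper names:

```python
def transfer_ticket(model, ticket_state: dict, masks: dict, train_on_target):
    """Initialize the same architecture with a ticket found on a different dataset/optimizer."""
    model.load_state_dict(ticket_state)    # mask + rewound weights from the source setup
    train_on_target(model, masks)          # train on the target data, keeping pruned weights at zero
    return model
```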

Summary

The 4 papers described above are an interesting example of how a good idea develops. At first, preliminary results are presented in a carefully selected setting – smaller NNs and the image domain only (Frankle, Carbin 2019 [1]). Gradually, the technical details take their final shape (Frankle, Dziugaite et al. 2019 [2]; Morcos, Yu et al. 2019 [4]; Yu, Edunov et al. 2020 [5]) and the idea is transferred to other domains – NLP and RL. Practical applications are discovered – winning ticket transfer (Morcos, Yu et al. 2019 [4]). It is beneficial to publish the early stages of your work (if your idea is good enough).

On the other hand, this rapid development of the Lottery Ticket Hypothesis makes it a little convoluted to get the gist of the story from multiple papers. Fun fact: look at Frankle, Carbin 2019 [1] or Frankle, Dziugaite et al. 2019 [2] – these papers have between 3 and 5 versions, and each of them changed its title once.

What remains to be done? Let the authors (Frankle, Carbin 2019 [1]) voice their concerns: "Sparse pruning is our only method for finding winning tickets. Although we reduce parameter-counts, the resulting pruning architectures are not optimized for modern libraries or hardware... In future work, we intend to study other pruning methods from extensive contemporary literature, such as structured pruning (which would produce networks optimized for contemporary hardware) and non-magnitude pruning methods (which could produce smaller winning tickets or find them earlier)."

Further reading

If you are interested in reading more on the Lottery Ticket Hypothesis, apart from the original papers, I recommend:

References

[1] Jonathan Frankle and Michael Carbin, The lottery ticket hypothesis: Finding sparse, trainable neural networks (2019), in International Conference on Learning Representations.

[2] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin, Stabilizing the Lottery Ticket Hypothesis (2019).

[3] Yann LeCun, John S. Denker, and Sara A. Solla, Optimal brain damage (1990), in Advances in Neural Information Processing Systems, 598-605.

[4] Ari S. Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian, One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers (2019), in Conference on Neural Information Processing Systems.

[5] Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos, Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP (2020), in International Conference on Learning Representations.

Who am I?

I am Aleksandra Chrabrowa and here is my LinkedIn profile.

You can contact me via e-mail at ok.l1m3k at gmail.com

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.