MIXOUT: Effective Regularization to Finetune Large-scale Pretrained Language Model

TL;DR

“Mixout” is a technique that stochastically mixes the parameters of two models (in this paper, two models are usually the pretrained model and the model that is in finetuning).
Applying mixout significantly stabilizes the results of finetuning BERT_large on small training sets.
Pytorch implementation from the author: https://github.com/bloodwass/mixout
Forward function is here (bloodwass/mixout/mixout.py#L59)

In this paper, the authors introduce a new regularization technique, “mixout”, motivated by “dropout”.
“Mixout” is a technique that stochastically mixes the parameters of two models.
The authors evaluated this via finetuning BERT_large on downstream tasks in GLUE.

The authors will provide a theoretical understanding of the dropout and its variants, and empirically verify with two experiments.
1. Train a fully-connected network on EMNIST Digits and finetune it on MNIST.
2. (Main Experiments) Finetune BERT_large on training sets of GLUE.
In the ablation study, the authors will perform three experiments.
1. The effect of the mixout on a sufficient number of training sets.
2. The effect of a regularization technique for an additional output layer which is not pre-trained.
3. The effect of the probability of mixout compared to dropout.

Mixconnect
- If the loss function is strongly convex, mixconnect term can act as an L2 regularizer term.
  - Check this link for a detailed description of Strong Convexity.

Mixout
- The authors propose the mixout as a special case of a mixconnect, which is motivated by the relationship between dropout and dropconnect.
- Mixout chooses a random mask matrix from Bernoulli(1 - p), so an L2 regularization coefficient is mp/(1 - p). (Check details in the paper)
  - It means that the probability of the mixout can adjust the strength of the L2 penalty.
Mixout for Pretrained Model
- When training from scratch, an initial model parameter is usually sampled from a normal/uniform distribution with mean 0 and small variance, but after training, the model parameter is away from the origin with a large t (training step). (Hoffer et al., 2017)
- Because we obtain pre-trained weight by training on a large corpus, it is often far away from the origin.
- Dropout L2-penalizes the model parameter for deviating away from the origin rather than the pre-trained weight.
- So, it should be better to use the mixout to explicitly prevent the deviation from the pre-trained weight.

Weight decay is an effective regularization technique to avoid catastrophic forgetting during finetuning(Wiese et al., 2017), and the authors suspect that the mixout has a similar effect with weight decay.
To verify, the authors pre-trained a fully-connected network and finetuned with replacing dropout with the mixout.
- Any regularization techniques such as weight decay are not used.
The result shows that the validation accuracy of the mixout has greater robustness to the choice of probability than that of dropout.

Notation
- Weight decay here means an L2 weight decay. (\(wdecay(u, \lambda) = \frac \lambda 2 {\lVert w - u \lVert}^2\), \(w\) is the weight to optimize.)

The authors choose RTE, MRPC, CoLA, and STS-B tasks because these tasks have been observed as unstable to finetune BERT_large (Phang et al., 2018).
The original regularization strategy (Devlin et al., 2018) for finetuning is using both dropout and \(wdecay(\textbf 0)\).
But mixout or \(wdecay(w_{pre})\) (\(w_{pre}\) is the pre-trained weight) cannot be used in the output layer because there is no pre-trained weight for the output layer.
So in this experiment, the regularization strategy for the output layer uses the dropout and \(wdecay(\textbf 0)\).
Figure 3 shows the results for four regularization strategies.
1. dropout 0.1 and \(wdecay(\textbf 0, 0.01)\) (Devlin et al., 2018)
2. \(wdecay(w_{pre}, 0.01)\) (Wiese et al., 2017)
3. mixout 0.7
4. 2 + 3
In short, applying mixout significantly stabilizes the results of finetuning BERT_large on small training sets regardless of whether using \(wdecay(w_{pre}, 0.01)\).

In section 3 (Analysis of Dropout and Its Generalization), the authors explained mixout does not differ from dropout when training a randomly initialized layer because weight is sampled from the distribution whose mean and variance are zero and small, respectively.
Since the expectation value of the initial weight is proportional to the dimensionality of the layer, mixout behaves differently from dropout when training from scratch.

Mixout with probability 0.7, 0.8, and 0.9 yields better average dev scores, and reduces the number of failed finetuning runs.
But finetuning using mixout takes more time than dropout. (843 seconds vs 636 seconds)

September 8, 2020

Tags: paper