🤗 The Future of Natural Language Processing - Model Size and Computational Efficiency

The Future of Natural Language Processing, a slide deck and video published by HuggingFace, gives a good overview of recent NLP as a whole. In this post I summarize the parts of that session related to Model Size and Computational Efficiency, along with some thoughts of my own.

YouTube link / Slide link

Bigger Size and More Data!

The need for computational efficiency has grown rapidly since BERT. Even BERT Large, whose Transformer Encoder stack is 24 layers deep, has roughly 340M parameters, which makes real-time inference and serving hard, so companies are putting a lot of effort into this problem. Recent models go even further: T5 exceeds 1B parameters, and Meena has 2.6B.

Tools like ZeRO and Megatron can help, but model/data parallelism has become indispensable. The T5 paper describes it as follows:

Training large models can be non-trivial since they might not fit on a single machine and require a great deal of computation. As a result, we use a combination of model and data parallelism and train models on โ€œslicesโ€ of Cloud TPU Pods.

So why are big models a problem? They reduce research competition to a simple matter of resources and increase CO2 emissions. One may doubt whether it really becomes that simple, but the tweet below gives plenty to think about.

์•„๋งˆ๋„ ์„ฑ๋Šฅ ํ–ฅ์ƒ == ํฐ ๋ชจ๋ธ์ด ๊ตณ์–ด์ ธ ๊ฐ€๋Š” ๊ฒƒ์ด ๋น„๋ก ์ž˜ ๋˜๋”๋ผ๋„, ์•„์‰ฌ์šด ๋ชจ์–‘์ƒˆ์ธ ์‚ฌ๋žŒ์ด ๋งŽ์€๊ฐ€๋ณด๋‹ค. CO2 ๋ฐฐ์ถœ๋Ÿ‰์€ Strubell et al., 2019์„ ์‚ดํŽด๋ณด์ž. ํ•˜๋‚˜ ์˜ˆ์‹œ๋ฅผ ๋“ค๊ณ ์™€๋ณด์ž๋ฉด ๋‰ด์š•์—์„œ ์ƒŒํ”„๋ž€์‹œ์Šค์ฝ”๋ฅผ ๋น„ํ–‰ํ•  ๋•Œ CO2 ๋ฐฐ์ถœ๋Ÿ‰์ด 1984 lbs์ธ๋ฐ, V100 64์žฅ์„ ์‚ฌ์šฉํ•˜๊ณ  79์‹œ๊ฐ„์ด ์†Œ์š”๋œ BERT base ํ•™์Šต์€ 1438 lbs์˜ CO2๋ฅผ ๋ฐฐ์ถœํ•œ๋‹ค.

๊ทธ๋Ÿผ ๋ชจ๋ธ์„ ์ž‘๊ฒŒ ๋งŒ๋“ค์–ด๋ณผ๊นŒ?

๊ทธ๋Ÿผ ๋ชจ๋ธ์„ ์ž‘๊ฒŒ ๋งŒ๋“œ๋Š” ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์€ ์—†์„๊นŒ? ๋‹น์—ฐํžˆ ์žˆ๊ณ , Pruning์„ ์ง„ํ–‰ํ•˜๋Š” Lecun et al., 1989๋ถ€ํ„ฐ ์ฐพ์•„๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์•„๋ž˜์ฒ˜๋Ÿผ ๋งํ•œ๋‹ค.

By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification.

So what concrete methods are there for making models smaller? Distillation, Pruning, and Quantization, among others.

Distillation

Distillation is one of the most representative methods; it can be summarized roughly as "how do we train a Student model well using a Teacher model?". Since this is HuggingFace material, take DistilBERT as the example: a model 40% smaller and 60% faster that preserves 95% of BERT's performance. Research in this direction is still very active, e.g. Tsai et al., 2019, Turc et al., 2019, and Tang et al., 2019. TinyBERT, which I reviewed earlier, is another example.
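To make the Teacher/Student idea concrete, here is a minimal sketch of the standard distillation loss in PyTorch. It is not DistilBERT's exact recipe; the temperature `T` and mixing weight `alpha` are illustrative values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of (1) KL divergence between temperature-softened
    teacher and student distributions and (2) cross-entropy on hard labels."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```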

There also seems to be a fair amount of Self-Distillation work aimed at Adaptive Inference, much of it apparently inspired by research like Slimmable Neural Networks.

Pruning

The second method, Pruning, computes importance and removes the parts of the network that do not affect performance; think of it literally as pruning branches. Work that applies it to Multi-Head Attention, such as Elena Voita et al., 2019 and Paul Michel et al., 2019, continues to appear. Beyond that, Wang et al., 2019 use weight pruning to preserve 99% of the performance with only 65% of the parameters, and Fan et al., 2020, recently accepted at ICLR 2020, performs layer pruning that reduces Transformer depth.
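As a toy illustration of magnitude-based weight pruning (not the exact method of any paper above), PyTorch's `torch.nn.utils.prune` can zero out the smallest weights of a layer; the 65% ratio below merely echoes the figure quoted from Wang et al., 2019.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A single linear layer standing in for one weight matrix of a Transformer.
layer = nn.Linear(768, 768)

# Unstructured magnitude pruning: zero out the 65% of weights with the
# smallest absolute value (ratio chosen only for illustration).
prune.l1_unstructured(layer, name="weight", amount=0.65)

# The mask is applied on the fly; make it permanent to bake in the zeros.
prune.remove(layer, "weight")
print((layer.weight == 0).float().mean())  # ~0.65 of the entries are now zero
```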

Most networks are designed and studied with dense matrix multiplication in mind, but sparse models have also been getting a lot of attention recently. Personally, even though efficient sparse matrix operations on GPUs are hard, I would really like to see active research on sparse models from the viewpoint of model size and efficiency. It is clearly a difficult area, but if good performance can be maintained, it is obvious that we would see huge Nx gains in memory consumption and inference speed rather than incremental 1.nx "tune-up" improvements. For related work, OpenAI's Block-Sparse GPU Kernels post and Balanced Sparsity (Yao et al., 2018) are good places to start.
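To make the memory side of that argument concrete, here is a toy sketch with PyTorch's COO sparse tensors; it says nothing about GPU speed, which depends entirely on kernel support.

```python
import torch

# A weight matrix in which most entries have been pruned to zero.
dense_weight = torch.randn(768, 768)
dense_weight[torch.rand_like(dense_weight) < 0.9] = 0.0  # ~90% sparsity

# COO format stores only the non-zero entries, so memory roughly scales
# with nnz rather than rows x cols -- this is where the "Nx" savings live.
sparse_weight = dense_weight.to_sparse()

x = torch.randn(768, 32)
y = torch.sparse.mm(sparse_weight, x)  # sparse-dense matrix multiplication

print(sparse_weight.values().numel(), "stored values instead of", dense_weight.numel())
```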

Quantization

The third method is Quantization: downcast the tensors and then run the computation. Recently, going down to INT8 for CPU inference seems to have become the norm; the PyTorch tutorial (EXPERIMENTAL) DYNAMIC QUANTIZATION ON BERT and Q8BERT are good references.
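In the spirit of that PyTorch tutorial, a minimal dynamic-quantization sketch looks like the following; the `bert-base-uncased` checkpoint is only a placeholder for whatever fine-tuned model you actually serve.

```python
import torch
from transformers import BertForSequenceClassification

# Any fine-tuned torch.nn.Module works the same way; this checkpoint is a placeholder.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Dynamic quantization: weights of the listed module types are converted to INT8
# ahead of time, and activations are quantized on the fly at inference (CPU only).
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```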

๊ฐœ์ธ์ ์œผ๋กœ ์ƒ๊ฐํ•˜๋Š” ์ด ๋ถ„์•ผ์˜ ํ•ต์‹ฌ์€ โ€œ์–ด๋–ป๊ฒŒ Distribution์„ ์žƒ์ง€ ์•Š๊ณ  ์ž˜ ๋งคํ•‘ํ•  ์ˆ˜ ์žˆ์„๋ผ?โ€์ด๋‹ค. Quantization์€ Tensor๋“ค์„ Downcastํ•˜๋‹ˆ ์ •๋ณด๋ฅผ ์žƒ์„ ์ˆ˜ ๋ฐ–์— ์—†๋Š”๋ฐ, ์ด ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Fake Quantized Tensor๋ฅผ ์‚ฌ์šฉํ•˜๋Š” Quantization Aware Training์ด๋‚˜ Post Training Quantization์„ Calibration Data์™€ ํ•จ๊ป˜ ์ ์ ˆํ•œ Threshold๋ฅผ ์ฐพ์•„ outlier๋ฅผ ์ œ์™ธํ•˜๋Š” ๋ฐฉ์‹์˜ quantization์ด ์ž˜ ๋˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค.

What is still unfortunate is that, even though Intel CPUs hold a large share on AWS and TensorFlow is clearly strong for serving, TensorFlow's INT8 support is not yet very active. google/gemmlowp, one of the libraries TensorFlow uses, still only supports AVX2. Looking at cases like Bhandare et al., 2019, who implemented several operators themselves (apparently as custom ops) in order to use VNNI, I wish TensorFlow itself offered better support. TF Lite does support INT8 operations, but support for fine-grained control over int8 computation still looks weak.

---

It is a really interesting and well-explained video. Model compression is not even the main topic; the session also covers many other interesting subjects such as Continual and Meta Learning, Common Sense Questions, Out-of-domain Generalization, and NLU vs NLG.

May 11, 2020
Tags: nlp