Table of Contents
- An extensive study of various methods for compressing BERT
- Compares model size, accuracy, inference speed, target device…
- Shows the advantages and disadvantages of each method
- Gives advice and research directions for future researchers
Compression methods:
Quantization
- reducing the number of bits (unique values) used to represent model weights and activations (see the sketch after this list)
- Naive approach:
- truncate each weight to the target bit width, which often yields a sizable drop in accuracy (quantization noise)
- To work around this issue: identify the weights most sensitive to truncation (outliers) and skip them during the quantization step
- Quantization-Aware Training (QAT): simulate quantization during training so the model learns to compensate for the quantization noise
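A minimal sketch of the idea, assuming PyTorch and a Hugging Face checkpoint (names are illustrative): post-training dynamic quantization stores the weights of every nn.Linear in int8 and dequantizes on the fly. This is the simplest post-training variant; QAT would instead insert simulated quantization ops during fine-tuning.

```python
import io
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint name; any fine-tuned BERT classifier works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8 and
# dequantized on the fly; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def state_dict_bytes(m: torch.nn.Module) -> int:
    """Serialize the state dict to memory and measure its size."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell()

print(f"fp32 model:  {state_dict_bytes(model) / 1e6:.1f} MB")
print(f"int8 linear: {state_dict_bytes(quantized) / 1e6:.1f} MB")
```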

Pruning
To identify and remove redudant or less important weights and/or components
- Unstructured Pruning (sparse pruning)
- prune the individual weights that are least important, e.g., by magnitude (see the sketch after this list)
- Structured Pruning: remove whole blocks of weights or components
- Attention head pruning:
- the importance of the heads has often been questioned; comparable accuracy is often possible with only 1-2 attention heads per layer
- heads can also be pruned randomly during training
- Encoder unit pruning:
- Layer dropout: randomly drop entire encoder layers during training, so a shallower model can be used at inference
- Embedding size pruning
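A minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune, assuming a Hugging Face BERT checkpoint (illustrative); the weights stay in dense tensors, so the zeros pay off only with sparse storage or kernels.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModel

# Illustrative checkpoint; the loop works on any model built from nn.Linear layers.
model = AutoModel.from_pretrained("bert-base-uncased")

# Unstructured (sparse) magnitude pruning: zero out the 30% of weights with the
# smallest absolute value in every linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

# The tensors keep their dense shape, so the zeros save space/time only when the
# model is later stored or executed in a sparse format.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")
```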

Knowledge distillation
training a small model (student) to learn from a larger pre-trained model (teacher). There are multiple forms of loss function: KL divergence, CE, MAE
- Distillation from output logits: soft labels; the student does not need to be a smaller BERT or even a Transformer (see the sketch after this list)
- Distillation from Encoder Outputs: the output tensors of each encoder unit may contain meaningful semantic and contextual relationships
- Distillation from Attention Maps: contextual relations between tokens
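A minimal sketch of distillation from output logits (soft labels), assuming PyTorch; the temperature T and mixing weight alpha are illustrative hyperparameters, and the loss follows the standard softened-KL plus hard-label cross-entropy recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label distillation: KL divergence between the temperature-softened
    teacher and student distributions, mixed with the usual cross-entropy on
    the hard labels."""
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: batch of 4 examples, 3 classes; in practice the logits come from
# the teacher and student forward passes on the same inputs.
teacher_logits = torch.randn(4, 3)
student_logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```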

Matrix Decomposition
no model size reduction, but improves runtime cost and speed
- Weight Matrix Decomposition: approximate weight matrices with low-rank factors (see the sketch after this list)
- Attention Decomposition:
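A minimal sketch of low-rank weight matrix decomposition via truncated SVD in PyTorch; the function name and choice of rank are illustrative, and in practice the factorized model is usually fine-tuned afterwards to recover accuracy.

```python
import torch

def low_rank_factorize(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    """Replace one Linear layer (weight W of shape out x in) with two smaller
    Linears whose product is the best rank-r approximation of W (truncated SVD)."""
    W = linear.weight.data                                     # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = torch.nn.Linear(linear.in_features, rank, bias=False)
    B = torch.nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    A.weight.data = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()              # (out, rank)
    if linear.bias is not None:
        B.bias.data = linear.bias.data.clone()
    return torch.nn.Sequential(A, B)

# Usage: a BERT-sized feed-forward projection (768 -> 3072) at rank 128 uses
# 128 * (768 + 3072) weights instead of 768 * 3072.
layer = torch.nn.Linear(768, 3072)
approx = low_rank_factorize(layer, rank=128)
x = torch.randn(2, 16, 768)
print((layer(x) - approx(x)).abs().max())
```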

Dynamic Inference Acceleration
- Early Exit Ramps: attach classifiers to intermediate encoder layers and exit as soon as a prediction is confident enough (see the sketch after this list)
- Progressive Word Vector Elimination: word vectors can be dropped in later layers because the information of the entire sentence progressively fuses into certain tokens
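A minimal sketch of early exit ramps, assuming PyTorch and batch size 1; the wrapper class, the entropy threshold, and the use of generic encoder layers are all illustrative (a real BERT layer also takes an attention mask and returns a tuple).

```python
import torch
import torch.nn.functional as F

class EarlyExitEncoder(torch.nn.Module):
    """Hypothetical early-exit wrapper: an 'exit ramp' classifier sits after
    every encoder layer, and inference stops as soon as a ramp is confident
    enough (prediction entropy below a threshold)."""
    def __init__(self, layers, hidden_size, num_labels, threshold=0.3):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        self.ramps = torch.nn.ModuleList(
            torch.nn.Linear(hidden_size, num_labels) for _ in layers
        )
        self.threshold = threshold

    def forward(self, hidden):                       # hidden: (1, seq_len, dim)
        logits = None
        for layer, ramp in zip(self.layers, self.ramps):
            hidden = layer(hidden)
            logits = ramp(hidden[:, 0])              # classify from the first token
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
            if entropy.item() < self.threshold:      # confident: exit early
                break
        return logits

# Toy usage with generic encoder layers standing in for BERT encoder units.
layers = [torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
          for _ in range(4)]
model = EarlyExitEncoder(layers, hidden_size=64, num_labels=3)
print(model(torch.randn(1, 10, 64)))
```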

Other
- Parameter Sharing: ALBERT reuses one set of encoder-layer parameters across all layers (see the sketch after this list)
- Embedding Matrix Compression: e.g., a codebook that replaces embedding vectors with indices into a small set of shared vectors
- Weight Squeezing: learn a transformation that maps the teacher's weights to the student's weights
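A minimal sketch of ALBERT-style cross-layer parameter sharing, assuming PyTorch; the class and the layer choice are illustrative, showing only the core idea that one layer's parameters are reused at every depth step.

```python
import torch

class SharedLayerEncoder(torch.nn.Module):
    """ALBERT-style cross-layer parameter sharing (sketch): a single encoder
    layer's parameters are reused at every depth step, so extra depth adds
    compute but no extra parameters."""
    def __init__(self, layer: torch.nn.Module, depth: int):
        super().__init__()
        self.layer = layer        # one shared layer instead of `depth` copies
        self.depth = depth

    def forward(self, hidden):
        for _ in range(self.depth):
            hidden = self.layer(hidden)
        return hidden

# Usage: 12 layers' worth of compute with the parameter count of a single layer.
block = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = SharedLayerEncoder(block, depth=12)
out = encoder(torch.randn(1, 16, 768))
print(out.shape, sum(p.numel() for p in encoder.parameters()))
```
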
Advice
- Quantization and unstructured pruning can help reduce model size, but they do not improve speed or runtime memory without hardware or library support for low-precision/sparse execution
- Choose a compression method that suits the target device
- KD has shown great affinity to a wide variety of student models and is orthogonal to (combinable with) other methods
- BiLSTM- and CNN-based students are faster and smaller than Transformer students
- Compound several compression methods to achieve a truly practical model