[Note] Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

  • An extensive study of various methods for compressing BERT
    • Model size, accuracy, inference speed, device…
  • Shows the advantages and disadvantages of each method
  • Gives advice and research directions for future researchers

Compression methods:

Quantization

  • Reduce the number of unique values (bits) used to represent model weights and activations.
  • Naive approach:
    • Truncate each weight to the target bit width; this often yields a sizable drop in accuracy (quantization noise).
  • To work around this issue: identify the weights that suffer most from truncation (outliers) and do not truncate them during the quantization step.
  • Quantization-Aware Training (QAT)
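
A minimal NumPy sketch of the two ideas above: naive truncation versus keeping outliers in full precision. The symmetric uniform quantizer, the 3-sigma outlier rule, and the synthetic weights are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Naive symmetric uniform quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for 4-bit signed values
    scale = np.abs(w).max() / qmax           # largest weight maps to +/- qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                         # de-quantized values, to measure error

def quantize_keep_outliers(w, bits, n_sigma=3.0):
    """Quantize only the inlier weights; outliers stay in full precision.
    The n_sigma rule is a hypothetical choice for this sketch."""
    outlier = np.abs(w - w.mean()) > n_sigma * w.std()
    out = w.copy()
    out[~outlier] = quantize_uniform(w[~outlier], bits)
    return out, outlier

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)       # synthetic weights
w[:10] = 0.5                                 # a few large-magnitude outliers

naive = quantize_uniform(w, bits=4)
mixed, mask = quantize_keep_outliers(w, bits=4)
err_naive = np.mean((w - naive) ** 2)        # quantization noise, naive
err_mixed = np.mean((w - mixed) ** 2)        # quantization noise, outliers kept
```

With a single scale, the outliers stretch the quantization grid so that most small weights collapse to zero. Excluding them lets the grid cover the bulk of the distribution more finely.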


Pruning

Identify and remove redundant or less important weights and/or components.

  • Unstructured Pruning (sparse pruning)
    • Prune the individual weights that are least important
  • Structured Pruning: prune blocks of weights
    • Attention head pruning:
      • The importance of individual heads has often been questioned; good accuracy is possible with only 1-2 attention heads.
      • Randomly prune heads during training
    • Encoder unit pruning:
      • Layer dropout
    • Embedding size pruning
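
The difference between unstructured and structured pruning can be sketched in NumPy as follows. Weight magnitude and per-head L2 norm are used here as stand-in importance scores. That choice, the function names, and the tensor shapes are assumptions for illustration; the surveyed methods may score importance differently.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of weights with smallest |w|.
    The result has the same shape; only individual entries become zero."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def prune_heads(w_heads, keep):
    """Structured pruning: drop whole attention heads.
    w_heads has shape (num_heads, head_dim, hidden); the L2 norm of each
    head is a hypothetical importance proxy."""
    scores = np.linalg.norm(w_heads.reshape(w_heads.shape[0], -1), axis=1)
    kept = np.sort(np.argsort(scores)[-keep:])   # indices of the strongest heads
    return w_heads[kept], kept

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
sparse_w = magnitude_prune(w, sparsity=0.9)      # same shape, mostly zeros

heads = rng.normal(size=(12, 8, 32))
small_heads, kept = prune_heads(heads, keep=2)   # a genuinely smaller tensor
```

Unstructured pruning leaves a dense-shaped matrix full of zeros, while structured pruning returns a smaller dense tensor.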

Knowledge distillation

Train a small model (student) to learn from a larger pre-trained model (teacher). The loss function can take multiple forms: KL divergence, cross-entropy (CE), mean absolute error (MAE).

  • Distillation from output logits: soft labels. The student model does not need to be a smaller BERT or even a Transformer.
  • Distillation from Encoder Outputs: the output tensors of each encoder unit may contain meaningful semantic and contextual relationships.
  • Distillation from Attention Maps: contextual relations between tokens.
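
As a rough sketch of distillation from output logits, the loss below mixes a KL term on temperature-softened teacher and student distributions with the usual hard-label cross-entropy. The temperature, the `alpha` weighting, and the `temperature ** 2` scaling are common conventions assumed here, not values taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * soft-label KL term + (1 - alpha) * hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [4.0, 1.0, 0.5]                            # made-up logits
loss_close = distillation_loss([3.5, 1.2, 0.4], teacher, label=0)
loss_far = distillation_loss([0.2, 3.0, 1.0], teacher, label=0)
```

Only the logits enter the loss, so the student can be any architecture that produces the same output classes.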

Matrix Decomposition

No model size reduction, but improves runtime cost and speed.

  • Weight Matrix Decomposition: low-rank factorization
  • Attention Decomposition:
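
A small NumPy sketch of low-rank weight decomposition through truncated SVD, focused on the multiply-add count of a single matrix-vector product. The matrix is synthetic and exactly low-rank by construction; real weight matrices are only approximately low-rank, so the factorization would introduce some error there.

```python
import numpy as np

def low_rank_factors(w, rank):
    """Truncated SVD: W (m x n) ~= A (m x rank) @ B (rank x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
m, n, rank = 256, 256, 16
# Synthetic weight matrix with rank exactly 16 (an assumption for the demo).
w = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

a, b = low_rank_factors(w, rank)
x = rng.normal(size=n)

y_full = w @ x                        # m * n multiply-adds
y_low = a @ (b @ x)                   # rank * (m + n) multiply-adds

macs_full = m * n
macs_low = rank * (m + n)
```

The saving depends on the order of evaluation: `a @ (b @ x)` never forms the full m x n matrix.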

Dynamic Inference Acceleration

  • Early Exit Ramps
  • Progressive Word Vector Elimination: the information of the entire sentence must have been fused into certain tokens, so the other word vectors can be eliminated
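
One way to picture early exit ramps: attach a small classifier after each encoder layer and stop at the first one whose prediction is confident enough. The entropy criterion, the threshold, and the function names below are assumptions for this sketch. The per-layer logits are passed in precomputed to keep it short, whereas a real model would compute each layer only if the previous ramp did not exit.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_predict(ramp_logits, threshold):
    """ramp_logits: classifier logits after each layer, in layer order.
    Returns (predicted_class, exit_layer). Exits at the first layer whose
    prediction entropy is below `threshold`, else at the last layer."""
    for layer, logits in enumerate(ramp_logits):
        probs = softmax(logits)
        is_last = layer == len(ramp_logits) - 1
        if entropy(probs) < threshold or is_last:
            return probs.index(max(probs)), layer
    raise ValueError("ramp_logits must be non-empty")

# Made-up logits: the model grows more confident with depth.
ramps = [[0.1, 0.0], [0.5, 0.2], [4.0, -4.0], [6.0, -6.0]]
prediction, exit_layer = early_exit_predict(ramps, threshold=0.1)
```

Easy inputs leave at a shallow ramp and hard inputs run the full stack, so the speedup varies per example.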


Other methods

  • Parameter Sharing: ALBERT
  • Embedding Matrix Compression: codebook
  • Weight Squeezing: learn the weight transformation from the teacher
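
To get a feel for what ALBERT-style parameter sharing saves, the arithmetic below counts encoder parameters with and without sharing one layer's weights across the stack. The sizes (hidden 768, feed-forward 3072, 12 layers) are BERT-base-like values assumed for illustration, and embeddings are ignored.

```python
def encoder_layer_params(hidden, ffn):
    """Parameter count of one Transformer encoder layer (weights + biases)."""
    attention = 4 * (hidden * hidden + hidden)     # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                 # two LayerNorms: scale + bias
    return attention + feed_forward + layer_norms

hidden, ffn, num_layers = 768, 3072, 12            # assumed BERT-base-like sizes

per_layer = encoder_layer_params(hidden, ffn)
unshared = num_layers * per_layer                  # every layer has its own weights
shared = per_layer                                 # all layers reuse one set
```

Sharing shrinks the stored model, but every layer is still executed at inference time.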


Takeaways

  • Quantization and unstructured pruning can help reduce model size but do not improve speed or memory use on their own
    • the gains need a suitable device (specialized hardware or libraries)
  • KD has shown great affinity to a variety of student models and is orthogonal to other methods
  • BiLSTM and CNN students are faster and smaller
  • Compound various compression methods together to achieve a truly practical model