[Note] Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition


  • Self-supervised speech models (wav2vec, HuBERT) have been very successful
    • Fine-tuning the pre-trained model is computationally expensive and does not scale well: $O(10^8)$ parameters per task.


  • Apply adapters to the wav2vec2.0 model to reduce the number of parameters required for downstream tasks.
    • Adapters are small trainable modules that are inserted into the layers of a frozen pre-trained network for a particular task.
    • Fewer than 10% of the parameters per task, with little degradation in performance
  • First application of adapter modules to a self-supervised speech model for ASR.


  • Adapter module:
    • down-projection
    • FF (initialized as a near-identity function)
    • up-projection
    • skip connection
  • Inserted twice in each transformer layer (gave the best results)
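
The adapter structure listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, biases are left out, the middle step is reduced to a ReLU, and the near-identity start is obtained by zero-initializing the up-projection (one common choice, assumed here).

```python
import random

def make_adapter(d_model, d_bottleneck, seed=0):
    """Hypothetical helper: small random down-projection, zero up-projection."""
    rng = random.Random(seed)
    w_down = [[rng.gauss(0.0, 0.01) for _ in range(d_model)]
              for _ in range(d_bottleneck)]
    # Assumption: a zero up-projection makes the adapter an identity map at start.
    w_up = [[0.0] * d_bottleneck for _ in range(d_model)]
    return {"w_down": w_down, "w_up": w_up}

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def adapter_forward(adapter, x):
    h = matvec(adapter["w_down"], x)             # down-projection
    h = [max(0.0, v) for v in h]                 # non-linearity (ReLU assumed)
    out = matvec(adapter["w_up"], h)             # up-projection
    return [xi + oi for xi, oi in zip(x, out)]   # skip connection

x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, 1.0]
y = adapter_forward(make_adapter(d_model=8, d_bottleneck=4), x)
```

Because the up-projection starts at zero, `y` equals `x` before any training, so inserting the adapter does not disturb the frozen network's output at the start.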


  • Architecture: wav2vec2.0 Base (fairseq, LibriSpeech)
  • Adapter size: 256, applied to all transformer layers.
  • Corpus:
    • En: LibriLight (10 hr)
    • Fr: Common Voice (10 hr)
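
As a rough back-of-the-envelope check on the adapter budget, the arithmetic below counts the projection parameters for this setup. The hidden size of 768 and the 12 transformer layers are the standard wav2vec2.0 Base values and are not stated in these notes; layer norms and the task output layer are ignored, so the total is only an order-of-magnitude figure and is not meant to reproduce the reported percentages.

```python
def adapter_params(d_model, d_bottleneck):
    """Weights and biases of one down-projection plus one up-projection."""
    down = d_model * d_bottleneck + d_bottleneck
    up = d_bottleneck * d_model + d_model
    return down + up

D_MODEL = 768            # assumption: wav2vec2.0 Base hidden size
N_LAYERS = 12            # assumption: wav2vec2.0 Base transformer layers
ADAPTERS_PER_LAYER = 2   # from the notes: inserted twice per layer
ADAPTER_SIZE = 256       # from the notes

per_adapter = adapter_params(D_MODEL, ADAPTER_SIZE)
total = per_adapter * ADAPTERS_PER_LAYER * N_LAYERS
print(per_adapter, total)  # 394240 9461760
```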


Comparison of Fine-tuning and Adapters

  • Adapters are slightly worse on En and slightly better on Fr (multi-lingual)
    • The adapters are able to compensate for the language mismatch
  • Trained params: 95.6% (fine-tuning) vs. 9.2% (adapters).
  • Note: adapter experiments run faster than fine-tuning.
    • fewer params to train
    • fewer training steps to reach the optimum
    • no dependence on the freeze-steps hyperparameter

Bilingual wav2vec2.0


  • Prior work suggests that
    • lower layers: more generic speech features
    • upper layers: phone discrimination
  • Train only the top N layers
  • Top-N Adapter:
    • Use 4 layers of Adapter ~ 12 layers
    • 6 layers perform best (4.85% of params)
  • Top-N fine-tune:
    • 8 layers perform best (65.5% of params)
  • Only one layer trained:
    • Adapters perform significantly better than fine-tuning, which suggests that adapters make better use of the pre-trained knowledge
  • The curve shows that the top layers are more important
    • supporting the hypothesis (higher layers are more phonemic, lower layers more generic) ??
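
The Top-N setup above amounts to choosing which layer indices are trained (or receive adapters) while everything below stays frozen. A minimal sketch, assuming layers are indexed from 0 at the bottom; `top_n_layers` and `trainable_mask` are hypothetical helper names, not fairseq functions.

```python
def top_n_layers(n_layers, n):
    """Indices of the top n layers, with layer 0 at the bottom of the stack."""
    if not 0 <= n <= n_layers:
        raise ValueError("n must be between 0 and n_layers")
    return list(range(n_layers - n, n_layers))

def trainable_mask(n_layers, n):
    """True for layers that are trained or given adapters, False for frozen ones."""
    selected = set(top_n_layers(n_layers, n))
    return [i in selected for i in range(n_layers)]

adapter_layers = top_n_layers(12, 6)   # [6, 7, 8, 9, 10, 11]
finetune_layers = top_n_layers(12, 8)  # [4, 5, 6, 7, 8, 9, 10, 11]
mask = trainable_mask(12, 1)           # only the last layer is True
```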


  • Adapters are powerful enough to capture language information, so there is no need to fine-tune the full model
    • saves space (<10% of params)
    • easy to train (fast convergence)
  • Top-N adapter insertion performs better than inserting adapters in all 12 layers.