https://arxiv.org/abs/2202.03218
- Self-supervised learning models (wav2vec 2.0, HuBERT) have achieved great success.
- Fine-tuning the pretrained model is computationally expensive and does not scale well: $O(10^8)$ parameters per task.
Contributions:
- Apply adapters to the wav2vec 2.0 model to reduce the number of parameters required for down-stream tasks.
- Adapters are small trainable modules that can be inserted into the layers of a frozen pre-trained network for a particular task.
- Fewer than 10% of the parameters per task, with little degradation in performance.
- First work to apply adapter modules to a self-supervised speech model for ASR.
Proposed Method
- Adapter module (see the sketch below):
    - down-projection
    - FF non-linearity (the module is initialized as a near-identity function)
    - up-projection
    - skip connection
    - Inserted twice in each transformer layer (gave the best results)
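
A minimal PyTorch sketch of the adapter block described above (the GELU non-linearity, layer-norm placement, and exact init scale are assumptions, not details taken from the paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection -> non-linearity -> up-projection,
    with a skip connection around the whole block."""

    def __init__(self, hidden_dim: int, adapter_dim: int = 256):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, adapter_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_dim, hidden_dim)
        # Near-identity initialization: the up-projection starts near zero,
        # so the adapter output is initially dominated by the skip path.
        nn.init.normal_(self.up.weight, std=1e-3)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.layer_norm(x)
        x = self.up(self.act(self.down(x)))
        return residual + x  # skip connection
```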

Experiments
- Architecture: wav2vec 2.0 Base (fairseq release, pretrained on LibriSpeech)
- Adapter size: 256, applied to all transformer layers
- Corpus:
    - En: LibriLight (10 hr)
    - Fr: Common Voice (10 hr)
Results
Comparison of fine-tuning and adapters
- Slightly worse on En, slightly better on Fr (multi-lingual / language-mismatch setting)
- The adapters are able to compensate for the language mismatch
- Trained parameters: 95.6% (fine-tuning) vs. 9.2% (adapters); a parameter-counting sketch follows this list
- Note: adapter experiments run more quickly than fine-tuning:
    - fewer parameters to train
    - fewer training steps needed to reach the optimum
    - no dependence on the freeze-steps hyper-parameter
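
A rough sketch of how the trainable-parameter fraction above could be measured once the backbone is frozen; the helper names and the "adapter"-in-parameter-name convention are assumptions, not the paper's code:

```python
import torch.nn as nn

def freeze_except_adapters(model: nn.Module) -> None:
    """Freeze the pre-trained backbone; leave only adapter parameters trainable.
    Assumes adapter submodules carry 'adapter' in their parameter names."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that receive gradients
    (roughly 9.2% with adapters vs. 95.6% with full fine-tuning in the paper)."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```

Calling `freeze_except_adapters` before training is what keeps the trainable share below 10%.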

Bilingual wav2vec 2.0

Ablations
- Prior work suggests that:
    - lower layers: more generic speech features
    - upper layers: phone discrimination
- Trained only the top N layers
    - Top-N adapters (see the sketch after this list):
        - adapters inserted into only the top N layers, for N from 4 to 12
        - N = 6 performs best (4.85% of params)
    - Top-N fine-tuning:
        - N = 8 performs best (65.5% of params)
    - With only one layer trained, adapters perform significantly better than fine-tuning, showing that adapters are better able to utilize the pre-training knowledge
    - The curves show that the top layers are more important, supporting the hypothesis (higher layers are more phonemic, lower layers are more generic)
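
A sketch of inserting adapters into only the top N transformer layers, reusing the `Adapter` class from the earlier sketch; the wrapper approach and function names are assumptions, not the paper's implementation:

```python
import torch.nn as nn

class LayerWithAdapter(nn.Module):
    """Wrap a (frozen) transformer layer and pass its output through an adapter.
    The paper inserts two adapters per layer; one is shown here for brevity."""

    def __init__(self, layer: nn.Module, hidden_dim: int, adapter_dim: int = 256):
        super().__init__()
        self.layer = layer
        self.adapter = Adapter(hidden_dim, adapter_dim)  # Adapter class from the sketch above

    def forward(self, hidden_states, *args, **kwargs):
        out = self.layer(hidden_states, *args, **kwargs)
        # Some implementations return a tuple (hidden_states, ...): adapt only the hidden states.
        if isinstance(out, tuple):
            return (self.adapter(out[0]),) + out[1:]
        return self.adapter(out)

def insert_adapters_top_n(layers: nn.ModuleList, n: int, hidden_dim: int) -> None:
    """Insert adapters into only the top n transformer layers (e.g. n = 6 of 12)."""
    for i in range(len(layers) - n, len(layers)):
        layers[i] = LayerWithAdapter(layers[i], hidden_dim)
```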

Note:
- Adapters are powerful enough to capture language information; fine-tuning the full model is not necessary
    - saves space (<10% of parameters per task)
    - easy to train (fast convergence)
- Top-N adapter insertion performs better than inserting adapters into all 12 layers