
- https://arxiv.org/abs/2006.11477
- Self-supervised speech representation
- contrastive loss: masked continuous speech input -> quantized target
- quantized module: gumbel softmax(latent representation codebooks)
- wav2vec 2.0 LARGE with 10 min of labeled data: 5.2/8.6 WER on the LibriSpeech test-clean/other sets
- Fairseq
- Well explained: https://neurosys.com/wav2vec-2-0-framework

Feature Encoder (CNN)
Encodes raw audio into latent speech representations z_t (minimal sketch below).
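A minimal PyTorch sketch of such a CNN feature encoder, assuming the paper's seven blocks with strides (5,2,2,2,2,2,2), kernels (10,3,3,3,3,2,2) and 512 channels (≈20 ms frame rate at 16 kHz); the normalization layers of the real fairseq implementation are left out:
```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Map a raw waveform to latent speech representations z_t."""
    def __init__(self, dim=512):
        super().__init__()
        specs = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
        layers, in_ch = [], 1
        for k, s in specs:                       # (kernel, stride) per block
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.GELU()]
            in_ch = dim
        self.conv = nn.Sequential(*layers)

    def forward(self, wav):                      # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))          # (batch, dim, frames)
        return z.transpose(1, 2)                 # z_t: (batch, frames, dim)
```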
Contextualized representations (Transformer)
- The masked latent representations are fed to a Transformer, which produces contextualized representations c_t (attending to both past and future context); see the sketch after this list
- Note: uses a convolutional layer acting as a relative positional embedding instead of absolute positional embeddings
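A rough sketch of the context network using stock PyTorch Transformer layers; the grouped convolution stands in for the relative positional embedding, and its kernel/group sizes here are illustrative rather than the exact fairseq values:
```python
import torch.nn as nn

class ContextNetwork(nn.Module):
    """Produce contextualized representations c_t from (masked) latent frames."""
    def __init__(self, dim=768, layers=12, heads=8, ffn=3072):
        super().__init__()
        # convolution over the sequence acts as a relative positional embedding
        self.pos_conv = nn.Conv1d(dim, dim, kernel_size=128, padding=64, groups=16)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=ffn,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, z):                        # z: (batch, T, dim)
        pos = self.pos_conv(z.transpose(1, 2)).transpose(1, 2)[:, :z.size(1)]
        return self.encoder(z + pos)             # c_t: (batch, T, dim)
```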
Quantization module
- wav2vec 2.0 represents the targets as a finite set of discrete speech representations, which works better than learning to predict a continuous target
- How is it discretized? How is z_t turned into a discrete representation q_t?
- Given G codebooks, each containing V vectors (e ∈ R^{V × d/G})
- For each z_t, one vector is selected from every codebook and the selections are concatenated
- A final linear layer maps the concatenation to q_t ∈ R^f
- How is it trained?
- Use the Gumbel softmax: makes the hard selection differentiable (sketch below)
- z is mapped to logits l ∈ R^{G×V}
- p_{g,v} = exp((l_{g,v} + n_v)/τ) / Σ_{k=1..V} exp((l_{g,k} + n_k)/τ): probability of choosing entry v of codebook g
- τ: temperature; n = -log(-log(u)) with u ~ Uniform(0,1): Gumbel noise
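A minimal sketch of the Gumbel-softmax quantizer with the sizes from these notes (G = 2, V = 320, d/G = 128); `F.gumbel_softmax(..., hard=True)` gives the straight-through estimator (hard one-hot selection forward, soft gradients backward):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Product quantization: pick one entry per codebook, concatenate, project."""
    def __init__(self, in_dim=512, G=2, V=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.G, self.V = G, V
        self.to_logits = nn.Linear(in_dim, G * V)            # z -> l in R^{G x V}
        self.codebooks = nn.Parameter(torch.randn(G, V, entry_dim))
        self.proj = nn.Linear(G * entry_dim, out_dim)        # concat -> q_t

    def forward(self, z, tau=2.0):                           # z: (batch, T, in_dim)
        B, T, _ = z.shape
        logits = self.to_logits(z).view(B, T, self.G, self.V)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        picked = torch.einsum('btgv,gvd->btgd', onehot, self.codebooks)
        return self.proj(picked.reshape(B, T, -1))           # q_t: (batch, T, out_dim)
```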
Training
Masking
- A proportion of the feature encoder outputs z_t is masked before being fed to the Transformer (masked positions are replaced by a shared, learned mask embedding)
- Each time step is chosen as a span start with probability p, and the M time steps that follow are masked (spans may overlap); see the sketch after this list
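A small sketch of the span-masking step, assuming the paper's p = 0.065 and M = 10 as defaults:
```python
import torch

def sample_span_mask(batch, T, p=0.065, M=10):
    """Return a boolean (batch, T) mask: every time step is a span start with
    probability p, and the M following steps are masked (spans may overlap).
    Masked frames are later replaced by a learned mask embedding."""
    starts = torch.rand(batch, T) < p
    mask = torch.zeros(batch, T, dtype=torch.bool)
    for b, t in starts.nonzero(as_tuple=False).tolist():
        mask[b, t:t + M] = True
    return mask
```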
Objective
Contrastive loss
- L_m = -log( exp(sim(c_t, q_t)/κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃)/κ) ), where κ is a temperature
- sim(a, b) = aᵀb / (‖a‖‖b‖): cosine similarity
- Q_t: the 1 positive q_t plus K negative samples (distractors), drawn uniformly from other masked time steps of the same utterance (loss sketch after this list)
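A per-time-step sketch of this loss; `kappa` plays the role of the temperature κ, and the default value here is an assumption of the sketch:
```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_pos, q_neg, kappa=0.1):
    """c_t: (dim,) Transformer output at a masked step; q_pos: (dim,) its true
    quantized target; q_neg: (K, dim) distractors from other masked steps of
    the same utterance. Returns -log p(positive) over the K+1 candidates."""
    candidates = torch.cat([q_pos.unsqueeze(0), q_neg], dim=0)       # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    return -F.log_softmax(sims, dim=0)[0]
```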
Diversity Loss
- In short: push every entry of every codebook to be selectable, by maximizing the entropy of the softmax distribution over codebook entries averaged over a batch: L_d = (1/GV) Σ_g -H(p̄_g) = (1/GV) Σ_g Σ_v p̄_{g,v} log p̄_{g,v} (sketch after this list)
- Yields better-utilized quantized codebooks
- Without this constraint the model degenerates
- i.e. all representations pick the same entry, which trivially minimizes the contrastive loss
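A sketch of this diversity term, assuming `probs` holds the (noise-free) softmax distributions over codebook entries for all frames in a batch:
```python
import torch

def diversity_loss(probs):
    """probs: (N, G, V) softmax over the V entries of each of the G codebooks
    for N frames. Returns L_d = (1/GV) * sum_g sum_v p̄_{g,v} log p̄_{g,v},
    i.e. minus the averaged entropy, so minimizing it spreads codebook usage."""
    p_bar = probs.mean(dim=0)                          # batch-averaged p̄: (G, V)
    G, V = p_bar.shape
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * V)
```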
Fine-tuning
In this paper, a single linear layer is added on top of the pre-trained wav2vec 2.0 model and trained with the CTC loss (sketch below).
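A sketch of this fine-tuning setup with PyTorch's built-in `nn.CTCLoss`; `encoder` stands for the pre-trained wav2vec 2.0 model, and the names and vocabulary size are this sketch's assumptions, not the paper's code:
```python
import torch
import torch.nn as nn

class CTCFineTuner(nn.Module):
    """Pre-trained encoder + one linear projection onto the output vocabulary."""
    def __init__(self, encoder, dim=768, vocab_size=32):
        super().__init__()
        self.encoder = encoder                        # returns (batch, T, dim)
        self.head = nn.Linear(dim, vocab_size)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, wav, targets, target_lens):
        feats = self.encoder(wav)
        log_probs = self.head(feats).log_softmax(-1)  # (batch, T, vocab)
        input_lens = torch.full((wav.size(0),), log_probs.size(1), dtype=torch.long)
        # nn.CTCLoss expects (T, batch, vocab)
        return self.ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
```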
Experiments
- Pre-training datasets:
- LS-960
- LV-60K
Pretraining
- Fairseq
- p = 0.065, M = 10, which results in roughly 49% of all time steps being masked
Base model (summarized in a config sketch after this list):
- 12 Transformer layers
- 768 model dim
- 8 head
- 3072 inner dim
- Batching: each example is cropped to at most 250k audio samples (≈15.6 s)
- examples are batched up to 1.4M samples per GPU
- 64 V100 GPU for 1.6 days
- Adam optimizer:
- lr:
- Base: 400k updates
- G = 2, V = 320 → 320² = 102.4k possible codewords
- d/G = 128
- K = 100
- choose the training checkpoint with the lowest contrastive loss L_m on the validation set
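The BASE hyperparameters above, collected into a small config sketch (field names are this sketch's own, not fairseq's):
```python
from dataclasses import dataclass

@dataclass
class Wav2Vec2BaseConfig:
    transformer_layers: int = 12
    model_dim: int = 768
    attention_heads: int = 8
    ffn_inner_dim: int = 3072
    mask_prob: float = 0.065       # p
    mask_length: int = 10          # M
    num_codebooks: int = 2         # G
    codebook_size: int = 320       # V -> 320^2 = 102.4k codewords
    codebook_entry_dim: int = 128  # d/G
    num_negatives: int = 100       # K
    max_updates: int = 400_000
```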
Results
Low-Resource
- The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a word error rate of 5.2/8.6 on the LibriSpeech clean/other test sets
- Jointly learning discrete units and contextualized representations clearly improves over previous two-step approaches

High-Resource

Ablation study
wav2vec 2.0's biggest difference from prior work is that the targets of the contrastive loss are quantized codewords, while the Transformer input stays continuous
- Baseline: continuous input, quantized target
- Continuous input: retains more information
- Quantized target: more robust training
- Continuous target
- lets the model exploit detailed artifacts such as speaker and background information, making the task easier and the learned representation weaker

Tools:
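Besides Fairseq (used for the original training), pre-trained and fine-tuned checkpoints are also packaged in torchaudio's `pipelines` module; a minimal inference sketch, assuming a mono clip `speech.wav`:
```python
import torch
import torchaudio

# BASE model pre-trained on LS-960 and fine-tuned on 10 min of labels
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
model = bundle.get_model().eval()

wav, sr = torchaudio.load("speech.wav")                       # hypothetical file
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(wav)        # (batch, frames, vocab) CTC emission logits
# feed `emissions` to a CTC decoder (greedy argmax + collapse, or beam search)
```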