[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Feature Encoder (CNN)

Encodes the raw audio $\mathcal{X}$ into latent speech representations $z_1, \cdots, z_T$.

Contextualized Representation (Transformer)

  • The masked $z_1, \cdots, z_T$ are fed through the Transformer to produce contextualized representations (each position attends to both past and future context)
  • Note: uses relative positional embeddings (a convolutional layer rather than fixed absolute embeddings)

Quantization module

  • wav2vec 2.0 represents the targets as a finite set of discrete representations, which works better than learning to reconstruct the continuous output
  • How to discretize? How do we map $z$ to a discrete representation?
  • Given $G$ codebooks, each containing $V$ vectors ($e \in \mathbb{R}^{V \times d / G}$)
  • Each $z$ selects one vector from every codebook; the chosen vectors are concatenated into $\mathbb{R}^{d}$
  • A final Linear layer maps the result to $q \in \mathbb{R}^f$
  • How is it trained?
  • Use the Gumbel softmax: differentiable
  • $z$ is mapped to logits $l \in \mathbb{R}^{G \times V}$
  • $p_{g,v}$: probability of choosing entry $v$ of codebook $g$
  • $$p_{g,v} = \frac{\exp\left( (l_{g,v} + n_v)/\tau \right)}{ \sum ^V _{k=1} \exp\left( (l_{g,k} + n_k) / \tau \right) }$$
  • $\tau$: temperature; $n = -\log(-\log(u))$ with $u$ sampled uniformly: Gumbel noise
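The quantizer above can be sketched in NumPy. This is a toy illustration, not the fairseq implementation; `gumbel_softmax_quantize` and its shapes are my own choices:

```python
import numpy as np

def gumbel_softmax_quantize(logits, codebooks, tau=1.0, seed=None):
    """Toy product quantizer. logits: (G, V) scores l for one frame z;
    codebooks: (G, V, d // G). Hypothetical helper, not the fairseq API."""
    rng = np.random.default_rng(seed)
    G, V = logits.shape
    n = -np.log(-np.log(rng.uniform(size=(G, V))))   # Gumbel noise
    y = (logits + n) / tau
    p = np.exp(y - y.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # p_{g,v} as in the formula
    idx = p.argmax(axis=1)                           # hard choice per codebook
    # (in training, gradients flow through p via the straight-through trick)
    chosen = codebooks[np.arange(G), idx]            # (G, d // G)
    return np.concatenate(chosen), p                 # concatenated entries in R^d

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(2, 4, 4))               # toy sizes: G=2, V=4, d=8
q, p = gumbel_softmax_quantize(rng.normal(size=(2, 4)), codebooks, seed=1)
print(q.shape)                                       # (8,)
```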

Training

Masking

  • Mask part of the feature encoder output $Z$ (masked positions are replaced by a learned mask vector)
  • Each time step is selected as a span start with probability $p$; the following $M$ time steps are masked (spans may overlap)
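A minimal sketch of this span-masking scheme (the function name and details are illustrative):

```python
import numpy as np

def sample_mask(T, p=0.065, M=10, seed=None):
    """Each time step becomes a span start with probability p; the span
    covers the next M steps. Spans may overlap. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    starts = np.flatnonzero(rng.uniform(size=T) < p)
    mask = np.zeros(T, dtype=bool)
    for t in starts:
        mask[t:t + M] = True                 # slicing clips at the sequence end
    return mask

mask = sample_mask(2000, seed=0)
print(f"{mask.mean():.2f} of time steps masked")   # close to 0.49 on average
```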

Objective

$$ \mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$$

Contrastive loss

$$ \mathcal{L}_m =  -\log \frac{ \exp( sim (c_t, q_t) / \kappa) }{ \sum _ {\tilde{q} \sim Q_t} \exp (sim (c_t , \tilde{q} )/ \kappa) } $$

  • $sim(a, b) = a^\top b / \lVert a \rVert \lVert b \rVert$: cosine similarity
  • $Q_t$: the positive $q_t$ plus $K$ negative samples (distractors), drawn from other masked time steps of the same utterance
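The contrastive term for a single masked time step can be written out in NumPy (a sketch; `contrastive_loss` and the dimensions are my own illustrative choices):

```python
import numpy as np

def contrastive_loss(c_t, q_t, negatives, kappa=0.1):
    """-log softmax over cosine similarities: the positive q_t first,
    followed by K distractors from other masked time steps. Sketch only."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    cands = np.vstack([q_t[None, :], negatives])      # (K + 1, d), positive at index 0
    sims = np.array([cos(c_t, q) for q in cands]) / kappa
    log_z = np.log(np.exp(sims - sims.max()).sum()) + sims.max()  # stable logsumexp
    return -(sims[0] - log_z)

rng = np.random.default_rng(0)
c_t, q_t = rng.normal(size=64), rng.normal(size=64)
negs = rng.normal(size=(100, 64))                     # K = 100 distractors
loss = contrastive_loss(c_t, q_t, negs)
```

The inner term is a log-softmax, so the loss is always non-negative, and it approaches 0 when $c_t$ aligns with $q_t$ and not with the distractors.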

Diversity Loss

  • Intuitively: encourage every entry in each codebook to be used, by maximizing the entropy of the averaged softmax distribution $\bar{p}_g$
  • Yields a better-quantized codebook
  • Without this constraint the model can collapse
  • i.e. every representation selects the same entry, which trivially minimizes the contrastive loss

$$\mathcal{L}_d =  \frac{1}{GV} \sum _{g=1} ^G - H(\bar{p}_g)  =  \frac{1}{GV} \sum ^G _ {g=1} \sum _{v=1} ^V \bar{p}_{g,v} \log \bar{p}_{g,v} $$
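The collapse failure mode is easy to see numerically (a sketch; `diversity_loss` is an illustrative name):

```python
import numpy as np

def diversity_loss(p_bar):
    """L_d over averaged softmax distributions p_bar of shape (G, V)."""
    G, V = p_bar.shape
    H = -(p_bar * np.log(p_bar + 1e-12)).sum(axis=1)  # entropy per codebook
    return -H.sum() / (G * V)

G, V = 2, 320
uniform = np.full((G, V), 1 / V)                   # every entry used equally
collapsed = np.zeros((G, V)); collapsed[:, 0] = 1  # everything picks entry 0

# uniform usage minimizes the loss; collapse drives it up to its maximum (0)
print(diversity_loss(uniform), diversity_loss(collapsed))
```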

Fine-tuning

In this paper, a single Linear layer is added on top of the pretrained wav2vec 2.0 model and trained with a CTC loss.

Experiments

  • Pretraining datasets: LS-960 (LibriSpeech) and LV-60k (LibriVox)

Pretraining

  • Fairseq
  • p = 0.065, M = 10, resulting in ~49% of all time steps being masked
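The 49% figure follows from the span parameters, ignoring edge effects: a step stays unmasked only if none of the $M$ positions ending at it was chosen as a span start.

```python
# Expected fraction of masked time steps, ignoring sequence-edge effects
p, M = 0.065, 10
frac = 1 - (1 - p) ** M        # P(some start among the M covering positions)
print(round(frac, 3))          # → 0.489, matching the ~49% reported
```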

Base model:

  • 12 Transformer layers
  • 768 model dim
  • 8 heads
  • 3072 inner (FFN) dim
  • Batches: crop each example to 250k audio samples (~15.6 s at 16 kHz)
  • batch at most 1.4m samples per GPU
  • 64 V100 GPUs for 1.6 days
  • Adam optimizer:
  • lr:  $5 \times 10^{-4}$
  • Base: 400k updates
  • G = 2, V = 320: $320^2 =$ 102.4k possible codewords
  • d/G = 128
  • K = 100 distractors
  • choose the training checkpoint with the lowest $\mathcal{L}_m$ on the validation set

Results

Low-Resource

  • The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves
    a word error rate of 5.2/8.6 on the LibriSpeech clean/other test sets.
  • Jointly learning discrete units and contextualized representations clearly improves over previous work

High-Resource

Ablation study

The biggest departure in wav2vec 2.0 is using quantized codewords as the targets of the contrastive loss.

  • Baseline: continuous input, quantized target
  • Continuous input: retains more information
  • Quantized target: more robust training
  • Continuous target: tends to overfit to detractors such as background information
