[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Feature Encoder(CNN)

Encodes the raw audio $ \mathcal{X} $ into latent speech representations $z_1, \cdots, z_T$.
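
To get a feel for the length $T$ of the latent sequence, the sketch below counts the frames the CNN produces for a given number of raw samples. The kernel widths and strides are the ones reported in the paper (this note does not state them), zero padding is assumed, and `num_frames` is a name made up for illustration.

```python
# Kernel widths and strides of the 7 conv layers, as reported in the paper
# (assumptions: no padding, 16 kHz input).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples: int) -> int:
    """Number of latent frames T produced from `num_samples` raw samples."""
    length = num_samples
    for k, s in zip(KERNELS, STRIDES):
        if length < k:
            return 0  # input too short for this layer
        length = (length - k) // s + 1
    return length

print(num_frames(16000))  # 49 frames for one second of 16 kHz audio
```

Under these assumptions each $z_t$ covers 400 samples (25 ms) and consecutive frames are 320 samples (20 ms) apart.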

Contextualized representation(Transformers)

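The latent representations $z_1, \cdots, z_T$ are masked and then passed through a Transformer, which produces contextualized representations $c_1, \cdots, c_T$ (each one sees both past and future context).
Note: uses relative positional embeddings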
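
The masking step can be sketched as follows. The paper samples a proportion $p$ of all time steps as span starts and masks the $M$ consecutive steps from each start, with spans allowed to overlap. The values $p = 0.065$ and $M = 10$ are taken from the paper rather than from this note, and `sample_mask` is a hypothetical helper. In the actual model the masked $z_t$ are replaced by a learned vector, which this sketch leaves out.

```python
import random

def sample_mask(T: int, p: float = 0.065, M: int = 10, seed: int = 0) -> list:
    """Return a length-T list of booleans; True marks a masked time step."""
    rng = random.Random(seed)
    num_starts = max(1, round(p * T))          # assumption: at least one span
    starts = rng.sample(range(T), num_starts)  # sampled without replacement
    mask = [False] * T
    for s in starts:
        for t in range(s, min(s + M, T)):      # spans may overlap or hit the end
            mask[t] = True
    return mask

mask = sample_mask(T=100)
print(sum(mask), "of", len(mask), "time steps masked")
```

Because spans overlap and can be cut off at the end of the utterance, the masked fraction is lower than $p \cdot M$.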

Quantization module

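wav2vec 2.0 represents the targets with a finite set of representations, which works better than learning to reconstruct a continuous output.
How is $z$ discretized, i.e. expressed as a discrete representation?
Given $G$ codebooks, each codebook holds $V$ vectors ($e \in \mathbb{R}^{V \times d / G}$).
Each $z$ selects one vector from every codebook, and the selected vectors are concatenated into a vector in $\mathbb{R}^{d}$.
Finally, a linear layer maps the result to $q \in \mathbb{R}^f$.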
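
The lookup described above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the entry of each codebook is picked with a hard argmax over per-codebook scores, `quantize` is a made-up name, and all numbers are invented.

```python
def quantize(logits, codebooks, W, b):
    """logits: G x V scores; codebooks: G x V x (d/G) entries;
    W: f x d matrix and b: length-f bias of the final linear layer."""
    concat = []
    for g, row in enumerate(logits):
        v = max(range(len(row)), key=row.__getitem__)  # hard argmax choice
        concat.extend(codebooks[g][v])                 # entry v of codebook g
    # Linear projection R^d -> R^f
    return [sum(w * x for w, x in zip(w_row, concat)) + b_i
            for w_row, b_i in zip(W, b)]

# Toy sizes: G = 2 codebooks, V = 3 entries, d = 4, f = 2 (made-up numbers).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
]
logits = [[0.1, 0.9, 0.0], [0.3, 0.2, 0.5]]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b = [0.0, 0.5]
q = quantize(logits, codebooks, W, b)
print(q)  # [0.0, 3.5]
```

Since one entry is chosen independently from each codebook, $G$ codebooks of size $V$ can express up to $V^G$ distinct quantized vectors.
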
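How is it trained?
With the Gumbel softmax, which keeps the choice of codebook entries differentiable.
$z$ is mapped to logits $l \in \mathbb{R}^{G \times V}$
$p_{g,v}$: probability of choosing entry $v$ of codebook $g$
$$p_{g,v} = \frac{\exp((l_{g,v} + n_v)/\tau)}{ \sum ^V _{k=1} \exp((l_{g,k} + n_k) / \tau) }$$
$\tau$: temperature, $n = -\log(-\log(u))$ with $u \sim \mathcal{U}(0, 1)$: Gumbel noise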
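
A minimal sketch of the Gumbel softmax for a single codebook, using only the standard library. The clamping of $u$ and the max-subtraction are numerical-stability choices added here, not something the note specifies, and `gumbel_softmax` is an illustrative name.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Probabilities p_{g,v} for one codebook's logits l_{g,1..V}."""
    noisy = []
    for l in logits:
        u = min(max(rng.random(), 1e-10), 1 - 1e-10)      # keep log() finite
        n = -math.log(-math.log(u))                        # Gumbel noise
        noisy.append((l + n) / tau)
    m = max(noisy)                                         # subtract max for stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
for tau in (2.0, 0.5, 0.1):
    p = gumbel_softmax([1.0, 2.0, 0.5], tau=tau, rng=rng)
    print(tau, [round(x, 3) for x in p])
```

Lower values of $\tau$ push the distribution towards a one-hot choice. In the paper the forward pass takes the argmax entry, while gradients flow through the soft probabilities (a straight-through estimator).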

Training

$$ \mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$$

Contrastive loss

$$ \mathcal{L}_m = -\log \frac{ \exp( sim (c_t, q_t) / \kappa) }{ \sum _ {\tilde{q} \sim Q_t} \exp (sim (c_t , \tilde{q} )/ \kappa) } $$

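$sim(a, b) = a^\top b / (\|a\| \|b\|)$: cosine similarity
$\kappa$: temperature
$Q_t$: the candidate set, containing the one positive $q_t$ and $K$ negative samples (distractors). The negatives are drawn from other masked time steps of the same utterance.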
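
The contrastive loss for one masked time step can be sketched as below. The vectors are toy values, `contrastive_loss` is a made-up name, and the default `kappa=0.1` is only an illustrative choice.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(c_t, q_pos, distractors, kappa=0.1):
    """-log softmax score of the true q_t among Q_t = {q_t} + distractors."""
    candidates = [q_pos] + list(distractors)
    scores = [cosine(c_t, q) / kappa for q in candidates]
    m = max(scores)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

c_t = [1.0, 0.0]
good = contrastive_loss(c_t, q_pos=[0.9, 0.1], distractors=[[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss(c_t, q_pos=[0.0, 1.0], distractors=[[0.9, 0.1], [-1.0, 0.0]])
print(good, bad)  # the loss is smaller when c_t points towards the true q_t
```

The loss is close to zero when $c_t$ is far more similar to $q_t$ than to any distractor, and grows when a distractor matches $c_t$ better.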

Diversity Loss

$$\mathcal{L}_d = \frac{1}{GV} \sum _{g=1} ^G - H(\bar{p}_g) = \frac{1}{GV} \sum ^G _ {g=1} \sum _{v=1} ^V \bar{p}_{g,v} \log \bar{p}_{g,v} $$
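
A small sketch of the diversity loss, where $\bar{p}_g$ is the softmax distribution over the entries of codebook $g$ averaged across a batch. The two input matrices are invented to show the extremes, and `diversity_loss` is an illustrative name.

```python
import math

def diversity_loss(p_bar):
    """p_bar: G x V matrix; row g is the batch-averaged softmax over codebook g."""
    G, V = len(p_bar), len(p_bar[0])
    # Sum of p * log(p) is the negative entropy; terms with p = 0 contribute 0.
    neg_entropy = sum(p * math.log(p) for row in p_bar for p in row if p > 0)
    return neg_entropy / (G * V)

uniform = [[0.25] * 4, [0.25] * 4]        # every entry used equally often
collapsed = [[1.0, 0.0, 0.0, 0.0]] * 2    # only one entry is ever used
print(diversity_loss(uniform), diversity_loss(collapsed))
```

Minimizing $\mathcal{L}_d$ maximizes the entropy of $\bar{p}_g$, so uniform usage of the $V$ entries gives the lowest value and a collapsed codebook gives the highest (zero).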

In the paper, a linear layer is added on top of the pretrained wav2vec 2.0 model, and the model is fine-tuned with a CTC loss.
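
To make the role of the added linear layer concrete: it outputs per-frame scores over the output tokens plus a CTC blank, and CTC maps a frame-level labelling to text by merging repeated labels and then dropping blanks. The sketch below shows only that collapse rule with greedy decoding; it does not compute the CTC loss itself, and the vocabulary and scores are made up.

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """frame_scores: T x |vocab| scores from the linear layer (toy values here)."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:   # merge repeats, then drop blanks
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

vocab = ["_", "a", "b"]                    # "_" is the blank (hypothetical vocab)
frames = [
    [0.1, 0.8, 0.1],   # a
    [0.1, 0.7, 0.2],   # a (repeat, merged)
    [0.9, 0.0, 0.1],   # blank
    [0.2, 0.7, 0.1],   # a (kept, because a blank separates it)
    [0.1, 0.2, 0.7],   # b
]
print(ctc_greedy_decode(frames, vocab))  # aab
```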

Base model:

The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves
a word error rate of 5.2/8.6 on the Librispeech clean/other test sets.
Jointly learning discrete units and contextualized representations clearly improves over previous work.