https://arxiv.org/abs/2203.03582

**Motivation**

- CTC-based models usually underperform AED models and require the assistance of an external LM.
- CTC makes a conditional independence assumption between output tokens.
- Hard for CTC to utilize contextual linguistic information.

**Proposed**

- Transfer the knowledge of pre-trained language models (BERT, GPT-2) to a CTC-based ASR model. No inference slowdown: only the CTC branch is used for decoding.
- Two methods:
  - Representation learning (KT-RL): use CIF or PDS (from LASO) to align the lengths of the acoustic representations with the text tokens, then match them to the PLM representations with a cosine embedding loss (better than MSE loss).
  - Classification learning (KT-CL): similar to *Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model*.


## Method

### Representation Learning

- Notation:
  - wav2vec 2.0 encoder output: $H = (h_1, \cdots, h_M)$
  - target labels: $T = (y_1, \cdots, y_N)$

Two mechanisms for length alignment:

- CIF (Continuous Integrate-and-Fire)
  - Integrates the $h_m$ by weighted sums to produce token-level representations $l_n$
  - A scaling strategy restricts the output length to exactly $N$ during training (sketch after this list)

- Attention
  - One cross-attention layer with positional embeddings $P$ as queries and $H$ as keys/values (LASO's PDS module), yielding $N$ outputs
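A minimal sketch of the CIF-style length alignment, assuming a scalar weight per frame (e.g. from a sigmoid over a learned projection); function and variable names here are mine, not from the paper:

```python
import torch

def cif_integrate(H, alpha, N, threshold=1.0):
    """Continuous Integrate-and-Fire (sketch, single utterance).

    H:     (M, D) encoder outputs h_1..h_M
    alpha: (M,)   per-frame weights in (0, 1)
    N:     number of target tokens

    Training-time scaling strategy: rescale alpha so it sums to N,
    which guarantees exactly N fired token representations l_1..l_N.
    """
    alpha = alpha * (N / alpha.sum())
    acc_w, acc_h = 0.0, torch.zeros_like(H[0])
    tokens = []
    for m in range(H.size(0)):
        w = float(alpha[m])
        if acc_w + w < threshold:            # keep integrating this frame
            acc_w += w
            acc_h = acc_h + w * H[m]
        else:                                # fire a token boundary
            w_used = threshold - acc_w       # weight that completes the token
            tokens.append(acc_h + w_used * H[m])
            acc_w = w - w_used               # leftover weight starts next token
            acc_h = acc_w * H[m]
    if len(tokens) < N:                      # numerical leftover -> last token
        tokens.append(acc_h)
    return torch.stack(tokens[:N])           # (N, D) token-level l_1..l_N
```

At inference no scaling is applied; the number of fired tokens itself predicts the output length.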

Objective: CTC loss and cosine embedding loss

$$\mathcal{L}_{cos} = k \cdot \sum_{n=0}^{N} \left(1 - \cos(l_n, e_n)\right)$$

where $l_n$ is the aligned acoustic representation, $e_n$ the PLM (BERT) representation of token $y_n$, and $k$ a scale factor.

$$\mathcal{L}_{mtl} = \lambda \mathcal{L}_{ctc} + (1- \lambda) \mathcal{L}_{cos}$$
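A sketch of the combined objective in PyTorch, assuming `l` comes from CIF/PDS and `e` are the matching PLM token representations (shapes and names are my assumptions):

```python
import torch.nn.functional as F

def kt_rl_loss(log_probs, targets, in_lens, tgt_lens, l, e, k=1.0, lam=0.5):
    """L_mtl = lam * L_ctc + (1 - lam) * L_cos  (sketch).

    log_probs: (M, B, V) CTC branch log-probabilities
    targets:   (B, N)    token ids, with in_lens / tgt_lens as usual for CTC
    l:         (B, N, D) length-aligned acoustic representations l_n
    e:         (B, N, D) PLM (e.g. BERT) token representations e_n
    """
    l_ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens)
    # k * sum_n (1 - cos(l_n, e_n)), averaged over the batch
    l_cos = k * (1.0 - F.cosine_similarity(l, e, dim=-1)).sum(-1).mean()
    return lam * l_ctc + (1.0 - lam) * l_cos
```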

- CIF is better than ATT, but much slower
- Attention is too flexible
  - It extracts linguistic information through a weighted sum over all frames, so the knowledge of a word **may be transferred to other frames**. (Note: similar to my experiment)

- Cosine loss is better than MSE
  - **The angle (direction) may be more important than the magnitude** (toy example below)
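A toy illustration of this (my own example, not from the paper): when $l_n$ already points in the right direction, the cosine loss is zero while MSE still penalizes the magnitude mismatch.

```python
import torch
import torch.nn.functional as F

e = torch.tensor([1.0, 2.0, 3.0])            # PLM embedding e_n
l = 0.1 * e                                  # right direction, wrong magnitude

print(1 - F.cosine_similarity(l, e, dim=0))  # ~0: no cosine penalty
print(F.mse_loss(l, e))                      # 3.78: large MSE penalty
```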

### Classification Learning

- Similar to *Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model*: **joint training with a decoder can improve the performance of the CTC branch**.
- The speech-modal encoder output $H$ has to bridge the gap with the text modality.

- Joint training improves performance (sketch below)
- KT-RL outperforms KT-CL
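A minimal sketch of the joint-training idea as I read it: a classifier branch (here a hypothetical PLM-initialized head `decoder`) is trained with cross-entropy on the length-aligned representations, jointly with CTC, and is dropped at inference. The exact decoder structure and loss weighting in the paper may differ.

```python
import torch.nn.functional as F

def kt_cl_loss(log_probs, targets, in_lens, tgt_lens, l, decoder, lam=0.5):
    """Joint CTC + classification cross-entropy (sketch).

    decoder: hypothetical PLM-initialized head mapping l (B, N, D)
             to token logits (B, N, V); used only during training.
    """
    l_ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens)
    logits = decoder(l)                                      # (B, N, V)
    l_ce = F.cross_entropy(logits.transpose(1, 2), targets)  # targets: (B, N)
    return lam * l_ctc + (1.0 - lam) * l_ce
```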

## Results

- Still a pure CTC-based model at inference, so no inference-time cost is added
- The proposed model w/o external LM can surpass the vanilla wav2vec 2.0 model w/ external LM (autoregressive decoding)

- Bidirectional information is beneficial (BERT transfers better than GPT-2)

## Notes

- Frames that do not belong to any word also carry useful information (vs. spike-based methods)
- CIF is slower during training

- Angle (direction) may be more important than magnitude