[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models
https://arxiv.org/abs/2203.03582
Motivation
- CTC-based models are generally weaker than AED models and require the assistance of an external LM.
- Conditional independence assumption between output tokens
- Hard to utilize contextual information
Proposed
- Transfer the knowledge of pretrained language models (BERT, GPT-2) to a CTC-based ASR model. No inference slowdown: only the CTC branch is used to decode (see the greedy-decoding sketch after this list).
- Two methods:
- Representation learning (KT-RL): use CIF or PDS (from LASO) to align the acoustic representations to the target length, then match them to the LM representations with a cosine embedding loss (better than MSE loss).
- Classification learning (KT-CL): similar to "Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models".
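A minimal sketch (not the authors' code), assuming PyTorch, of the decoding path implied above: since only the CTC branch is used at inference, decoding reduces to greedy CTC over the frame-level logits, with no external LM. `BLANK_ID` and the `(T, V)` logits shape are assumptions.

```python
import torch

BLANK_ID = 0  # assumed blank index

def ctc_greedy_decode(logits: torch.Tensor) -> list[int]:
    """logits: (T, V) frame-level scores from the CTC branch."""
    best = logits.argmax(dim=-1).tolist()      # best label per frame
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != BLANK_ID:    # collapse repeats, drop blanks
            out.append(tok)
        prev = tok
    return out
```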
Method
Representation Learning
- notation
- wav2vec2.0 encoder output: $H = (h_1, \cdots, h_M)$
- target labels: $T = (y_1, \cdots, y_N)$
Two mechanisms (sketched below):
- CIF
- Integrate the $h_m$ by weighted sums to produce the $l_n$
- Restrict the number of integrated vectors to $N$ (the target length)
- Attention
- Use one layer of cross attention with positional embeddings $P$ as queries over $H$ (the PDS module from LASO)
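A rough sketch of the two alignment mechanisms, assuming PyTorch; this is my reconstruction rather than the paper's code, and names like `cif_align` and `PDSAligner` are hypothetical.

```python
import torch
import torch.nn as nn

def cif_align(H: torch.Tensor, alpha: torch.Tensor, N: int, beta: float = 1.0) -> torch.Tensor:
    """Continuous integrate-and-fire: accumulate per-frame weights `alpha` and emit one
    integrated (weighted-sum) vector each time the accumulator crosses `beta`.
    During training, alpha is rescaled so that exactly N vectors fire.
    H: (M, d) encoder output, alpha: (M,) non-negative weights."""
    alpha = alpha * (N * beta / alpha.sum())        # scale total weight to N * beta
    outputs, acc = [], 0.0
    frame = torch.zeros(H.size(1), device=H.device)
    for m in range(H.size(0)):
        a = float(alpha[m])
        if acc + a < beta:                          # keep integrating the current label
            acc += a
            frame = frame + a * H[m]
        else:                                       # fire: close current label, carry the spill over
            spill = acc + a - beta
            frame = frame + (a - spill) * H[m]
            outputs.append(frame)
            acc, frame = spill, spill * H[m]
    return torch.stack(outputs[:N])                 # (N, d)

class PDSAligner(nn.Module):
    """One layer of cross attention: learned positional embeddings P act as queries
    over the encoder output H (keys/values), following LASO's PDS idea."""
    def __init__(self, d_model: int, max_len: int = 512, n_heads: int = 4):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, H: torch.Tensor, N: int) -> torch.Tensor:
        """H: (1, M, d) -> (1, N, d) token-level representations."""
        P = self.pos(torch.arange(N, device=H.device)).unsqueeze(0)
        L, _ = self.attn(query=P, key=H, value=H)
        return L
```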
Objective: CTC loss and cosine embedding loss
$$\mathcal{L}_{cos} = k \cdot \sum _ {n=1} ^{N} \left(1 - \cos (l_n , e_n)\right)$$
where $e_n$ is the pretrained LM's representation of the $n$-th target token and $k$ is a scaling factor.
$$\mathcal{L}_{mtl} = \lambda \mathcal{L}_{ctc} + (1- \lambda) \mathcal{L}_{cos}$$
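A minimal sketch of the two losses above, assuming PyTorch; `l` are the length-aligned acoustic representations, `e` the corresponding pretrained-LM representations, and `k`, `lam` hyperparameters (all names hypothetical).

```python
import torch
import torch.nn.functional as F

def cosine_embedding_loss(l: torch.Tensor, e: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    # L_cos = k * sum_n (1 - cos(l_n, e_n)); l, e: (N, d)
    return k * (1.0 - F.cosine_similarity(l, e, dim=-1)).sum()

def mtl_loss(log_probs, targets, input_lens, target_lens, l, e, lam=0.5, k=1.0):
    # L_mtl = lam * L_ctc + (1 - lam) * L_cos
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    return lam * ctc + (1.0 - lam) * cosine_embedding_loss(l, e, k)
```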
- CIF is better than ATT, but much slower to train
- Attention is too flexible
- It extracts linguistic information through a weighted sum over all frames, so the knowledge of a word may be transferred to other frames. (Note: similar to my experiment)
- Cosine embedding loss is better than MSE
- The angle (direction) of the representation may be more important than its magnitude
Classification Learning
- Similar to "Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models"
- Joint training with a decoder can improve the performance of the CTC branch (a loss sketch follows this list)
- The speech-modal encoder output $H$ has to bridge the gap with the text modality
- KT-RL outperforms KT-CL
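A sketch of one plausible form of the classification-learning objective (joint CTC + cross-entropy through the pretrained LM), assuming PyTorch and a Hugging Face-style BERT; this is my reading, not the authors' implementation, and `aligner`, `proj`, and `bert` are hypothetical modules.

```python
import torch
import torch.nn.functional as F

def kt_cl_loss(enc_out: torch.Tensor, log_probs, targets, in_lens, tgt_lens,
               aligner, proj, bert, lam: float = 0.5):
    """enc_out: (1, M, d) wav2vec 2.0 encoder output; targets: (1, N) token ids."""
    N = targets.size(1)
    L = aligner(enc_out, N)                                   # (1, N, d) length-aligned reps
    hidden = bert(inputs_embeds=proj(L)).last_hidden_state    # pass through the pretrained LM
    logits = hidden @ bert.get_input_embeddings().weight.T    # tied output projection (assumption)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
    return lam * ctc + (1.0 - lam) * ce
```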
Results
- Remains a pure CTC-based model at inference (no change in decoding speed)
- The proposed model w/o an external LM can surpass the vanilla wav2vec 2.0 model w/ an external LM (AR model)
- Bidirectional information is beneficial (BERT vs. GPT-2)
Note:
- The frames that do not belong to a word also carry useful information (vs. spike-based methods)
- CIF is slower during training
- Angle may be more important than magnitude