[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation

  • CTC-based models are always weaker than AED models and requrire the assistance of external LM.
  • Conditional independence assumption
  • Hard to utilize contextualize information

Proposed

  • Transfer the knowledge of pretrained language model(BERT, GPT-2) to CTC-based ASR model. No inference speed reduction, only use CTC branch to decode.
  • Two method:
  • Representation learning: use CIF or PDS(LASO) to align the number of representation. Use cosine embedding loss(better than MSE loss)
  • Classification learning (Similar to Improving Hybrid CTC_Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model)

Method

Representation Learning

  • notation
  • wav2vec2.0 encoder output: $H = (h_1, \cdots, h_M)$
  • target labels: $T = (y_1, \cdots, y_N)$

Two Mechanism:

  • CIF
  • Integrate $h_m$ by weighted sum to collect $l_n$
  • restrict the length to $N$
  • Attention
  • Using one layer cross attention with positional embeddings $P$ and $H$. (LASO: PDS module)

Objective: CTC loss and cosine embedding loss

$$\mathcal{L}_{cos} = k \cdot \sum _ {n=0} ^{N} (1 - \cos (l_n , e_n))$$

$$\mathcal{L}_{mtl} = \lambda \mathcal{L}_{ctc} + (1- \lambda) \mathcal{L}_{cos}$$

Abalation study on representation learning
  • CIF is better than ATT, but much slower
  • Attention is too flexible
  • extract linguistic information through a weighted sum, so the knowledge to a word may transferred to other frame. (Note: Similar to my experiment)
  • Cosine is better than MSE
  • Angel may be more important

Classification Learning

  • Similar to Improving Hybrid CTC_Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model
  • Joint training with decoder can improve the performance of CTC brach
  • The speech-modal encoder output $H$ have to bridge the gap with text-modal.
  • joint training can improve performance
  • KT-RL outperform KT-CL

Results

  • CTC-based model (no inference time reduction)
  • proposed model w/o LM can surpass Vanilla Wav2vec  model w/ LM (AR model)
  • Bidirectional information is benefitial

Note:

  • The frames which not belongs to a word are also useful information. (vs. spikes-based method)
  • CIF is slower during training
  • Angel may be more important