[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation

  • CTC-based models are generally weaker than AED models and require the assistance of an external LM.
    • Conditional independence assumption
    • Hard to utilize contextualized linguistic information

Proposed

  • Transfer the knowledge of pretrained language models (BERT, GPT-2) into a CTC-based ASR model. No inference slowdown: only the CTC branch is used for decoding.
    • Two methods:
      • Representation learning (KT-RL): use CIF or PDS (from LASO) to align the number of acoustic representations to the number of target tokens. Trained with a cosine embedding loss (better than MSE loss).
      • Classification learning (KT-CL): similar to "Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model".

Method

Representation Learning

  • notation
    • wav2vec2.0 encoder output: $H = (h_1, \cdots, h_M)$
    • target labels: $T = (y_1, \cdots, y_N)$

Two Mechanisms:

  • CIF
    • Integrates the frame-level outputs $h_m$ by weighted sums to produce the token-level representations $l_n$ (a minimal sketch follows this list)
    • Restricts the output length to $N$ (the number of target tokens)
  • Attention
    • A single cross-attention layer with positional embeddings $P$ as queries over $H$ (the PDS module from LASO; see the sketch below)
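
A minimal sketch of CIF-style integration for a single utterance, assuming per-frame weights $\alpha_m$ predicted by a small network and rescaled so that they sum to $N$; function name and shapes are illustrative, not the authors' implementation:

```python
import torch

def cif_integrate(H, alpha, threshold=1.0):
    # H:     (M, d) frame-level encoder outputs h_1..h_M
    # alpha: (M,)   non-negative per-frame weights; during training they are
    #               typically rescaled so that alpha.sum() == N, which restricts
    #               the number of emitted token-level vectors to N
    tokens = []
    acc_w = torch.zeros(())            # weight accumulated since the last firing
    acc_h = torch.zeros_like(H[0])     # weighted frame sum accumulated so far
    for m in range(H.size(0)):
        w = alpha[m]
        if acc_w + w < threshold:      # not enough weight yet: keep integrating
            acc_w = acc_w + w
            acc_h = acc_h + w * H[m]
        else:                          # threshold crossed: fire a token vector l_n
            w_used = threshold - acc_w           # part of w that completes l_n
            tokens.append(acc_h + w_used * H[m])
            acc_w = w - w_used                   # remainder starts the next token
            acc_h = (w - w_used) * H[m]
    return torch.stack(tokens) if tokens else H.new_zeros(0, H.size(1))

# Illustrative usage: 50 frames, 768-dim features, 12 target tokens
H = torch.randn(50, 768)
alpha = torch.sigmoid(torch.randn(50))
alpha = alpha * (12 / alpha.sum())     # scaling so that roughly 12 tokens fire
L_tok = cif_integrate(H, alpha)        # (~12, 768) token-level representations l_n
```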
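And a hedged sketch of the attention-based aligner in the spirit of LASO's PDS module: $N$ learned positional embeddings act as queries over the frame-level outputs $H$ in a single cross-attention layer. The class name and hyperparameters are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PDSAligner(nn.Module):
    """One cross-attention layer: N positional queries P attend over H of shape (B, M, d)."""
    def __init__(self, d_model=768, n_heads=8, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)                 # positional embeddings P
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, H, N):
        B = H.size(0)
        q = self.pos.weight[:N].unsqueeze(0).expand(B, -1, -1)    # (B, N, d) queries
        out, _ = self.attn(q, H, H)                               # attend over all frames
        return out                                                # (B, N, d) token-level l_n

# Illustrative usage
aligner = PDSAligner()
H = torch.randn(2, 50, 768)            # batch of 2 utterances, 50 frames each
L_tok = aligner(H, N=12)               # (2, 12, 768)
```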

Objective: CTC loss and a cosine embedding loss between $l_n$ and the pretrained-LM representations $e_n$

$$\mathcal{L}_{cos} = k \cdot \sum_{n=0}^{N} \left(1 - \cos(l_n, e_n)\right)$$

$$\mathcal{L}_{mtl} = \lambda \mathcal{L}_{ctc} + (1- \lambda) \mathcal{L}_{cos}$$
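
A small sketch of how the two terms could be combined in code, assuming $e_n$ are the token representations from the pretrained LM and treating the scale $k$ and weight $\lambda$ as plain hyperparameters; this is an illustration, not the authors' exact implementation:

```python
import torch.nn.functional as F

def kt_rl_loss(ctc_log_probs, input_lens, targets, target_lens,
               L_tok, E_plm, lam=0.5, k=1.0):
    # ctc_log_probs: (T, B, V) log-probabilities from the CTC branch
    # L_tok:         (B, N, d) token-level acoustic representations l_n (from CIF/PDS)
    # E_plm:         (B, N, d) target representations e_n from the pretrained LM (e.g. BERT)
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens, blank=0)
    # L_cos = k * sum_n (1 - cos(l_n, e_n)), averaged over the batch here
    cos = k * (1.0 - F.cosine_similarity(L_tok, E_plm, dim=-1)).sum(dim=1).mean()
    return lam * ctc + (1.0 - lam) * cos
```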

Ablation study on representation learning

  • CIF is better than ATT, but much slower
    • Attention is too flexible
    • It extracts linguistic information through a weighted sum over all frames, so the knowledge for one word may be transferred to other frames. (Note: similar to my experiment)
  • Cosine is better than MSE
    • The angle (direction) may be more important than the magnitude

Classification Learning

  • Similar to "Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model"
  • The speech-modal encoder output $H$ has to bridge the gap with the text modality
  • Joint training with the decoder improves the performance of the CTC branch (a hedged sketch follows this list)
  • KT-RL outperforms KT-CL
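
A hedged sketch of the classification-style objective, assuming a hybrid setup where a decoder initialized from the pretrained LM predicts the target tokens from $H$ and is trained jointly with the CTC branch; names, shapes, and the weighting are assumptions, not the paper's exact recipe:

```python
import torch.nn.functional as F

def kt_cl_loss(ctc_log_probs, input_lens, ctc_targets, target_lens,
               decoder_logits, dec_targets, pad_id, lam=0.5):
    # ctc_log_probs:  (T, B, V) log-probabilities from the CTC branch
    # decoder_logits: (B, N, V) token predictions made from H by the PLM-initialized decoder
    # dec_targets:    (B, N)    gold token ids (pad_id where padded)
    # Only the CTC branch is kept at inference; the decoder is a training-time helper.
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens, blank=0)
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), dec_targets, ignore_index=pad_id)
    return lam * ctc + (1.0 - lam) * ce
```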

Results

  • Still a CTC-based model at inference: decoding uses only the CTC branch, so inference cost is unchanged from the vanilla CTC model
  • The proposed model w/o an external LM can surpass the vanilla wav2vec 2.0 model w/ an external LM (autoregressive decoding)
  • Bidirectional contextual information (e.g. from BERT) is beneficial

Note:

  • Frames that do not belong to a word also carry useful information (vs. spike-based methods)
    • CIF is slower during training
  • The angle (direction) may be more important than the magnitude