Month: March 2022

[Note] Understanding the role of self attention for efficient speech recognition

  • Self-attention plays two roles in the success of Transformer-based ASR
    • The “attention maps” in the self-attention modules can be categorized into two groups:
      • “phonetic” (vertical) and “linguistic” (diagonal)
    • Phonetic: lower layers, extract phonologically meaningful global context
    • Linguistic: higher layers, attend to local context
    • The phonetic variance is standardized in the lower self-attention layers so that the upper self-attention layers can identify local linguistic features (one way to quantify the two map types is sketched below).
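To make the vertical/diagonal distinction concrete, here is a minimal sketch of one way to score how diagonal an attention map is. The metric (attention-weighted distance from the main diagonal) is my own illustration for separating "linguistic" (diagonal, low score) from "phonetic" (vertical/global, high score) heads, not necessarily the criterion used in the paper.

```python
import torch

def diagonality(attn: torch.Tensor) -> torch.Tensor:
    """Attention-weighted mean distance from the diagonal.

    attn: (T, T) attention map whose rows sum to 1.
    Near 0  -> diagonal ("linguistic") behaviour;
    large   -> vertical/global ("phonetic") behaviour.
    """
    T = attn.size(0)
    idx = torch.arange(T, dtype=attn.dtype)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()  # dist[i, j] = |i - j|
    return (attn * dist).sum() / T

# A perfectly diagonal map scores 0; a uniform (global) map scores ~T/3.
print(diagonality(torch.eye(8)))            # tensor(0.)
print(diagonality(torch.full((8, 8), 1/8)))  # tensor(2.625)
```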


[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations


[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation

  • CTC-based models are generally weaker than attention-based encoder-decoder (AED) models and require the assistance of an external LM.
    • Conditional independence assumption between output tokens
    • Hard to utilize contextual information

Proposed

  • Transfer the knowledge of pre-trained language models (BERT, GPT-2) into a CTC-based ASR model. No inference slowdown, since only the CTC branch is used for decoding.
    • Two methods:
      • Representation learning: use CIF or PDS (from LASO) to align the number of acoustic representations with the number of text tokens (see the sketch below).
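A minimal sketch of the representation-learning idea, under my own assumptions: a simplified CIF module compresses frame-level encoder outputs into token-level vectors, which are then pulled toward a frozen LM's hidden states alongside the CTC loss. The β threshold, the cosine distillation objective, and the interpolation weight α are illustrative choices, not the paper's exact ones.

```python
import torch
import torch.nn.functional as F

def cif(frames, weights, beta=1.0):
    """Simplified continuous integrate-and-fire (CIF): accumulate a
    per-frame weight and emit a weighted sum of frames each time the
    accumulator crosses beta, so T frames become roughly
    sum(weights)/beta token-level vectors. Assumes each weight <= beta
    (at most one token boundary per frame). frames: (T, H); weights: (T,)."""
    out, acc = [], 0.0
    cache = torch.zeros(frames.size(1))
    for t in range(frames.size(0)):
        w = weights[t].item()
        if acc + w >= beta:                 # token boundary falls inside frame t
            used = beta - acc               # part of w that completes this token
            out.append(cache + used * frames[t])
            cache = (w - used) * frames[t]  # remainder starts the next token
            acc = w - used
        else:
            cache = cache + w * frames[t]
            acc += w
    return torch.stack(out)

def transfer_loss(log_probs, targets, in_lens, out_lens,
                  token_reprs, bert_reprs, alpha=0.5):
    """CTC loss plus a distillation term matching the CIF-aligned
    acoustic token representations to the frozen LM's hidden states
    (both (num_tokens, H)). alpha is an assumed interpolation weight."""
    ctc = F.ctc_loss(log_probs, targets, in_lens, out_lens, blank=0)
    distill = (1 - F.cosine_similarity(token_reprs, bert_reprs, dim=-1)).mean()
    return ctc + alpha * distill
```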


[Note] A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition

https://arxiv.org/abs/2201.08930

  • wav2vec 2.0 is fairly robust to domain shift, but its noise robustness is still unclear.
  • First analyze the noise robustness of wav2vec 2.0 via experiments.
  • Observation:
    • wav2vec 2.0 pre-trained on noisy data obtains good performance on the noisy set, but degrades on the clean set.
  • Proposed: enhanced wav2vec 2.0 (sketched after this list)
    • Noisy and clean speech are fed into the same encoder,
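A minimal sketch of the shared-encoder idea, under my own assumptions: the noisy view of an utterance is pushed toward the clean view's representation, with this consistency term added to the usual pre-training objective. The L1 distance, the stop-gradient on the clean branch, and which layer is matched are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def noisy_clean_consistency(encoder, clean_wav, noisy_wav):
    """Feed the clean and noisy views of the same utterance through one
    shared encoder (same weights for both passes) and penalize the
    distance between their latent representations. The clean branch is
    treated as the target via a stop-gradient."""
    clean_repr = encoder(clean_wav)   # (batch, frames, hidden)
    noisy_repr = encoder(noisy_wav)   # identical encoder, identical weights
    return F.l1_loss(noisy_repr, clean_repr.detach())
```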


[Note] Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

https://arxiv.org/abs/2202.03218

  • Self-supervised speech models (wav2vec 2.0, HuBERT) have achieved great success
    • Fine-tuning the whole pre-trained model is computationally expensive and does not scale well: $O(10^8)$ parameters per task.

Contributions:

  • Apply adapters to the wav2vec 2.0 model to reduce the number of parameters required for downstream tasks (a sketch follows).
    • Adapters are small trainable modules that can be inserted into the layers of a frozen pre-trained network for a particular task.
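A minimal sketch of a standard bottleneck adapter in the usual Houlsby-style layout (down-projection, nonlinearity, up-projection, residual). The bottleneck width and exactly where the adapter sits inside each Transformer layer are tunable choices not fixed by this note.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: only these few parameters are trained per
    downstream task while the pre-trained network stays frozen."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter starts near the identity,
        # so inserting it barely perturbs the frozen model at first.
        return x + self.up(self.act(self.down(x)))
```

With a hidden size of 768 and a 256-d bottleneck, each adapter adds roughly 0.4M parameters per layer, a small fraction of the $O(10^8)$ cost of fine-tuning the whole model per task.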
