ASR

[Note] Conformer: Convolution-augmented Transformer for Speech Recognition

  • SOTA performance on LibriSpeech
  • A novel way to combine CNN + Transformer
    • to model both local (CNN) and global (self-attention) dependencies
  • Conformer
    • 4 modules: 0.5 FFN + MHSA + CNN + 0.5 FFN (see the sketch below)
    • Macaron-style half-step residual FFN
    • Placing the CNN after MHSA is more effective
    • Swish activation led to faster convergence
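A minimal sketch of the block structure summarized above, assuming PyTorch; the model width, head count, FFN expansion, and depthwise kernel size are illustrative defaults, not the paper's exact configuration, and dropout/relative positional encoding are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    """Illustrative Conformer block: 0.5*FFN -> MHSA -> Conv -> 0.5*FFN -> LayerNorm."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=31, ff_mult=4):
        super().__init__()
        # Macaron-style feed-forward modules, each added with a half-step residual.
        self.ffn1 = self._ffn(d_model, ff_mult)
        self.ffn2 = self._ffn(d_model, ff_mult)
        # Multi-head self-attention captures global dependencies.
        self.mhsa_norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module captures local dependencies (placed after MHSA).
        self.conv_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, ff_mult):
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),                                    # Swish activation
            nn.Linear(ff_mult * d_model, d_model),
        )

    def forward(self, x):                                 # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                        # half-step residual FFN
        h = self.mhsa_norm(x)
        x = x + self.mhsa(h, h, h, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)             # (batch, d_model, time) for Conv1d
        c = F.glu(self.pointwise1(c), dim=1)              # pointwise conv + GLU
        c = F.silu(self.bn(self.depthwise(c)))            # depthwise conv + BN + Swish
        c = self.pointwise2(c).transpose(1, 2)
        x = x + c
        x = x + 0.5 * self.ffn2(x)                        # second half-step residual FFN
        return self.final_norm(x)

# Usage: one block over a batch of frame-level features.
out = ConformerBlockSketch()(torch.randn(2, 100, 256))    # -> (2, 100, 256)
```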


[Note] InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR

https://arxiv.org/abs/2204.00174

  • A novel training method for CTC-based ASR using augmented intermediate representations for conditioning
    • an extension of self-conditioned CTC
  • Methods: noisy conditioning
    • feature space: mask time steps or feature dimensions
    • token space: insert, delete, or substitute tokens in the conditioning sequence (see the sketch below)
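A rough sketch of the token-space conditioning augmentation listed above, assuming the conditioning signal can be treated as a sequence of token IDs; the corruption probabilities, their order of application, and the function name are illustrative assumptions, not the paper's settings.

```python
import random

def augment_condition_tokens(tokens, vocab_size, p_del=0.1, p_sub=0.1, p_ins=0.1):
    """Randomly delete, substitute, and insert tokens in an intermediate
    prediction before it is fed back as the self-conditioning signal."""
    out = []
    for tok in tokens:
        r = random.random()
        if r < p_del:
            continue                                    # deletion: drop this token
        elif r < p_del + p_sub:
            out.append(random.randrange(vocab_size))    # substitution: random token
        else:
            out.append(tok)                             # keep the original token
        if random.random() < p_ins:
            out.append(random.randrange(vocab_size))    # insertion: spurious token
    return out

# Example: corrupt a greedy intermediate hypothesis before conditioning on it.
noisy = augment_condition_tokens([12, 7, 7, 30, 5], vocab_size=100)
```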


[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations


[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation

  • CTC-based models are typically weaker than AED models and require the assistance of an external LM.
    • Conditional independence assumption
    • Hard to utilize contextual information

Proposed

  • Transfer the knowledge of pretrained language models (BERT, GPT-2) to a CTC-based ASR model. No inference speed penalty: only the CTC branch is used for decoding.
    • Two methods:
      • Representation learning: use CIF or PDS (LASO) to align the number of acoustic representations with the token sequence (see the sketch below).
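A hedged sketch of the representation-learning idea: length-aligned acoustic representations (e.g. produced by CIF) are pulled toward frozen LM hidden states by a distillation term trained jointly with the CTC loss. The MSE form, the weight `alpha`, and the function name are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ctc_plus_kd_loss(log_probs, targets, input_lengths, target_lengths,
                     aligned_acoustic, lm_hidden, alpha=0.5):
    """Joint objective: CTC loss on the acoustic branch plus an MSE
    distillation loss pulling length-aligned acoustic representations
    toward frozen pretrained-LM hidden states of the same shape."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)        # log_probs: (T, N, C)
    kd = F.mse_loss(aligned_acoustic, lm_hidden.detach())  # LM is frozen
    return ctc + alpha * kd
```

At inference time the LM and the alignment module are discarded, so decoding uses only the CTC branch, which is why there is no speed penalty.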


[Note] Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

https://arxiv.org/abs/2202.03218

  • Self-supervised learning models (wav2vec 2.0, HuBERT) have gained great success
    • Fine-tuning the pretrained model is computationally expensive and does not scale well: $O(10^8)$ parameters per task.

Contributions:

  • Applying adapters to the wav2vec 2.0 model to reduce the number of trainable parameters required for downstream tasks.
    • Adapters are small trainable modules that can be inserted into the layers of a frozen pre-trained network for a particular task (see the sketch below).
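A minimal bottleneck-adapter sketch in the spirit described above, assuming PyTorch; the hidden size, bottleneck width, activation, and placement inside each frozen transformer layer are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Only these few parameters are trained; the pretrained layer stays frozen."""

    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (batch, time, d_model)
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the pretrained encoder and train only adapters + CTC head.
# for p in pretrained_encoder.parameters():
#     p.requires_grad = False
```

With a 64-dim bottleneck each adapter adds roughly 2 * 768 * 64 ≈ 10^5 parameters per layer, orders of magnitude fewer than fine-tuning the full $O(10^8)$-parameter model per task.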
