[Note] Understanding the Role of Self Attention for Efficient Speech Recognition

* Self-attention plays two roles in the success of Transformer-based ASR
* The attention maps in the self-attention modules can be categorized into two groups: "phonetic" (vertical) and "linguistic" (diagonal); a toy classification sketch follows this list
* Phonetic: lower layers, extracting phonologically meaningful global context
* Linguistic: higher layers, attending to local context
* -> the phonetic variance is standardized in lower…
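As a toy illustration of the vertical/diagonal distinction (this is not the paper's metric; the statistics, function names, and thresholds here are all made up for demonstration), one could score an attention map by how far its mass sits from the diagonal and how concentrated its column mass is:

```python
# Illustrative sketch only: classify an attention map as "diagonal"
# (linguistic: local context) or "vertical" (phonetic: a few globally
# attended frames). Thresholds are arbitrary, not from the paper.
import numpy as np

def attention_stats(attn: np.ndarray):
    """attn: (T, T) matrix, each row a distribution over key positions."""
    T = attn.shape[0]
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])        # |query - key| offset
    mean_offset = float((attn * dist).sum() / T)      # small => diagonal
    col_mass = attn.sum(axis=0) / T                   # sums to 1 overall
    col_entropy = float(-(col_mass * np.log(col_mass + 1e-9)).sum())
    return mean_offset, col_entropy                   # low entropy => vertical

def classify(attn, offset_thresh=2.0, entropy_thresh=2.5):
    mean_offset, col_entropy = attention_stats(attn)
    if mean_offset < offset_thresh:
        return "diagonal (linguistic)"
    if col_entropy < entropy_thresh:
        return "vertical (phonetic)"
    return "mixed"

# Toy checks: a near-identity map reads as diagonal; a map where every
# query attends to the same two frames reads as vertical.
T = 50
diagonal = 0.95 * np.eye(T) + np.full((T, T), 0.05 / T)
vertical = np.zeros((T, T)); vertical[:, :2] = 0.5
print(classify(diagonal))   # diagonal (linguistic)
print(classify(vertical))   # vertical (phonetic)
```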

[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

* https://arxiv.org/abs/2006.11477
* Self-supervised speech representation learning
* Contrastive loss: masked continuous speech input -> quantized targets (sketched below)
* Quantization module: Gumbel softmax over latent-representation codebooks
* wav2vec 2.0 Large with 10 min of labeled data: 5.2/8.6 WER on LibriSpeech test-clean/test-other
* Fairseq implementation
* Well explained: https://neurosys.com/wav2vec-2-0-framework
* Feature Encoder (CNN) converts raw audio…
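A hedged PyTorch sketch of the two pieces named above: a Gumbel-softmax quantizer that picks codebook entries with a straight-through estimator, and the contrastive loss where the context vector at each masked step must identify the true quantized latent among sampled distractors. Dimensions, hyperparameters, and distractor sampling are simplified relative to the fairseq implementation, and the codebook-diversity loss is omitted.

```python
# Simplified sketch of wav2vec 2.0's quantizer + contrastive objective.
# Unlike fairseq's version, distractors here may collide with the
# positive (the real code masks those out).
import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    """Straight-through Gumbel-softmax choice of one code per group."""
    def __init__(self, dim=256, num_codes=320, groups=2):
        super().__init__()
        self.groups = groups
        self.proj = torch.nn.Linear(dim, groups * num_codes)
        self.codebook = torch.nn.Parameter(
            torch.randn(groups, num_codes, dim // groups))

    def forward(self, z, tau=2.0):
        B, T, D = z.shape
        logits = self.proj(z).view(B, T, self.groups, -1)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        q = torch.einsum('btgn,gnd->btgd', onehot, self.codebook)
        return q.reshape(B, T, D)

def contrastive_loss(context, quantized, num_distractors=100, kappa=0.1):
    """context, quantized: (B, T, D), restricted to masked time steps."""
    B, T, D = context.shape
    # Distractors: quantized latents sampled from other masked steps.
    idx = torch.randint(0, T, (B, T, num_distractors))
    distractors = torch.gather(
        quantized.unsqueeze(1).expand(B, T, T, D), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, D))
    # Candidate 0 is always the true quantized target.
    cands = torch.cat([quantized.unsqueeze(2), distractors], dim=2)
    c = F.normalize(context.unsqueeze(2), dim=-1)
    q = F.normalize(cands, dim=-1)
    logits = (c * q).sum(-1) / kappa              # cosine sim / temperature
    targets = torch.zeros(B, T, dtype=torch.long) # true target at index 0
    return F.cross_entropy(logits.view(B * T, -1), targets.view(-1))
```

In the paper this objective is paired with a diversity penalty that encourages uniform codebook usage; for downstream ASR, pre-training is followed by fine-tuning with a CTC head.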

[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation
* CTC-based models are typically weaker than AED models and require the assistance of an external LM
* Conditional independence assumption
* Hard to utilize contextualized information

Proposed
* Transfer the knowledge of pretrained language models (BERT, GPT-2) to the CTC-based ASR model (see the sketch below). No inference speed reduction, only use…
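A minimal sketch of the general training-time recipe (assumed; the paper's exact losses differ, and `asr_encoder` dimensions, `proj`, `auxiliary_kt_loss`, and `lambda_kt` are hypothetical names): a frozen BERT embeds the reference transcript, and an auxiliary loss pulls a pooled ASR-encoder representation toward that embedding. The BERT branch exists only during training, which matches the "no inference speed reduction" point above.

```python
# Hypothetical sketch of LM-to-CTC knowledge transfer, not the paper's
# exact method: regress a pooled acoustic representation onto a frozen
# BERT [CLS] embedding of the ground-truth transcript.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained("bert-base-uncased").eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for p in bert.parameters():   # the teacher LM stays frozen
    p.requires_grad_(False)

# Assumed acoustic-encoder width of 256, projected up to BERT's dim.
proj = torch.nn.Linear(256, bert.config.hidden_size)

def auxiliary_kt_loss(encoder_out, transcripts):
    """encoder_out: (B, T, 256) encoder states; transcripts: list[str]."""
    # Student: mean-pool over time, project into the teacher's space.
    student = proj(encoder_out.mean(dim=1))                  # (B, H)
    with torch.no_grad():
        batch = tokenizer(transcripts, return_tensors="pt", padding=True)
        teacher = bert(**batch).last_hidden_state[:, 0]      # [CLS], (B, H)
    # Cosine-distance regression toward the frozen teacher embedding.
    return 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

# Training: total_loss = ctc_loss + lambda_kt * auxiliary_kt_loss(enc_out, refs)
# Inference: run only the CTC branch; BERT and proj are discarded.
```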