Speech

[Note] Conformer: Convolution-augmented Transformer for Speech Recognition

  • SOTA performance on LibriSpeech
  • A novel way to combine CNN + Transformer
    • to model both local (CNN) and global (self-attention) dependencies
  • Conformer block
    • 4 modules: 0.5 FFN + MHSA + CNN + 0.5 FFN (sketched below)
    • Macaron-style half-step residual FFNs
    • placing the CNN module after MHSA is more effective
    • Swish activation led to faster convergence
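
A minimal PyTorch sketch of one Conformer block as described above (half-step FFNs sandwiching MHSA and a convolution module). The relative positional encoding used in the paper is omitted, and all dimensions, kernel sizes, and dropout rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, d_model, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                                   # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)                 # (B, D, T)
        y = F.glu(self.pointwise1(y), dim=1)             # gated linear unit
        y = F.silu(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ffn1 = FeedForward(d_model)
        self.norm_mhsa = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForward(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)                       # Macaron half-step FFN
        y = self.norm_mhsa(x)
        x = x + self.mhsa(y, y, y, need_weights=False)[0]
        x = x + self.conv(x)                             # conv placed after MHSA
        x = x + 0.5 * self.ffn2(x)                       # second half-step FFN
        return self.norm_out(x)
```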


[Note] Understand the role of self attention for efficient speech recognition

  • Self-attention plays two roles in the success of Transformer-based ASR
    • The attention maps in the self-attention modules fall into two groups
      • "phonetic" (vertical) and "linguistic" (diagonal)
    • Phonetic: lower layers, extract phonologically meaningful global context
    • Linguistic: upper layers, attend to local context
    • The phonetic variance is standardized in the lower SA layers so that the upper SA layers can identify local linguistic features (a toy diagnostic for the two map shapes is sketched below).
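
A hypothetical diagnostic (not from the paper) showing one way to separate the two attention-map shapes: the expected query-key distance under the attention distribution is near zero for diagonal (local/linguistic) maps and large for vertical (global/phonetic) ones.

```python
import torch

def mean_query_key_distance(attn: torch.Tensor) -> torch.Tensor:
    """attn: (T, T) row-stochastic attention map. Returns E[|i - j|] / T.

    Diagonal (local) maps score near 0; vertical maps, whose mass sits on a
    few key positions regardless of the query, score much higher.
    """
    T = attn.size(0)
    idx = torch.arange(T, dtype=attn.dtype)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()  # |i - j| for all pairs
    return (attn * dist).sum(dim=-1).mean() / T

# Toy check: a sharp diagonal map vs. a purely vertical one.
diag = torch.eye(8)
vert = torch.zeros(8, 8); vert[:, 0] = 1.0
print(mean_query_key_distance(diag))   # tensor(0.)
print(mean_query_key_distance(vert))   # ~0.44
```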


[Note] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations


[Note] Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

https://arxiv.org/abs/2203.03582

Motivation

  • CTC-based models are generally weaker than AED models and require the assistance of an external LM.
    • Conditional independence assumption
    • Hard to utilize contextualized information

Proposed

  • Transfer the knowledge of pretrained language models (BERT, GPT-2) to a CTC-based ASR model. No inference-speed penalty: only the CTC branch is used for decoding.
    • Two methods:
      • Representation learning: use CIF or PDS (LASO) to align the number of acoustic representations with the token sequence (a sketch of the resulting distillation loss follows below).
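
A rough sketch of the representation-learning branch, assuming the acoustic encoder output has already been compressed to one vector per token (e.g., by CIF); the projection layer, names, and loss weight are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def distill_loss(aligned_acoustic, bert_hidden, proj):
    """aligned_acoustic: (B, N, D_a) token-synchronous acoustic features
    bert_hidden:        (B, N, D_b) frozen LM features for the transcript
    proj:               nn.Linear(D_a, D_b) mapping acoustic -> LM space
    """
    # Match acoustic representations to the frozen LM's hidden states.
    return F.mse_loss(proj(aligned_acoustic), bert_hidden.detach())

# total = ctc_loss + lambda_kd * distill_loss(...)  # decode with CTC branch only
```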


[Note] A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition

https://arxiv.org/abs/2201.08930

  • wav2vec 2.0 is fairly robust to domain shift, while its noise robustness is still unclear.
  • First analyze the noise robustness of wav2vec 2.0 via experiments
  • Observation:
    • wav2vec 2.0 pretrained on noisy data obtains good performance on the noisy test set, but degrades on the clean set.
  • Proposed: Enhanced wav2vec 2.0
    • Noisy and clean speech are fed into the same encoder,
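
The excerpt cuts off here, so the following is only a guess at the general recipe: pass a clean utterance and its noise-mixed version through the shared encoder and penalize the distance between the two representations, so the encoder learns noise-invariant features. `encoder` and the loss weight are assumed names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def noise_consistency_loss(encoder, clean_wav, noisy_wav):
    z_clean = encoder(clean_wav)   # (B, T, D)
    z_noisy = encoder(noisy_wav)   # (B, T, D), same encoder weights
    # L2 consistency; the clean branch serves as the (detached) target.
    return F.mse_loss(z_noisy, z_clean.detach())

# total = contrastive_pretrain_loss + lambda_c * noise_consistency_loss(...)
```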


[Note] Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

https://arxiv.org/abs/2202.03218

  • Self-supervised learning models (wav2vec, HuBERT) have achieved great success
    • Fine-tuning the pretrained model is computationally expensive and does not scale well: $O(10^8)$ parameters per task.

Contributions:

  • Apply adapters to the wav2vec 2.0 model to reduce the number of parameters required for downstream tasks.
    • Adapters are small trainable modules that can be inserted into the layers of a frozen pre-trained network for a particular task (a minimal bottleneck adapter is sketched below).
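
A minimal sketch of the usual residual bottleneck adapter; the bottleneck size and near-identity initialization are common choices, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)               # start as near-identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual bottleneck

# Usage: freeze the pretrained encoder and train only the adapters (plus the
# task head) -- roughly 2 * d_model * bottleneck parameters per layer.
```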


Tomofun Dog Sound Recognition Challenge: Preliminary-Round Data Processing and Model (Top 10%)

My teammates carried us through the preliminary round into the finals, where I was mainly responsible for the MLOps part. I split the write-up into two posts: this one describes my approach in the preliminary round, and the other covers how we handled the extra challenge in the finals, incremental training on AWS.

Contribution of this experiment: with no external datasets and no pre-trained model, only applying augmentation and pseudo-labeling to the data provided by the organizers, a ResNet18 achieved a solid baseline score (Top 10%). A sketch of the pseudo-labeling step follows below.

Code: kehanlu/Tomofun-DogSound-recognition
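
A minimal sketch of the pseudo-labeling step mentioned above: label unlabeled clips with a trained model and keep only confident predictions for the next training round. The confidence threshold and all names are illustrative assumptions, not the repository's code.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_loader, threshold=0.9):
    model.eval()
    pseudo = []
    for wavs in unlabeled_loader:
        probs = model(wavs).softmax(dim=-1)    # (B, n_classes)
        conf, label = probs.max(dim=-1)
        for w, c, l in zip(wavs, conf, label):
            if c.item() >= threshold:          # keep confident clips only
                pseudo.append((w, l.item()))
    return pseudo                              # mix into the next training round
```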
