[Note] A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition


  • wav2vec 2.0 is robust to domain shift, but its noise robustness remains unclear.
  • The noise robustness of wav2vec 2.0 is first analyzed through experiments.
  • Observation:
    • wav2vec 2.0 pre-trained on noisy data performs well on the noisy dataset, but its performance degrades on the clean set.
  • Proposed: Enhanced wav2vec 2.0
    • Noisy and clean speech are fed into the same encoder, and the clean features provide the training target.


Architecture: Wav2vec 2.0

  • Feature encoder (CNN): $f: X \to Z$
  • Transformer encoder: $g: Z \to C$

  • Feature extraction via shared feature encoder:
    • $Z_{noisy} = f(X_{noisy})$, $Z_{clean} = f(X_{clean})$
  • Noisy (left part)
    • Mask a certain proportion of the noisy features $Z_{noisy}$ and replace them with a learnable vector.
    • The transformer encoder then encodes the masked features into the noisy contextualized representation:
    • $C_{noisy} = g(Z_{noisy})$
  • Clean (right part)
    • Discretize $Z_{clean}$ into $q_{clean}$ with the VQ module; $q_{clean}$ is used as the target in the contrastive loss.
    • $q_{clean} = VQ(Z_{clean})$
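  • $L_m$: contrastive loss that identifies the true clean target $q_{clean,t}$ among $K+1$ quantized candidates (the target plus $K$ distractors), given the noisy context representation.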
  • $L_d$: diversity loss
  • $L_f$: L2 penalty on the output of the feature encoder
  • $L_c$: Euclidean distance between $Z_{noisy}$ and $Z_{clean}$
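
The loss terms above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The names `contrastive_loss`, `consistency_loss`, and `total_loss`, the temperature, and the weights `alpha`, `beta`, `gamma` are all assumptions made for illustration; the note does not state how the four terms are weighted, nor whether $L_c$ is averaged over frames.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(c_noisy_t, q_clean_t, distractors, temperature=0.1):
    """L_m sketch: negative log-probability of the clean target among K+1 candidates.

    c_noisy_t   -- noisy context vector at a masked step t
    q_clean_t   -- quantized clean target at the same step
    distractors -- K other quantized vectors (assumed to be sampled elsewhere)
    """
    candidates = [q_clean_t] + list(distractors)
    logits = [cosine(c_noisy_t, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over the K+1 candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def consistency_loss(z_noisy, z_clean):
    """L_c sketch: Euclidean distance between paired frames, averaged over frames (assumption)."""
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(zn, zc)))
        for zn, zc in zip(z_noisy, z_clean)
    ]
    return sum(dists) / len(dists)


def total_loss(l_m, l_d, l_f, l_c, alpha=0.1, beta=10.0, gamma=1.0):
    """Weighted sum of the four terms; the weights are hypothetical placeholders."""
    return l_m + alpha * l_d + beta * l_f + gamma * l_c
```

The contrastive term is small when the noisy context vector points toward the clean target rather than toward a distractor, which is how the clean branch supervises the noisy branch in this sketch.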



  • Fine-tuning the wav2vec 2.0 model on noisy datasets improves performance on the noisy test set but degrades performance on the clean set.
  • Proposed: use clean data as the target when pre-training on the noisy set.
    • The enhanced model improves on the noisy set while avoiding degradation on the clean set.
    • It learns better representations (closer to those of clean data in cosine similarity).
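
The cosine-similarity comparison mentioned above could be computed along the following lines. This is a sketch under assumptions: `mean_frame_similarity` is a hypothetical helper, the frames are assumed to be time-aligned and non-zero, and the toy vectors are made up for illustration, not measurements from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_frame_similarity(rep_a, rep_b):
    """Average frame-wise cosine similarity between two aligned sequences."""
    sims = [cosine_similarity(a, b) for a, b in zip(rep_a, rep_b)]
    return sum(sims) / len(sims)


# Toy representations (illustrative values only).
clean = [[1.0, 0.0], [0.0, 1.0]]
noisy_baseline = [[1.0, 1.0], [1.0, 1.0]]   # stand-in for a baseline model's output on noisy speech
noisy_enhanced = [[1.0, 0.1], [0.1, 1.0]]   # stand-in for the enhanced model's output on noisy speech

baseline_score = mean_frame_similarity(noisy_baseline, clean)
enhanced_score = mean_frame_similarity(noisy_enhanced, clean)
```

A higher score for the noisy-speech representation against the clean-speech representation is what "closer to clean data" means in this comparison.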