[Note] Conformer: Convolution-augmented Transformer for Speech Recognition

  • SOTA performance on LibriSpeech
  • A novel way to combine CNN + Transformer
    • to model both local (CNN) and global (self-attention) dependencies
  • Conformer
    • 4 modules: 0.5 FFN + MHSA + CNN + 0.5 FFN
    • Macaron-style half-step residual FFN
    • placing CNN after MHSA is more effective
    • swish activation led to faster convergence
  • Transformers are good at modeling long-range global context, but less capable of extracting fine-grained local feature patterns.
  • CNNs exploit local information, learning shared position-based kernels over a local window.

A Conformer block consists of 4 modules:

  • FFN
  • multi-head self-attention (MHSA)
  • CNN
  • FFN


  • relative sinusoidal positional encoding
  • pre-norm residual
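
The two items above can be sketched in plain Python. This is only an illustration under assumptions: `relative_sinusoid`, `layer_norm`, and `prenorm_residual` are hypothetical helper names, the layer norm has no learnable gain or bias, dropout is omitted, and the interleaved sin/cos layout is one common convention rather than necessarily the one used in the paper.

```python
import math

def relative_sinusoid(offset, d_model):
    """Sinusoidal embedding of a relative offset (i - j); d_model must be even."""
    emb = []
    for k in range(0, d_model, 2):
        angle = offset / (10000 ** (k / d_model))
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def prenorm_residual(x, sublayer):
    """Pre-norm residual unit: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

Because the embedding depends only on the offset, position pairs (3, 1) and (7, 5) get the same vector, which is what makes the encoding relative.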


  • Gated linear unit (GLU)
  • Swish activation
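
For reference, a minimal sketch of the two activations named above. It assumes the common definitions: Swish with beta fixed to 1, i.e. x * sigmoid(x), and a GLU that splits its input in half and gates the first half with the sigmoid of the second. The function names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish (beta = 1): x * sigmoid(x)."""
    return x * sigmoid(x)

def glu(v):
    """Gated linear unit: first half of v gated by sigmoid of the second half."""
    half = len(v) // 2
    return [a * sigmoid(b) for a, b in zip(v[:half], v[half:])]
```

Note that the GLU halves the feature dimension, so the layer feeding it needs to produce twice as many channels as the layer after it expects.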


FFN module: similar to the feed-forward part of a Transformer block

Conformer block

Overall: two FFN modules with Macaron-style half-step residuals sandwich the MHSA and CNN modules.
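
The sandwich structure can be written down as a small sketch. The four modules are passed in as plain callables on a feature vector, so this shows only the residual wiring, not the modules themselves. `add` and `conformer_block` are hypothetical names, and the optional `final_norm` hook is an assumption of this sketch (identity by default).

```python
def add(x, y, scale=1.0):
    """Residual connection: x + scale * y, element-wise."""
    return [a + scale * b for a, b in zip(x, y)]

def conformer_block(x, ffn1, mhsa, conv, ffn2, final_norm=lambda h: h):
    x = add(x, ffn1(x), scale=0.5)  # first FFN, half-step residual
    x = add(x, mhsa(x))             # self-attention, full residual
    x = add(x, conv(x))             # convolution, placed after MHSA
    x = add(x, ffn2(x), scale=0.5)  # second FFN, half-step residual
    return final_norm(x)
```

With identity FFNs and zero-output MHSA and CNN stand-ins, an input of `[2.0]` becomes `[3.0]` after the first half-step and `[4.5]` after the second, which makes the 0.5 weighting easy to check by hand.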

The ablation study shows that two Macaron-net style FFNs with half-step residuals work much better than a single FFN in the Conformer block.


  • 80-dim F-Bank features
  • SpecAugment
  • single LSTM decoder
  • 3-layer LSTM LM with LS-960
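
As a rough illustration of the SpecAugment step on the filterbank features, here is a simplified masking sketch. It is an assumption-laden toy: it applies one frequency mask and one time mask, fills masked cells with zeros, and skips time warping. `spec_augment` and its parameters are hypothetical names, and the widths in the usage line are placeholders, not the settings used in the paper.

```python
import random

def spec_augment(spec, max_freq_width, max_time_width, rng):
    """spec: list of frames (time), each a list of filterbank values (freq)."""
    num_frames, num_bins = len(spec), len(spec[0])
    out = [frame[:] for frame in spec]  # copy so the input is left untouched

    # Frequency mask: zero a random band of bins in every frame.
    f = rng.randint(0, min(max_freq_width, num_bins))
    f0 = rng.randint(0, num_bins - f)
    for frame in out:
        for k in range(f0, f0 + f):
            frame[k] = 0.0

    # Time mask: zero a random run of consecutive frames.
    t = rng.randint(0, min(max_time_width, num_frames))
    t0 = rng.randint(0, num_frames - t)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * num_bins
    return out
```

A call such as `spec_augment(features, 10, 5, random.Random(0))` returns a masked copy with the same shape as `features`.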

Ablation studies

Conformer vs. Transformer

  • The CNN module is the most important component
  • Macaron-style FFN pair is also more effective than single FFN
  • Swish activation led to faster convergence

Combination of CNN and Transformer

  • replacing depthwise convolution with lightweight convolution sees a significant performance drop
  • placing the CNN before MHSA degrades the results by 0.1 WER
  • splitting the input into parallel CNN and MHSA branches and concatenating their outputs worsens the performance
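
To make the depthwise convolution in this comparison concrete, here is a minimal pure-Python sketch. It assumes a time x channels input, an odd kernel size, and zero "same" padding. `depthwise_conv1d` is a hypothetical name. The defining property is that each channel is filtered with its own kernel and channels are never mixed.

```python
def depthwise_conv1d(x, kernels):
    """x: time x channels; kernels: channels x kernel_size (odd)."""
    num_frames, num_channels = len(x), len(x[0])
    kernel_size = len(kernels[0])
    pad = kernel_size // 2
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            acc = 0.0
            for k in range(kernel_size):
                s = t + k - pad  # source frame for this kernel tap
                if 0 <= s < num_frames:  # positions outside the input count as zero
                    acc += kernels[c][k] * x[s][c]
            out[t][c] = acc
    return out
```

For example, with `x = [[1, 10], [2, 20], [3, 30]]` and `kernels = [[0, 1, 0], [1, 1, 1]]`, channel 0 passes through unchanged while channel 1 becomes a 3-frame moving sum, giving `[[1.0, 30.0], [2.0, 60.0], [3.0, 50.0]]`.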

Macaron FFN module