[Note] PERT: Pre-training BERT with permuted language model

Can we use a pre-training task other than MLM?

Pretraining LM tasks

  • Masked LM (MLM)
  • Whole word masking (wwm): alleviates the "input information leaking" issue
  • N-gram masking: masks consecutive N-grams, e.g. SpanBERT, MacBERT

Methods

  • Uses wwm + N-gram masking
  • Uni-grams to 4-grams are chosen with probabilities 40%, 30%, 20%, 10%
  • 15% of the input words are selected for permutation (see the sketch after this list)
  • The positions of the selected words are shuffled
  • 10% of the selected words are left unchanged, serving as negative samples
  • No [MASK] token is used
  • The prediction space is the input sequence, not the whole vocabulary
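
A minimal sketch of how a PerLM training example could be assembled from the bullets above. This is my own illustration, not the authors' preprocessing code: the span-length probabilities, the 15% budget, and the 10% keep ratio come from the paper, while the function and variable names (and the handling of duplicate tokens) are assumptions.

```python
import random

# Span-length distribution for wwm + N-gram masking: uni-gram .. 4-gram.
NGRAM_LENGTHS = [1, 2, 3, 4]
NGRAM_WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def build_perlm_example(tokens, budget=0.15, keep_ratio=0.10, rng=random):
    """Return (permuted_tokens, selected_positions, labels).

    labels[j] is the original position of the token that ends up at
    selected_positions[j]; for the ~10% tokens kept unchanged (the
    "negative samples" above) the label is simply its own position.
    Illustrative sketch only, assuming unique tokens for clarity.
    """
    n = len(tokens)
    k = max(1, int(n * budget))

    # 1) Pick ~k positions by sampling random N-gram spans.
    selected = []
    while len(selected) < k:
        length = rng.choices(NGRAM_LENGTHS, weights=NGRAM_WEIGHTS)[0]
        start = rng.randrange(n)
        for pos in range(start, min(start + length, n)):
            if pos not in selected:
                selected.append(pos)
            if len(selected) == k:
                break

    # 2) Shuffle ~90% of the selected tokens among themselves; keep ~10% as-is.
    to_shuffle = [p for p in selected if rng.random() >= keep_ratio]
    sources = list(to_shuffle)
    rng.shuffle(sources)

    permuted = list(tokens)
    for dst, src in zip(to_shuffle, sources):
        permuted[dst] = tokens[src]

    # 3) Label = where the token now at each selected position originally lived.
    labels = [tokens.index(permuted[p]) for p in selected]
    return permuted, selected, labels
```

Note that, unlike MLM, no [MASK] token is ever inserted: the model only sees real tokens in a scrambled order.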

Given a pair of sentences $A$ and $B$, mask (permute) them into $A^\prime$ and $B^\prime$:

$$ X = \text{[CLS]}, A_1^\prime, \cdots, A_n^\prime, \text{[SEP]}, B_1^\prime, \cdots, B_m^\prime, \text{[SEP]} $$

$$ H = \textbf{PERT}(X) $$

$\tilde{H}^m$ holds the candidate representations of the positions chosen in the previous stage, where $k = \lfloor N \times 15\% \rfloor$ is the number of chosen positions.

$$ \begin{aligned} \tilde{H}^m &= \textbf{LN}(\textbf{Dropout}(\textbf{FFN}(H^m))) \\\ p_i &= \text{softmax}(\tilde{H}_i^m H^\top + b), \quad p_i \in \mathbb{R}^N \end{aligned} $$

$$ \mathcal{L} = -\frac{1}{M} \sum _{i=1}^M y_i \log p_i $$
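
A rough PyTorch sketch of the head and loss described by the equations above. Only the FFN, Dropout, LN composition, the dot product with $H$, and the softmax over the $N$ input positions follow the formulas; the class name `PerLMHead`, the scalar bias, and the single-linear FFN are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerLMHead(nn.Module):
    """Predicts, for each selected position, the original position of its token
    in the input sequence (softmax over the N positions, not the vocabulary)."""

    def __init__(self, hidden_size, dropout=0.1):
        super().__init__()
        self.ffn = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(hidden_size)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, H, selected, labels):
        # H:        (batch, N, hidden)  encoder outputs, H = PERT(X)
        # selected: (batch, k)          indices of the permuted positions
        # labels:   (batch, k)          original position of each selected token
        H_m = torch.gather(H, 1, selected.unsqueeze(-1).expand(-1, -1, H.size(-1)))
        H_m_tilde = self.ln(self.dropout(self.ffn(H_m)))                 # (batch, k, hidden)
        logits = torch.matmul(H_m_tilde, H.transpose(1, 2)) + self.bias  # (batch, k, N)
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        return loss, logits
```

Here `labels` plays the role of $y_i$, and the cross-entropy is $\mathcal{L}$ averaged over all selected positions in the batch.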

Example

According to https://github.com/ymcui/PERT/issues/3:

$$ \begin{aligned} H &= [H_我, H_欢, H_喜, H_吃, H_果, H_苹] \\\ \tilde{H}^m &= [\tilde{H}^m_欢, \tilde{H}^m_喜, \tilde{H}^m_果, \tilde{H}^m_苹] \\\ \tilde{H}^m H^\top &\in \mathbb{R}^{k \times N}, \quad y = [2, 1, 5, 4] \end{aligned} $$

Each $y_i$ is the position of the corresponding shuffled token ($欢$, $喜$, $果$, $苹$) in the original sentence 我喜欢吃苹果, i.e. the position it should be restored to.
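
For concreteness, a tiny check of where those labels come from, assuming (as the example implies) that the unshuffled sentence is 我喜欢吃苹果 and that each label is a position in the original sequence:

```python
original = list("我喜欢吃苹果")   # unshuffled sentence
shuffled = list("我欢喜吃果苹")   # model input after PerLM permutation
selected = [1, 2, 4, 5]           # positions of 欢, 喜, 果, 苹 in the input

# Each target is where the selected token sits in the original sentence
# (characters are unique here, so index() is unambiguous).
labels = [original.index(shuffled[p]) for p in selected]
print(labels)                     # -> [2, 1, 5, 4]
```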

Results on Chinese datasets

Machine Reading Comprehension

  • Moderate improvement over MacBERT; outperforms the other baselines
  • PERT learns both short-range and long-range text inference abilities

Text Classification

  • Does not perform well
  • Conjecture: compared to MRC tasks, PerLM makes it harder to understand short texts

Named Entity Recognition

  • Consistent improvement over all baselines.

Word order recovery

  • 我每天一個吃蘋果 (shuffled from 我每天吃一個蘋果, "I eat an apple every day") $\rightarrow$ 我O 每O 天O 一B 個I 吃E 蘋O 果O (see the tagging sketch below)
  • Consistent and significant improvements over the other baselines
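
A small sketch of how such tags could be produced. The semantics here are my reading of the example above (O = character still in its original position; B/I/E = a maximal run of displaced characters), not a documented scheme from the paper:

```python
def tag_displaced_spans(original, shuffled):
    """Tag each position O if the character is unchanged, otherwise B/I/E
    over each maximal run of displaced positions (assumed semantics)."""
    n = len(shuffled)
    displaced = [shuffled[i] != original[i] for i in range(n)]
    tags = ["O"] * n
    i = 0
    while i < n:
        if not displaced[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and displaced[j + 1]:
            j += 1
        tags[i] = "B"            # start of the displaced run
        for m in range(i + 1, j):
            tags[m] = "I"        # interior of the run
        if j > i:
            tags[j] = "E"        # end of the run (a lone displaced character
                                 # keeps only B here; an assumption)
        i = j + 1
    return tags

print(tag_displaced_spans("我每天吃一個蘋果", "我每天一個吃蘋果"))
# -> ['O', 'O', 'O', 'B', 'I', 'E', 'O', 'O']
```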

Analysis

  • PERT yields better performance on MRC and NER, but not on TC.
  • wwm + N-gram masking makes PERT more sensitive to word/phrase boundaries.
  • TC tasks are more sensitive to word permutation.
  • TC: inputs are shorter, so a permuted word is more likely to change the meaning.
  • MRC: inputs are longer, so some word permutations may not change the narrative flow.
  • NER: largely unaffected, since named entities take up only a small proportion of the whole input text.

Note

  • The idea came from the observation that "permuting several Chinese characters does not affect your reading that much". I think this phenomenon is more visual than linguistic; nevertheless, PerLM somehow performs well from other perspectives.
  • The results only give a first glance at this model: the pre-training task yields both positive and negative results, but the paper has not shown why PerLM works or why it is powerful. The authors say they will do more experiments in the future.
  • PerLM may be more effective on sequence tagging tasks (clearer word/phrase boundaries?).
  • It may be harmful on shorter sentences.