[Note] PERT: Pre-training BERT with permuted language model
Can we use a pre-training task other than MLM?
- https://arxiv.org/abs/2203.06906
- Proposed: Permuted Language Model (PerLM)
- Input: text in which a proportion of the tokens is permuted
- Target: the position of the original token
Pretraining LM tasks
- Masked LM
- Whole word masking (wwm):
- alleviates the "input information leaking" issue
- N-gram masking: mask consecutive N-grams
- e.g. SpanBERT, MacBERT
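A tiny illustration (mine, not from the paper) of the "input information leaking" issue that wwm addresses, assuming a WordPiece-style split of a rare word:

```python
# Illustration only: with subword-level masking, the unmasked pieces of a word
# leak what the masked piece must be; whole word masking hides the entire word.
tokens         = ["the", "phil", "##am", "##mon", "story"]      # WordPiece split of "philammon"
subword_masked = ["the", "phil", "[MASK]", "##mon", "story"]    # "phil"/"##mon" give the word away
wwm_masked     = ["the", "[MASK]", "[MASK]", "[MASK]", "story"] # the whole word is hidden
```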
Methods
- Use wwm + N-gram masking
- 40%, 30%, 20%, 10% ratios for uni-gram up to 4-gram spans
- Select 15% of the input words
- Shuffle the word order of the selected words (see the sketch after this list)
- 10% of them are left unchanged and treated as negative samples
- No [MASK] token is used.
- The prediction space is the input sequence, not the whole vocabulary
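A rough sketch of how a PerLM training example could be built from the description above. The function name, the span-selection loop, and the choice to shuffle globally over the selected positions (rather than within each span) are my assumptions, not the official implementation:

```python
import random

def build_perlm_example(words, select_ratio=0.15, keep_unchanged=0.10,
                        ngram_probs=((1, 0.4), (2, 0.3), (3, 0.2), (4, 0.1))):
    """Sketch of PerLM input construction (my reading of the paper, not the official code).

    Roughly 15% of the words are selected as uni- to 4-gram spans (40/30/20/10%),
    their order is shuffled in place (no [MASK] token), and each selected position
    gets, as its label, the position in the new input where its original word now sits.
    """
    words = list(words)
    n = len(words)
    budget, used, tries = max(1, int(n * select_ratio)), set(), 0

    # 1. Pick non-overlapping N-gram spans until ~15% of the words are covered.
    while budget > 0 and tries < 100:
        tries += 1
        length = min(budget, random.choices([g for g, _ in ngram_probs],
                                            [p for _, p in ngram_probs])[0])
        start = random.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in used for i in span):
            continue
        used.update(span)
        budget -= length

    selected = sorted(used)
    # 2. ~10% of the selected words stay unchanged (negative samples);
    #    their label is simply their own position.
    shuffle_pool = [i for i in selected if random.random() >= keep_unchanged]

    # 3. Shuffle the words at the remaining selected positions. (Whether the
    #    shuffle is local to each span or global over the selected positions is
    #    a detail I am not sure about; this sketch shuffles globally.)
    permuted = list(words)
    destinations = shuffle_pool[:]
    random.shuffle(destinations)
    for src, dst in zip(shuffle_pool, destinations):
        permuted[dst] = words[src]

    # 4. Label for each selected position = where its original word ended up.
    targets = {i: i for i in selected}
    for src, dst in zip(shuffle_pool, destinations):
        targets[src] = dst
    return permuted, targets
```

The returned `targets` map each selected position of the permuted input to the position where its original word now sits; this is the label format assumed by the prediction-head sketch further below.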
Given a pair of sentences $A$ and $B$, permute them into $A^\prime$ and $B^\prime$:
$$ X = \text{[CLS]}, A_1^\prime, \cdots, A_n^\prime, \text{[SEP]}, B_1^\prime ,\cdots B_m^\prime, \text{[SEP]} $$
$$ H = \textbf{PERT}(X) $$
$H^m$ denotes the candidate representations chosen in the previous stage, i.e. the $k = \lfloor N \times 15\% \rfloor$ selected positions:
$$ \begin{aligned} \tilde{H}^m &= \textbf{LN}(\textbf{Dropout}(\textbf{FFN}(H^m))) \\\ p_i &= \text{softmax}(\tilde{H}_i^m H^\top + b), \quad p_i \in \mathbb{R}^N \end{aligned} $$
$$ \mathcal{L} = -\frac{1}{M} \sum _{i=1}^M y_i \log p_i $$
where $M$ is the number of selected positions and $y_i$ is the one-hot ground-truth position.
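A minimal PyTorch sketch of the prediction head implied by the equations above. Module and argument names are hypothetical, and the FFN is reduced to a single linear layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerLMHead(nn.Module):
    """Minimal sketch of the PerLM prediction head in the equations above.
    Module/argument names are hypothetical; the FFN is a single linear layer here."""

    def __init__(self, hidden_size=768, max_len=512, dropout=0.1):
        super().__init__()
        self.ffn = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.bias = nn.Parameter(torch.zeros(max_len))  # position-wise bias b

    def forward(self, H, selected_positions, targets):
        """H: (N, d) output of PERT(X) for one sequence.
        selected_positions: (k,) indices of the permuted tokens.
        targets: (k,) gold positions in the input sequence."""
        H_m = H[selected_positions]                           # (k, d) candidate representations
        H_m = self.layer_norm(self.dropout(self.ffn(H_m)))    # \tilde{H}^m
        logits = H_m @ H.t() + self.bias[: H.size(0)]         # (k, N) scores over input positions
        loss = F.cross_entropy(logits, targets)               # softmax + NLL over positions
        return loss, logits.softmax(dim=-1)                   # p_i over the N positions
```

Because the prediction space is the input sequence itself, the output layer is just a dot product against $H$; there is no vocabulary-sized softmax matrix.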
Example
According to https://github.com/ymcui/PERT/issues/3:
$$ \begin{aligned} H &= [ H_我, H_欢, H_喜, H_吃, H_果, H_苹] \\\ \tilde{H}^m &= [\tilde{H}^m_欢, \tilde{H}^m_喜, \tilde{H}^m_果, \tilde{H}^m_苹] \\\ \tilde{H}^m H^\top &\in \mathbb{R}^{k \times N}, \quad y = [2, 1, 5, 4] \end{aligned}$$
Each entry of $y$ is the target position for $H_{欢}, H_{喜}, H_{果}, H_{苹}$ respectively, i.e. the position (0-indexed) where the original character of that slot now sits.
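To make the label construction concrete, a small check (my reading of the issue; it assumes no repeated characters) that recovers $y = [2, 1, 5, 4]$:

```python
# Recovering the labels from the example above: for each permuted position, the
# label is the index in the permuted input where the character that originally
# occupied that position now appears.
original = list("我喜欢吃苹果")   # "I like eating apples"
permuted = list("我欢喜吃果苹")   # 喜欢 and 苹果 each swapped internally

selected = [i for i, (o, p) in enumerate(zip(original, permuted)) if o != p]
y = [permuted.index(original[i]) for i in selected]
print(selected, y)   # [1, 2, 4, 5] [2, 1, 5, 4]
```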
Results on Chinese datasets
Machine Reading Comprehension
- moderate improvement over MacBERT, outperforms the others
- PERT learns both short-range and long-range text inference abilities.
Text Classification
- Does not perform well
- Conjecture: PerLM introduces difficulties in understanding short texts, compared with MRC tasks.
Named Entity Recognition
- Consistent improvement over all baselines.
Word order recovery
- 我每天一個吃蘋果 $\rightarrow$ 我O 每O 天O 一B 個I 吃E 蘋O 果O (permuted from 我每天吃一個蘋果, "I eat an apple every day"; B/I/E tag the permuted span)
- consistent and significant improvements over the other baselines.
Analysis
- PERT yields better performance on MRC and NER, but not on TC.
- wwm + N-gram training makes PERT more sensitive to word/phrase boundaries
- TC tasks are more sensitive to word permutation.
- TC: inputs are shorter, so a permuted word is more likely to change the meaning.
- MRC: inputs are longer, so some word permutations may not change the narrative flow.
- NER: likely unaffected, since named entities take up only a small proportion of the whole input text.
Note
- The idea came from the observation that "permuting several Chinese characters does not affect your reading that much". I think this phenomenon is more "visual" than linguistic. However, PerLM somehow performs well from other perspectives.
- The results are only a first glance at this model. They show that this pre-training task can yield both positive and negative results, but not why PerLM works or what makes it powerful. The authors said they will run more experiments in the future.
- PerLM may be more effective on sequence tagging tasks (clearer word/phrase boundaries?)
- It may be harmful for shorter sentences.