[Note] PERT: Pre-training BERT with permuted language model
Can we use a pre-training task other than MLM?
* https://arxiv.org/abs/2203.06906
* Proposed: Permuted Language Model (PerLM)
  * Input: permute a proportion of the input text
  * Target: position of the original token (see the sketch at the end of this note)

Pre-training LM tasks
* Masked LM
  * Whole word masking (wwm):
    * alleviates the "input information leaking" issue
  * Mask consecutive N-grams
    * e.g.…
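A minimal Python sketch of how a PerLM training pair could be built, assuming the objective is: shuffle a small fraction of token positions and, for each disturbed position, predict where its original token now sits in the permuted input. The function name `make_perlm_example`, the `permute_ratio` value, and the exact target encoding are illustrative assumptions, not the paper's implementation.

```python
import random

def make_perlm_example(tokens, permute_ratio=0.15, seed=None):
    """Toy PerLM data construction (illustrative, not the paper's code)."""
    rng = random.Random(seed)
    n = len(tokens)
    if n < 2:
        return list(tokens), {}                  # nothing to permute
    k = min(n, max(2, int(n * permute_ratio)))   # how many positions to disturb
    chosen = sorted(rng.sample(range(n), k))     # positions that get permuted
    shuffled = chosen[:]
    while shuffled == chosen:                    # make sure the order actually changes
        rng.shuffle(shuffled)

    permuted = list(tokens)
    for src, dst in zip(chosen, shuffled):
        permuted[dst] = tokens[src]              # token from position src is placed at dst

    # Target: for each disturbed position, the index in the permuted sequence
    # where its original token now sits ("position of the original token").
    targets = {src: dst for src, dst in zip(chosen, shuffled)}
    return permuted, targets

tokens = "we like to play basketball after school".split()
permuted, targets = make_perlm_example(tokens, seed=0)
print(permuted)   # input sequence with a few tokens swapped around
print(targets)    # original position -> where that token ended up
```

In a real setup the prediction head would output a distribution over sequence positions for each disturbed slot, rather than returning a dictionary; the dictionary here just stands in for those labels.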