ALBERT: A Lite BERT for self-supervised learning of language representations

식피두 2021. 4. 14. 17:06

arxiv.org/abs/1909.11942

I suddenly got curious about ALBERT, so I quickly skimmed the paper.

ALBERT

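Applies parameter-reduction techniques to lower memory usage and speed up training
- Factorizes the embedding into smaller matrices to cut the parameter count (Factorized Embedding)
  - The strength of the representations from BERT-family models comes from their context-dependent representations.
  - The embedding itself is a context-independent representation, so shouldn't it be fine to shrink its dimension a bit?
  - Decomposing the original O(V x H) into O(V x E + E x H) reduces the parameter count! (V: vocabulary size, H: hidden size, E: embedding size, with E much smaller than H)
- Shares all parameters across layers, which cuts the parameter count substantially (Cross-Layer Parameter Sharing)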
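To get a feel for the savings, here is a rough parameter-counting sketch. The sizes (V = 30000, H = 768, E = 128, 12 layers) are illustrative assumptions on my part, and the per-layer formula is a simplified Transformer encoder layer (attention + feed-forward + two layer norms), not the paper's exact accounting. The function names are hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters (biases ignored)."""
    if embedding_size is None:
        # Plain V x H embedding table
        return vocab_size * hidden_size
    # Factorized: V x E table followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_layer_params(hidden_size, ffn_mult=4):
    """Approximate parameter count of one Transformer encoder layer."""
    # Q, K, V, and output projections, each H x H plus a bias
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    inner = ffn_mult * hidden_size
    # Two feed-forward matrices (H x inner, inner x H) plus their biases
    ffn = hidden_size * inner + inner + inner * hidden_size + hidden_size
    # Two layer norms, each with a scale and a shift vector
    layer_norms = 2 * 2 * hidden_size
    return attention + ffn + layer_norms


def encoder_params(num_layers, hidden_size, share_across_layers):
    """With full sharing, every layer reuses one set of weights."""
    per_layer = encoder_layer_params(hidden_size)
    return per_layer if share_across_layers else num_layers * per_layer


# Assumed sizes, for illustration only
V, H, E, L = 30000, 768, 128, 12
print(embedding_params(V, H))       # 23040000
print(embedding_params(V, H, E))    # 3938304
print(encoder_params(L, H, False))  # 85054464
print(encoder_params(L, H, True))   # 7087872
```

Under these assumptions, factorization shrinks the embedding to roughly a sixth of its size, and sharing divides the encoder's parameter count by the number of layers. Note that sharing reduces parameters, not compute: the forward pass still runs every layer.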
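Introduces a new loss to model inter-sentence coherence (the existing NSP is not good enough)
- Sentence-Order Prediction (SOP)
  - BERT is trained with MLM and NSP,
    - but some follow-up studies (Yang et al., 2019; Liu et al., 2019) reportedly concluded that NSP makes little difference.
  - This paper argues that NSP is not designed to be hard enough, which makes it ineffective.
    - Either way, the authors consider inter-sentence modeling important and propose SOP
  - Positive examples are the same as in BERT (two consecutive segments from the same document),
    - while negative examples are the same two segments with their order swapped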
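The SOP example construction described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name, the label convention (1 = original order, 0 = swapped), and the 50/50 split are my assumptions.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP example from two consecutive segments of a document.

    Returns (first, second, label), where label 1 means the segments are
    in their original order and label 0 means they were swapped.
    """
    if rng.random() < 0.5:
        # Positive example: keep the original order (same as BERT's positive)
        return segment_a, segment_b, 1
    # Negative example: same two segments, order swapped
    return segment_b, segment_a, 0


rng = random.Random(0)
first, second, label = make_sop_example(
    "ALBERT shares parameters across layers.",
    "This reduces the number of parameters.",
    rng,
)
print(label, "|", first, "|", second)
```

Because both segments always come from the same document, the model cannot solve the task by topic cues alone and has to pay attention to ordering.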
Based on the SentencePiece tokenizer
Uses n-gram masking for MLM training
- The length of each n-gram mask is chosen at random (maximum n = 3)
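A minimal sketch of n-gram masking follows. In the paper the span length n is drawn with probability proportional to 1/n, so shorter spans are preferred. Everything else here is an assumption for illustration: the function names, the 15% masking budget, and the simple rejection loop used to avoid overlapping spans.

```python
import random


def ngram_length_probs(max_n=3):
    """p(n) proportional to 1/n for n = 1..max_n."""
    weights = [1.0 / n for n in range(1, max_n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_ngram_mask(num_tokens, mask_ratio=0.15, max_n=3, rng=None):
    """Pick token positions to mask as non-overlapping n-gram spans."""
    rng = rng or random.Random()
    probs = ngram_length_probs(max_n)
    budget = max(1, round(num_tokens * mask_ratio))  # assumed masking budget
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * num_tokens:
        attempts += 1
        n = rng.choices(range(1, max_n + 1), weights=probs)[0]
        # Never exceed the remaining budget or the sequence length
        n = min(n, budget - len(masked), num_tokens)
        start = rng.randrange(0, num_tokens - n + 1)
        span = range(start, start + n)
        if any(i in masked for i in span):
            continue  # overlaps an existing span, so draw again
        masked.update(span)
    return sorted(masked)


print(ngram_length_probs(3))  # roughly [0.545, 0.273, 0.182]
print(sample_ngram_mask(20, rng=random.Random(0)))
```

With max n = 3 the length probabilities work out to 6/11, 3/11, and 2/11, so unigram masks are still the most common.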
No sign of overfitting even after training for 1M steps
- So the authors decided to remove dropout to increase model capacity further
- Szegedy et al., 2017 and Li et al., 2019 showed that combining BatchNorm and Dropout in ConvNets can hurt performance

Reference material

With Cross-Layer Parameter Sharing, the change between each layer's input and output is smooth
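One way to check this kind of smoothness is to compare each layer's input and output vectors by L2 distance and cosine similarity. The sketch below assumes you already have one hidden-state vector per layer as plain Python lists; the helper names and the toy data are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_transitions(hidden_states):
    """(L2 distance, cosine similarity) for each consecutive pair of layers."""
    return [
        (l2_distance(x, y), cosine_similarity(x, y))
        for x, y in zip(hidden_states, hidden_states[1:])
    ]


# Toy hidden states for three layers (made-up numbers)
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
for dist, cos in layer_transitions(states):
    print(round(dist, 3), round(cos, 3))
```

Plotting these two values against the layer index gives a curve whose shape shows how abruptly or smoothly the representation changes from layer to layer.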