Token Dropping: 25% reduction in BERT pretraining time with no performance loss

Google, NYU & Maryland U Token-Dropping Method Reduces BERT Pretraining Time by 25%

Pretraining large language models of the BERT type — which can scale to billions of parameters — is essential for achieving state-of-the-art performance on many NLP tasks. The pretraining process is costly, however, and has become a bottleneck to the industrial application of large language models.

In the paper Token Dropping for Efficient BERT Pretraining, a research team from Google, New York University and the University of Maryland proposes a simple but effective "token-dropping" technique that reduces the cost of pretraining transformer models such as BERT without compromising performance on downstream fine-tuning tasks.
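The core idea can be illustrated with a minimal sketch: tokens judged less important are removed from the sequence before the middle encoder layers (so those layers process a shorter sequence), and the full sequence is restored before the final layers. The importance scores, the `keep_ratio` parameter, and the function names below are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def drop_tokens(hidden, scores, keep_ratio=0.5):
    """Keep only the highest-scoring tokens for the middle layers.

    hidden: [seq_len, dim] array of token hidden states.
    scores: [seq_len] hypothetical per-token importance scores
            (e.g. derived from the masked-language-model loss).
    """
    seq_len = hidden.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    # Indices of the top-k tokens, re-sorted to preserve sequence order.
    keep_idx = np.sort(np.argsort(scores)[-k:])
    return hidden[keep_idx], keep_idx

def restore_tokens(full_hidden, processed, keep_idx):
    """Scatter the processed tokens back into the full sequence.

    Dropped tokens bypass the middle layers and keep their earlier states.
    """
    out = full_hidden.copy()
    out[keep_idx] = processed
    return out
```

Because the self-attention cost of each middle layer grows with sequence length, processing only the kept tokens there is where the pretraining-time savings come from.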

The team summarizes their main contributions as follows:

