gluon-nlp
Enable bias correction in AdamW when fine-tuning BERT
This should improve fine-tuning stability; see:
Mosbach, Marius, Maksym Andriushchenko, and Dietrich Klakow. "On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines." arXiv preprint arXiv:2006.04884 (2020).
Zhang, Tianyi, et al. "Revisiting Few-sample BERT Fine-tuning." arXiv preprint arXiv:2006.05987 (2020).
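For reference, bias correction rescales Adam's running moment estimates to undo their zero initialization. A minimal NumPy sketch of a single AdamW step with the correction toggled (illustrative only: the hyperparameter defaults and decoupled weight-decay form are assumptions, not GluonNLP's exact optimizer code):

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-6, wd=0.01, correct_bias=True):
    """One AdamW update; `correct_bias` is the behavior this PR enables.

    `t` is the 1-based step count; `m`/`v` are the running moment estimates.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    if correct_bias:
        # Undo the bias toward zero that comes from initializing m and v at
        # zero. Skipping this inflates the effective step size during the
        # first few hundred updates, which the papers above connect to
        # unstable fine-tuning runs.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * param)
    return param, m, v
```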
Let's try to rerun the training with the batch script here: https://github.com/dmlc/gluon-nlp/tree/master/tools/batch#squad-training
Basically, we just need to run the following two commands, one for SQuAD 2.0 and one for SQuAD 1.1:
# AWS Batch training with horovod on SQuAD 2.0 + FP16
bash question_answering/run_batch_squad.sh 1 2.0 submit_squad_v2_horovod_fp16.log float16
# AWS Batch training with horovod on SQuAD 1.1 + FP16
bash question_answering/run_batch_squad.sh 1 1.1 submit_squad_v1_horovod_fp16.log float16
Codecov Report
Merging #1468 (52ce2a4) into master (def0d70) will decrease coverage by 0.01%. The diff coverage is n/a.
@@            Coverage Diff             @@
##           master    #1468      +/-   ##
==========================================
- Coverage   85.86%   85.84%   -0.02%
==========================================
  Files          52       52
  Lines        6911     6911
==========================================
- Hits         5934     5933       -1
- Misses        977      978       +1
Impacted Files | Coverage Δ | |
---|---|---|
src/gluonnlp/data/tokenizers/yttm.py | 81.89% <0.00%> (-0.87%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1468/bertbiascorrection/index.html
test_squad2_albert_base 8903644b-13e1-4aa4-b695-e7b5f2c50c7d
test_squad2_albert_large aac428ac-4e25-48e8-8f3e-2643cbb6b95e
test_squad2_albert_xlarge bb565663-8173-45aa-9489-2dd690fd24c4
test_squad2_albert_xxlarge 38d9929c-fea2-4648-bc68-0bd4eb491ee8
test_squad2_electra_base 0eb9090a-d86b-40a6-9f1c-61e1cf034b59
test_squad2_electra_large 43fabf48-b524-499f-9d8a-2113349dcf74
test_squad2_electra_small 5c631945-ad26-4c2f-a7d3-bb8c705023a2
test_squad2_roberta_large 96d1e46f-b292-4915-a867-c724bb082585
test_squad2_uncased_bert_base 8228dd4c-27d3-4118-b682-06332db980f2
test_squad2_uncased_bert_large 22a91f7c-707e-4adf-a3d9-71286a3e165e
test_squad2_gluon_en_cased_bert_base_v1 13d38ddd-4ab6-4e60-8cae-1400d3169d4c
test_squad2_mobilebert 5377ebdc-da03-4e4e-8546-43e83643d1c0
test_squad2_albert_base c71abbd1-9ddb-465a-83a8-a257994a47a4
test_squad2_albert_large 55a10c2f-b51e-4722-b8fe-d0154ccf1124
test_squad2_albert_xlarge d3b1e954-b22e-4b30-bc3a-db3303d8de85
test_squad2_albert_xxlarge 9d8c599c-ecf2-4815-ac3c-cc853c75cddd
test_squad2_electra_base 9c10fca5-0ac6-4ec8-91ce-ebf2e0593513
test_squad2_electra_large d844645c-d56b-4549-805e-a3558d777e75
test_squad2_electra_small 8b17bb3f-ee8e-4212-92d7-59155f0c54ef
test_squad2_roberta_large e9972888-ae53-41e0-9b8f-1db8359e68c9
test_squad2_uncased_bert_base 083c431c-6e02-4a67-ab92-1e84a450df52
test_squad2_uncased_bert_large 24d40d9e-06fd-4158-90a3-1ee5da7183c1
test_squad2_gluon_en_cased_bert_base_v1 6b2c015b-5829-40b6-9435-718d3ecf46de
test_squad2_mobilebert 08e7618c-7e19-4db2-9451-09f65729272e
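If you want to check on the submitted jobs before the logs are synced, a quick status poll with boto3 should work (a sketch assuming AWS credentials/region are configured for the account that submitted the jobs; the IDs are copied from the lists above):

```python
import boto3

batch = boto3.client("batch")

job_ids = [
    "8903644b-13e1-4aa4-b695-e7b5f2c50c7d",  # test_squad2_albert_base
    "aac428ac-4e25-48e8-8f3e-2643cbb6b95e",  # test_squad2_albert_large
    # ... add the remaining IDs from the lists above
    # (describe_jobs accepts up to 100 IDs per call)
]

response = batch.describe_jobs(jobs=job_ids)
for job in response["jobs"]:
    print(f"{job['jobName']:<45} {job['status']}")
```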
Yes, you can later use the following script to sync up the results.
bash question_answering/sync_batch_result.sh submit_squad_v2_horovod_fp16.log squad_v2_horovod_fp16
bash question_answering/sync_batch_result.sh submit_squad_v1_horovod_fp16.log squad_v1_horovod_fp16
After all of the results (or part of them) have finished, you can parse the logs via
python3 question_answering/parse_squad_results.py --dir squad_v2_horovod_fp16
% python3 question_answering/parse_squad_results.py --dir squad_v2_horovod_fp16
name best_f1 best_em best_f1_thresh best_em_thresh time_spent_in_hours
0 albert_base 81.861255 79.112272 -1.671970 -1.742718 1.139900
1 albert_large 84.904438 81.900109 -1.086745 -1.086745 3.423180
2 albert_xlarge 88.032327 85.134338 -1.625434 -1.625434 5.967083
3 albert_xxlarge 90.085053 87.155731 -2.226489 -2.226489 11.294118
4 electra_base 86.282903 83.643561 -1.848169 -2.301743 1.250153
5 electra_large 90.871907 88.461215 -1.347744 -1.347744 3.140608
6 electra_small 73.878219 71.481513 -1.548537 -1.548537 0.383728
7 gluon_en_cased_bert_base_v1 77.620289 74.757854 -1.731051 -1.731051 1.595762
8 mobilebert NaN NaN NaN NaN NaN
9 roberta_large 89.239196 86.431399 -2.168329 -2.168329 4.119268
10 uncased_bert_base 75.539014 72.702771 -1.595349 -1.850638 1.540320
11 uncased_bert_large 81.322878 78.177377 -2.056313 -2.056739 4.103469
Saving to squad_v2_horovod_fp16.csv
% python3 question_answering/parse_squad_results.py --dir squad_v1_horovod_fp16
name best_f1 best_em best_f1_thresh best_em_thresh time_spent_in_hours
0 albert_base 90.605130 83.964049 NaN NaN 0.745851
1 albert_large 92.574139 86.385998 NaN NaN 2.319241
2 albert_xlarge 93.836504 87.984863 NaN NaN 4.367765
3 albert_xxlarge 94.569074 88.448439 NaN NaN 7.321531
4 electra_base 92.483534 86.821192 NaN NaN 0.882092
5 electra_large 94.824761 89.631031 NaN NaN 2.216832
6 electra_small 85.263124 78.893094 NaN NaN 0.267190
7 gluon_en_cased_bert_base_v1 88.685434 81.986755 NaN NaN 1.077892
8 mobilebert NaN NaN NaN NaN NaN
9 roberta_large 94.665818 89.101230 NaN NaN 2.790591
10 uncased_bert_base 88.103126 81.201514 NaN NaN 0.979201
11 uncased_bert_large 90.691656 83.945128 NaN NaN 2.756076
Saving to squad_v1_horovod_fp16.csv
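Once both CSVs have been written, a quick way to put the SQuAD 1.1 and 2.0 numbers side by side (a sketch assuming pandas and the file names printed above):

```python
import pandas as pd

# Load the two summaries written by parse_squad_results.py.
v2 = pd.read_csv("squad_v2_horovod_fp16.csv")
v1 = pd.read_csv("squad_v1_horovod_fp16.csv")

# Merge on model name and keep only F1/EM for a compact comparison.
cols = ["name", "best_f1", "best_em"]
merged = v1[cols].merge(v2[cols], on="name", suffixes=("_v1.1", "_v2.0"))
print(merged.to_string(index=False))
```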
Is there any known issue with MobileBERT?
Looks like an AMP issue, or an operator issue that causes AMP to keep decreasing the loss scale: finetune_squad2.0.log
Yes.
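For context on the failure mode above: with dynamic loss scaling, AMP skips the update and halves the loss scale whenever it sees non-finite gradients, so an operator that produces NaN/Inf on every step drives the scale toward zero. A minimal illustrative sketch of that bookkeeping (not MXNet AMP's actual implementation; the class and defaults are made up for illustration):

```python
import numpy as np

class DynamicLossScaler:
    """Illustrative dynamic loss scaling; not MXNet AMP's actual code."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, grads):
        """Return unscaled grads, or None if the update must be skipped."""
        if not all(np.all(np.isfinite(g)) for g in grads):
            # Overflow detected: skip this step and halve the scale. If some
            # operator produces NaN/Inf on every step, the scale keeps
            # shrinking toward zero, which is the symptom in the log above.
            self._good_steps = 0
            self.scale /= 2
            return None
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self._good_steps = 0
            self.scale *= 2
        return [g / self.scale for g in grads]
```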
From the figure, the performance looks similar. If we decide to update the flags, we can upload the pretrained weights to S3 and also update the numbers in https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering.