gluon-nlp
Enable bias correction in AdamW when fine-tuning BERT
This should improve fine-tuning stability; see:
Mosbach, Marius, Maksym Andriushchenko, and Dietrich Klakow. "On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines." arXiv preprint arXiv:2006.04884 (2020).
Zhang, Tianyi, et al. "Revisiting Few-sample BERT Fine-tuning." arXiv preprint arXiv:2006.05987 (2020).
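For reference, bias correction rescales Adam's running moment estimates to undo their zero initialization. A minimal NumPy sketch of a single AdamW step with the correction toggled (illustrative only: the hyperparameter defaults and decoupled weight-decay form are assumptions, not GluonNLP's exact optimizer code):

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-6, wd=0.01, correct_bias=True):
    """One AdamW update; `correct_bias` is the behavior this PR enables.

    `t` is the 1-based step count; `m`/`v` are the running moment estimates.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    if correct_bias:
        # Undo the bias toward zero that comes from initializing m and v at
        # zero. Skipping this inflates the effective step size during the
        # first few hundred updates, which the papers above connect to
        # unstable fine-tuning runs.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * param)
    return param, m, v
```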
Let's try to rerun the training with the batch script here: https://github.com/dmlc/gluon-nlp/tree/master/tools/batch#squad-training
Basically, we just need to run the following two commands, one for SQuAD 2.0 and one for SQuAD 1.1:
# AWS Batch training with horovod on SQuAD 2.0 + FP16
bash question_answering/run_batch_squad.sh 1 2.0 submit_squad_v2_horovod_fp16.log float16
# AWS Batch training with horovod on SQuAD 1.1 + FP16
bash question_answering/run_batch_squad.sh 1 1.1 submit_squad_v1_horovod_fp16.log float16
Codecov Report
Merging #1468 (52ce2a4) into master (def0d70) will decrease coverage by 0.01%. The diff coverage is n/a.
@@            Coverage Diff             @@
##           master    #1468      +/-   ##
==========================================
- Coverage   85.86%   85.84%   -0.02%
==========================================
  Files          52       52
  Lines        6911     6911
==========================================
- Hits         5934     5933       -1
- Misses        977      978       +1
Impacted Files | Coverage Δ | |
---|---|---|
src/gluonnlp/data/tokenizers/yttm.py | 81.89% <0.00%> (-0.87%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1468/bertbiascorrection/index.html
test_squad2_albert_base 8903644b-13e1-4aa4-b695-e7b5f2c50c7d
test_squad2_albert_large aac428ac-4e25-48e8-8f3e-2643cbb6b95e
test_squad2_albert_xlarge bb565663-8173-45aa-9489-2dd690fd24c4
test_squad2_albert_xxlarge 38d9929c-fea2-4648-bc68-0bd4eb491ee8
test_squad2_electra_base 0eb9090a-d86b-40a6-9f1c-61e1cf034b59
test_squad2_electra_large 43fabf48-b524-499f-9d8a-2113349dcf74
test_squad2_electra_small 5c631945-ad26-4c2f-a7d3-bb8c705023a2
test_squad2_roberta_large 96d1e46f-b292-4915-a867-c724bb082585
test_squad2_uncased_bert_base 8228dd4c-27d3-4118-b682-06332db980f2
test_squad2_uncased_bert_large 22a91f7c-707e-4adf-a3d9-71286a3e165e
test_squad2_gluon_en_cased_bert_base_v1 13d38ddd-4ab6-4e60-8cae-1400d3169d4c
test_squad2_mobilebert 5377ebdc-da03-4e4e-8546-43e83643d1c0
test_squad2_albert_base c71abbd1-9ddb-465a-83a8-a257994a47a4
test_squad2_albert_large 55a10c2f-b51e-4722-b8fe-d0154ccf1124
test_squad2_albert_xlarge d3b1e954-b22e-4b30-bc3a-db3303d8de85
test_squad2_albert_xxlarge 9d8c599c-ecf2-4815-ac3c-cc853c75cddd
test_squad2_electra_base 9c10fca5-0ac6-4ec8-91ce-ebf2e0593513
test_squad2_electra_large d844645c-d56b-4549-805e-a3558d777e75
test_squad2_electra_small 8b17bb3f-ee8e-4212-92d7-59155f0c54ef
test_squad2_roberta_large e9972888-ae53-41e0-9b8f-1db8359e68c9
test_squad2_uncased_bert_base 083c431c-6e02-4a67-ab92-1e84a450df52
test_squad2_uncased_bert_large 24d40d9e-06fd-4158-90a3-1ee5da7183c1
test_squad2_gluon_en_cased_bert_base_v1 6b2c015b-5829-40b6-9435-718d3ecf46de
test_squad2_mobilebert 08e7618c-7e19-4db2-9451-09f65729272e
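If you want to check on the submitted jobs before the logs are synced, a quick status poll with boto3 should work (a sketch assuming AWS credentials/region are configured for the account that submitted the jobs; the IDs are copied from the lists above):

```python
import boto3

batch = boto3.client("batch")

job_ids = [
    "8903644b-13e1-4aa4-b695-e7b5f2c50c7d",  # test_squad2_albert_base
    "aac428ac-4e25-48e8-8f3e-2643cbb6b95e",  # test_squad2_albert_large
    # ... add the remaining IDs from the lists above
    # (describe_jobs accepts up to 100 IDs per call)
]

response = batch.describe_jobs(jobs=job_ids)
for job in response["jobs"]:
    print(f"{job['jobName']:<45} {job['status']}")
```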
Yes, you can later use the following script to sync up the results.
bash question_answering/sync_batch_result.sh submit_squad_v2_horovod_fp16.log squad_v2_horovod_fp16
bash question_answering/sync_batch_result.sh submit_squad_v1_horovod_fp16.log squad_v1_horovod_fp16
After all of the results (or part of them) have finished, you can parse the logs via
python3 question_answering/parse_squad_results.py --dir squad_v2_horovod_fp16
% python3 question_answering/parse_squad_results.py --dir squad_v2_horovod_fp16
name best_f1 best_em best_f1_thresh best_em_thresh time_spent_in_hours
0 albert_base 81.861255 79.112272 -1.671970 -1.742718 1.139900
1 albert_large 84.904438 81.900109 -1.086745 -1.086745 3.423180
2 albert_xlarge 88.032327 85.134338 -1.625434 -1.625434 5.967083
3 albert_xxlarge 90.085053 87.155731 -2.226489 -2.226489 11.294118
4 electra_base 86.282903 83.643561 -1.848169 -2.301743 1.250153
5 electra_large 90.871907 88.461215 -1.347744 -1.347744 3.140608
6 electra_small 73.878219 71.481513 -1.548537 -1.548537 0.383728
7 gluon_en_cased_bert_base_v1 77.620289 74.757854 -1.731051 -1.731051 1.595762
8 mobilebert NaN NaN NaN NaN NaN
9 roberta_large 89.239196 86.431399 -2.168329 -2.168329 4.119268
10 uncased_bert_base 75.539014 72.702771 -1.595349 -1.850638 1.540320
11 uncased_bert_large 81.322878 78.177377 -2.056313 -2.056739 4.103469
Saving to squad_v2_horovod_fp16.csv
% python3 question_answering/parse_squad_results.py --dir squad_v1_horovod_fp16
name best_f1 best_em best_f1_thresh best_em_thresh time_spent_in_hours
0 albert_base 90.605130 83.964049 NaN NaN 0.745851
1 albert_large 92.574139 86.385998 NaN NaN 2.319241
2 albert_xlarge 93.836504 87.984863 NaN NaN 4.367765
3 albert_xxlarge 94.569074 88.448439 NaN NaN 7.321531
4 electra_base 92.483534 86.821192 NaN NaN 0.882092
5 electra_large 94.824761 89.631031 NaN NaN 2.216832
6 electra_small 85.263124 78.893094 NaN NaN 0.267190
7 gluon_en_cased_bert_base_v1 88.685434 81.986755 NaN NaN 1.077892
8 mobilebert NaN NaN NaN NaN NaN
9 roberta_large 94.665818 89.101230 NaN NaN 2.790591
10 uncased_bert_base 88.103126 81.201514 NaN NaN 0.979201
11 uncased_bert_large 90.691656 83.945128 NaN NaN 2.756076
Saving to squad_v1_horovod_fp16.csv
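Once both CSVs have been written, a quick way to put the SQuAD 1.1 and 2.0 numbers side by side (a sketch assuming pandas and the file names printed above):

```python
import pandas as pd

# Load the two summaries written by parse_squad_results.py.
v2 = pd.read_csv("squad_v2_horovod_fp16.csv")
v1 = pd.read_csv("squad_v1_horovod_fp16.csv")

# Merge on model name and keep only F1/EM for a compact comparison.
cols = ["name", "best_f1", "best_em"]
merged = v1[cols].merge(v2[cols], on="name", suffixes=("_v1.1", "_v2.0"))
print(merged.to_string(index=False))
```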
Is there any known issue with MobileBERT?
Looks like an AMP issue, or an operator issue that causes AMP to keep decreasing the loss scale: finetune_squad2.0.log
Yes.
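For context on the failure mode above: with dynamic loss scaling, AMP skips the update and halves the loss scale whenever it sees non-finite gradients, so an operator that produces NaN/Inf on every step drives the scale toward zero. A minimal illustrative sketch of that bookkeeping (not MXNet AMP's actual implementation; the class and defaults are made up for illustration):

```python
import numpy as np

class DynamicLossScaler:
    """Illustrative dynamic loss scaling; not MXNet AMP's actual code."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, grads):
        """Return unscaled grads, or None if the update must be skipped."""
        if not all(np.all(np.isfinite(g)) for g in grads):
            # Overflow detected: skip this step and halve the scale. If some
            # operator produces NaN/Inf on every step, the scale keeps
            # shrinking toward zero, which is the symptom in the log above.
            self._good_steps = 0
            self.scale /= 2
            return None
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self._good_steps = 0
            self.scale *= 2
        return [g / self.scale for g in grads]
```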
From the figure, the performance looks similar. If we decide to update the flags, we can upload the pretrained weights to S3 and also update the numbers in https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering.