
The fine-tuned T5 model on clang8

Open ZHANG45 opened this issue 2 years ago • 24 comments

Hi. I am interested in the grammatical error correction task and tried to reproduce your results, but I only got an F0.5 score of 64 on CoNLL-2014 when fine-tuning the T5-large model on cLang-8.

So I think there must be a problem with my setup, and I am wondering whether you could share the fine-tuned T5 model reported in the paper, or the details of your fine-tuning setup, such as the batch size and number of training steps.

ZHANG45 avatar Aug 29 '21 00:08 ZHANG45

Hi, currently, we're not planning to release the model checkpoints, which would require an approval process for us.

Regarding the hyperparameter settings: I'd need to double check with my co-author who is currently OOO, but I think that for T5-large we used

tokens_per_batch: 1048576 
finetune_steps: 2000

Hope this helps.

ekQ avatar Aug 30 '21 09:08 ekQ

@ekQ Thank you for your reply! I used the settings you suggested and now get at most a 65.16 F0.5 score.

I am wondering whether there is still a problem with my setup or whether it is just variance due to the random seed. If possible, could you help me check it?

I used PyTorch and the T5 models provided by Hugging Face (maybe this caused the problem?). I used the original input instead of the processed input with a task prefix, as in the following example:

Original input:
    Hypothesis: The St. Louis Cardinals have always won.
    Premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but
Processed input: mnli hypothesis: The St. Louis Cardinals have always won. premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but

And this is my training log. Due to the memory limitation of my GPU, I set max_tokens=4096 and accumulate gradients over 32 batches before each parameter update (maybe this causes the problem?). A minimal sketch of this accumulation scheme follows, and then the log itself.
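The sketch uses placeholder model, optimizer, and dataloader objects and is not the actual fairseq training loop; it only illustrates updating once every 32 mini-batches.

def train_with_accumulation(model, optimizer, dataloader, update_freq=32):
    """Accumulate gradients over `update_freq` mini-batches before each optimizer step."""
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        # scale the loss so the accumulated gradient matches one large batch
        loss = model(**batch).loss / update_freq
        loss.backward()
        if (step + 1) % update_freq == 0:
            optimizer.step()
            optimizer.zero_grad()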

Namespace(adam_betas='(0.9,0.98)', adam_eps=1e-06, arch='hf_T5', best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='/raid/zhang/RST/bert-RST/data-bin/data-clang8.t5/', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, do_evaluate=False, do_layer_decay=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.0, layer_decay=1.0, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format='simple', log_interval=50, lr=[0.0001], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_tokens_valid=1024, max_update=2000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=True, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=True, reset_lr_scheduler=False, reset_meters=True, reset_optimizer=True, restore_file='checkpoint_last.pt', save_dir='/raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/', save_interval=1, save_interval_updates=0, seed=550, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, source_lang='src', target_lang='tgt', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', update_freq=[32], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=-1, warmup_updates=60, weight_decay=0.001)
| [src] dictionary: 32100 types
| [tgt] dictionary: 32100 types
| model hf_T5, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 737668096 (num. trained: 737668096)
| training on 1 GPUs
| max tokens per GPU = 4096 and max sentences per GPU = None
| no existing checkpoint found /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_last.pt
| loading train data for epoch 0
| loaded 2372119 examples from: /raid/zhang/RST/bert-RST/data-bin/data-clang8.t5/train.src-tgt.src
| loaded 2372119 examples from: /raid/zhang/RST/bert-RST/data-bin/data-clang8.t5/train.src-tgt.tgt
| /raid/zhang/RST/bert-RST/data-bin/data-clang8.t5/ train src-tgt 2372119 examples
| WARNING: 1 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[2323862]
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| WARNING: overflow detected, setting loss scale to: 16.0
| WARNING: overflow detected, setting loss scale to: 8.0
| WARNING: overflow detected, setting loss scale to: 4.0
| WARNING: overflow detected, setting loss scale to: 2.0
| WARNING: overflow detected, setting loss scale to: 1.0
| WARNING: overflow detected, setting loss scale to: 0.5
| WARNING: overflow detected, setting loss scale to: 0.25
| WARNING: overflow detected, setting loss scale to: 0.125
| epoch 001:     50 / 306 loss=1.442, nll_loss=1.442, ppl=2.72, wps=4971, ups=0, wpb=125508.634, bsz=7700.683, num_updates=41, lr=6.83333e-05, gnorm=4.905, clip=0.000, oom=0.000, loss_scale=0.125, wall=1035, train_wall=992
| epoch 001:    100 / 306 loss=0.849, nll_loss=0.849, ppl=1.80, wps=5640, ups=0, wpb=125328.648, bsz=7732.835, num_updates=91, lr=8.11998e-05, gnorm=2.381, clip=0.000, oom=0.000, loss_scale=0.125, wall=2022, train_wall=1952
| epoch 001:    150 / 306 loss=0.645, nll_loss=0.645, ppl=1.56, wps=5878, ups=0, wpb=125581.128, bsz=7799.433, num_updates=141, lr=6.52328e-05, gnorm=1.601, clip=0.000, oom=0.000, loss_scale=0.125, wall=3013, train_wall=2916
| epoch 001:    200 / 306 loss=0.543, nll_loss=0.543, ppl=1.46, wps=5998, ups=0, wpb=125630.817, bsz=7790.408, num_updates=191, lr=5.60478e-05, gnorm=1.330, clip=0.000, oom=0.000, loss_scale=0.125, wall=4000, train_wall=3877
| epoch 001:    250 / 306 loss=0.481, nll_loss=0.481, ppl=1.40, wps=6069, ups=0, wpb=125542.419, bsz=7750.531, num_updates=241, lr=4.98962e-05, gnorm=1.087, clip=0.000, oom=0.000, loss_scale=0.125, wall=4985, train_wall=4836
| epoch 001:    300 / 306 loss=0.438, nll_loss=0.438, ppl=1.35, wps=6117, ups=0, wpb=125489.017, bsz=7754.110, num_updates=291, lr=4.54077e-05, gnorm=0.918, clip=0.000, oom=0.000, loss_scale=0.125, wall=5970, train_wall=5795
| epoch 001 | loss 0.435 | nll_loss 0.435 | ppl 1.35 | wps 6120 | ups 0 | wpb 125289.868 | bsz 7738.264 | num_updates 296 | lr 4.50225e-05 | gnorm 0.903 | clip 0.000 | oom 0.000 | loss_scale 0.125 | wall 6059 | train_wall 5882
| epoch 001 | valid on 'valid' subset | loss 0.584 | nll_loss 0.584 | ppl 1.50 | num_updates 296
valid
{'precision': 0.6029040404040404, 'recall': 0.2724679029957204, 'f0.5': 0.48521491718321313}
test
{'precision': 0.72434729811779, 'recall': 0.4458146487294469, 'f0.5': 0.6438903281519861}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 1 @ 296 updates) (writing took 166.43032836914062 seconds)
| epoch 002:     50 / 306 loss=0.211, nll_loss=0.211, ppl=1.16, wps=6348, ups=0, wpb=125803.431, bsz=7637.961, num_updates=347, lr=4.15825e-05, gnorm=0.183, clip=0.000, oom=0.000, loss_scale=0.125, wall=7944, train_wall=6864
| epoch 002:    100 / 306 loss=0.214, nll_loss=0.214, ppl=1.16, wps=6354, ups=0, wpb=125588.772, bsz=7766.317, num_updates=397, lr=3.88759e-05, gnorm=0.163, clip=0.000, oom=0.000, loss_scale=0.125, wall=8930, train_wall=7823
| epoch 002:    150 / 306 loss=0.213, nll_loss=0.213, ppl=1.16, wps=6355, ups=0, wpb=125474.384, bsz=7806.556, num_updates=447, lr=3.66372e-05, gnorm=0.192, clip=0.000, oom=0.000, loss_scale=0.125, wall=9915, train_wall=8781
| epoch 002:    200 / 306 loss=0.210, nll_loss=0.210, ppl=1.16, wps=6358, ups=0, wpb=125434.453, bsz=7750.716, num_updates=497, lr=3.47454e-05, gnorm=0.170, clip=0.000, oom=0.000, loss_scale=0.125, wall=10899, train_wall=9739
| epoch 002:    250 / 306 loss=0.208, nll_loss=0.208, ppl=1.16, wps=6359, ups=0, wpb=125415.151, bsz=7754.924, num_updates=547, lr=3.31194e-05, gnorm=0.152, clip=0.000, oom=0.000, loss_scale=0.250, wall=11884, train_wall=10697
| epoch 002:    300 / 306 loss=0.205, nll_loss=0.205, ppl=1.15, wps=6359, ups=0, wpb=125494.043, bsz=7750.027, num_updates=597, lr=3.17021e-05, gnorm=0.137, clip=0.000, oom=0.000, loss_scale=0.250, wall=12874, train_wall=11660
| epoch 002 | loss 0.205 | nll_loss 0.205 | ppl 1.15 | wps 6358 | ups 0 | wpb 125316.634 | bsz 7752.020 | num_updates 602 | lr 3.15702e-05 | gnorm 0.137 | clip 0.000 | oom 0.000 | loss_scale 0.250 | wall 12964 | train_wall 11748
| epoch 002 | valid on 'valid' subset | loss 0.584 | nll_loss 0.584 | ppl 1.50 | num_updates 602 | best_loss -0.485215
valid
{'precision': 0.5867768595041323, 'recall': 0.30385164051355207, 'f0.5': 0.49465861588481197}
test
{'precision': 0.7110633727175081, 'recall': 0.4833880978459292, 'f0.5': 0.6498478452930205}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 2 @ 602 updates) (writing took 208.477609872818 seconds)
| epoch 003:     50 / 306 loss=0.191, nll_loss=0.191, ppl=1.14, wps=6341, ups=0, wpb=125609.490, bsz=7769.725, num_updates=653, lr=3.03123e-05, gnorm=0.094, clip=0.000, oom=0.000, loss_scale=0.250, wall=14885, train_wall=12729
| WARNING: overflow detected, setting loss scale to: 0.125
| epoch 003:    100 / 306 loss=0.188, nll_loss=0.188, ppl=1.14, wps=6279, ups=0, wpb=125440.710, bsz=7776.080, num_updates=702, lr=2.92353e-05, gnorm=0.091, clip=0.000, oom=0.000, loss_scale=0.125, wall=15872, train_wall=13690
| epoch 003:    150 / 306 loss=0.188, nll_loss=0.188, ppl=1.14, wps=6307, ups=0, wpb=125430.607, bsz=7756.427, num_updates=752, lr=2.82466e-05, gnorm=0.083, clip=0.000, oom=0.000, loss_scale=0.125, wall=16858, train_wall=14649
| epoch 003:    200 / 306 loss=0.186, nll_loss=0.186, ppl=1.14, wps=6323, ups=0, wpb=125523.935, bsz=7746.790, num_updates=802, lr=2.7352e-05, gnorm=0.081, clip=0.000, oom=0.000, loss_scale=0.125, wall=17845, train_wall=15609
| epoch 003:    250 / 306 loss=0.186, nll_loss=0.186, ppl=1.14, wps=6327, ups=0, wpb=125459.428, bsz=7767.608, num_updates=852, lr=2.65372e-05, gnorm=0.078, clip=0.000, oom=0.000, loss_scale=0.125, wall=18832, train_wall=16569
| epoch 003:    300 / 306 loss=0.184, nll_loss=0.184, ppl=1.14, wps=6332, ups=0, wpb=125502.963, bsz=7765.113, num_updates=902, lr=2.57912e-05, gnorm=0.073, clip=0.000, oom=0.000, loss_scale=0.125, wall=19821, train_wall=17531
| epoch 003 | loss 0.184 | nll_loss 0.184 | ppl 1.14 | wps 6333 | ups 0 | wpb 125321.672 | bsz 7750.341 | num_updates 907 | lr 2.57201e-05 | gnorm 0.073 | clip 0.000 | oom 0.000 | loss_scale 0.125 | wall 19911 | train_wall 17618
| epoch 003 | valid on 'valid' subset | loss 0.582 | nll_loss 0.582 | ppl 1.50 | num_updates 907 | best_loss -0.494659
valid
{'precision': 0.5830238726790451, 'recall': 0.3135520684736091, 'f0.5': 0.4975101856043459}
test
{'precision': 0.7037037037037037, 'recall': 0.4896551724137931, 'f0.5': 0.6471265470593879}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 3 @ 907 updates) (writing took 439.15315651893616 seconds)
| epoch 004:     50 / 306 loss=0.177, nll_loss=0.177, ppl=1.13, wps=6357, ups=0, wpb=125509.843, bsz=7917.961, num_updates=958, lr=2.50261e-05, gnorm=0.078, clip=0.000, oom=0.000, loss_scale=0.125, wall=22113, train_wall=18597
| epoch 004:    100 / 306 loss=0.177, nll_loss=0.177, ppl=1.13, wps=6354, ups=0, wpb=125535.901, bsz=7772.495, num_updates=1008, lr=2.43975e-05, gnorm=0.080, clip=0.000, oom=0.000, loss_scale=0.125, wall=23102, train_wall=19558
| epoch 004:    150 / 306 loss=0.176, nll_loss=0.176, ppl=1.13, wps=6357, ups=0, wpb=125492.728, bsz=7728.675, num_updates=1058, lr=2.3814e-05, gnorm=0.078, clip=0.000, oom=0.000, loss_scale=0.125, wall=24087, train_wall=20518
| epoch 004:    200 / 306 loss=0.175, nll_loss=0.175, ppl=1.13, wps=6352, ups=0, wpb=125509.726, bsz=7789.761, num_updates=1108, lr=2.32705e-05, gnorm=0.076, clip=0.000, oom=0.000, loss_scale=0.125, wall=25077, train_wall=21480
| epoch 004:    250 / 306 loss=0.174, nll_loss=0.174, ppl=1.13, wps=6350, ups=0, wpb=125511.920, bsz=7804.359, num_updates=1158, lr=2.27626e-05, gnorm=0.073, clip=0.000, oom=0.000, loss_scale=0.125, wall=26067, train_wall=22443
| epoch 004:    300 / 306 loss=0.174, nll_loss=0.174, ppl=1.13, wps=6348, ups=0, wpb=125478.854, bsz=7763.130, num_updates=1208, lr=2.22865e-05, gnorm=0.069, clip=0.000, oom=0.000, loss_scale=0.250, wall=27056, train_wall=23406
| epoch 004 | loss 0.174 | nll_loss 0.174 | ppl 1.13 | wps 6348 | ups 0 | wpb 125316.634 | bsz 7752.020 | num_updates 1213 | lr 2.22405e-05 | gnorm 0.069 | clip 0.000 | oom 0.000 | loss_scale 0.250 | wall 27147 | train_wall 23494
| epoch 004 | valid on 'valid' subset | loss 0.591 | nll_loss 0.591 | ppl 1.51 | num_updates 1213 | best_loss -0.49751
valid
{'precision': 0.5836392239119035, 'recall': 0.317546362339515, 'f0.5': 0.4998652654271086}
test
{'precision': 0.7076843198338525, 'recall': 0.49473684210526314, 'f0.5': 0.6515919303948752}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 4 @ 1213 updates) (writing took 161.88716459274292 seconds)
| epoch 005:     50 / 306 loss=0.169, nll_loss=0.169, ppl=1.12, wps=6314, ups=0, wpb=125422.588, bsz=7729.412, num_updates=1264, lr=2.17872e-05, gnorm=0.052, clip=0.000, oom=0.000, loss_scale=0.250, wall=29014, train_wall=24478
| epoch 005:    100 / 306 loss=0.169, nll_loss=0.169, ppl=1.12, wps=6316, ups=0, wpb=125484.683, bsz=7849.347, num_updates=1314, lr=2.13687e-05, gnorm=0.064, clip=0.000, oom=0.000, loss_scale=0.250, wall=30008, train_wall=25444
| epoch 005:    150 / 306 loss=0.169, nll_loss=0.169, ppl=1.12, wps=6316, ups=0, wpb=125456.093, bsz=7797.192, num_updates=1364, lr=2.09734e-05, gnorm=0.062, clip=0.000, oom=0.000, loss_scale=0.250, wall=31000, train_wall=26410
| epoch 005:    200 / 306 loss=0.167, nll_loss=0.167, ppl=1.12, wps=6320, ups=0, wpb=125496.721, bsz=7759.443, num_updates=1414, lr=2.05992e-05, gnorm=0.059, clip=0.000, oom=0.000, loss_scale=0.250, wall=31993, train_wall=27376
| epoch 005:    250 / 306 loss=0.167, nll_loss=0.167, ppl=1.12, wps=6323, ups=0, wpb=125524.960, bsz=7730.773, num_updates=1464, lr=2.02444e-05, gnorm=0.077, clip=0.000, oom=0.000, loss_scale=0.250, wall=32985, train_wall=28342
| epoch 005:    300 / 306 loss=0.167, nll_loss=0.167, ppl=1.12, wps=6322, ups=0, wpb=125487.010, bsz=7755.023, num_updates=1514, lr=1.99073e-05, gnorm=0.073, clip=0.000, oom=0.000, loss_scale=0.250, wall=33976, train_wall=29306
| epoch 005 | loss 0.167 | nll_loss 0.167 | ppl 1.12 | wps 6321 | ups 0 | wpb 125316.634 | bsz 7752.020 | num_updates 1519 | lr 1.98745e-05 | gnorm 0.073 | clip 0.000 | oom 0.000 | loss_scale 0.250 | wall 34068 | train_wall 29395
| epoch 005 | valid on 'valid' subset | loss 0.597 | nll_loss 0.597 | ppl 1.51 | num_updates 1519 | best_loss -0.499865
valid
{'precision': 0.5812629399585921, 'recall': 0.3203994293865906, 'f0.5': 0.49986646488026354}
test
{'precision': 0.7034972123669538, 'recall': 0.5012639942217407, 'f0.5': 0.6509708282525092}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 5 @ 1519 updates) (writing took 164.57595419883728 seconds)
| epoch 006:     50 / 306 loss=0.164, nll_loss=0.164, ppl=1.12, wps=6306, ups=0, wpb=125490.549, bsz=7855.961, num_updates=1570, lr=1.95491e-05, gnorm=0.051, clip=0.000, oom=0.000, loss_scale=0.250, wall=35970, train_wall=30380
| epoch 006:    100 / 306 loss=0.163, nll_loss=0.163, ppl=1.12, wps=6329, ups=0, wpb=125513.980, bsz=7729.406, num_updates=1620, lr=1.9245e-05, gnorm=0.056, clip=0.000, oom=0.000, loss_scale=0.250, wall=36958, train_wall=31342
| epoch 006:    150 / 306 loss=0.163, nll_loss=0.163, ppl=1.12, wps=6334, ups=0, wpb=125458.669, bsz=7767.086, num_updates=1670, lr=1.89547e-05, gnorm=0.054, clip=0.000, oom=0.000, loss_scale=0.250, wall=37946, train_wall=32303
| epoch 006:    200 / 306 loss=0.162, nll_loss=0.162, ppl=1.12, wps=6340, ups=0, wpb=125493.786, bsz=7772.368, num_updates=1720, lr=1.86772e-05, gnorm=0.092, clip=0.000, oom=0.000, loss_scale=0.500, wall=38934, train_wall=33264
| epoch 006:    250 / 306 loss=0.162, nll_loss=0.162, ppl=1.12, wps=6345, ups=0, wpb=125448.562, bsz=7747.116, num_updates=1770, lr=1.84115e-05, gnorm=0.088, clip=0.000, oom=0.000, loss_scale=0.500, wall=39917, train_wall=34221
| epoch 006:    300 / 306 loss=0.162, nll_loss=0.162, ppl=1.12, wps=6351, ups=0, wpb=125493.309, bsz=7759.595, num_updates=1820, lr=1.81568e-05, gnorm=0.081, clip=0.000, oom=0.000, loss_scale=0.500, wall=40903, train_wall=35179
| epoch 006 | loss 0.162 | nll_loss 0.162 | ppl 1.12 | wps 6351 | ups 0 | wpb 125316.634 | bsz 7752.020 | num_updates 1825 | lr 1.81319e-05 | gnorm 0.081 | clip 0.000 | oom 0.000 | loss_scale 0.500 | wall 40993 | train_wall 35267
| epoch 006 | valid on 'valid' subset | loss 0.604 | nll_loss 0.604 | ppl 1.52 | num_updates 1825 | best_loss -0.499866
valid
{'precision': 0.5770609318996416, 'recall': 0.3215406562054208, 'f0.5': 0.49792347795352127}
test
{'precision': 0.6997981836528759, 'recall': 0.5001803101334295, 'f0.5': 0.6480702738061863}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_last.pt (epoch 6 @ 1825 updates) (writing took 167.02399063110352 seconds)
| epoch 007:     50 / 306 loss=0.160, nll_loss=0.160, ppl=1.12, wps=6363, ups=0, wpb=125679.392, bsz=7945.255, num_updates=1876, lr=1.78838e-05, gnorm=0.049, clip=0.000, oom=0.000, loss_scale=0.500, wall=42891, train_wall=36245
| epoch 007:    100 / 306 loss=0.158, nll_loss=0.158, ppl=1.12, wps=6372, ups=0, wpb=125635.713, bsz=7775.604, num_updates=1926, lr=1.76501e-05, gnorm=0.049, clip=0.000, oom=0.000, loss_scale=0.500, wall=43875, train_wall=37202
| epoch 007:    150 / 306 loss=0.159, nll_loss=0.159, ppl=1.12, wps=6381, ups=0, wpb=125584.344, bsz=7765.457, num_updates=1976, lr=1.74254e-05, gnorm=0.053, clip=0.000, oom=0.000, loss_scale=0.500, wall=44856, train_wall=38156
| epoch 007 | loss 0.159 | nll_loss 0.159 | ppl 1.12 | wps 6382 | ups 0 | wpb 125561.977 | bsz 7709.714 | num_updates 2000 | lr 1.73205e-05 | gnorm 0.052 | clip 0.000 | oom 0.000 | loss_scale 0.500 | wall 45327 | train_wall 38615
| epoch 007 | valid on 'valid' subset | loss 0.609 | nll_loss 0.609 | ppl 1.53 | num_updates 2000 | best_loss -0.499866
valid
{'precision': 0.5835502342529932, 'recall': 0.31982881597717544, 'f0.5': 0.5009384216641344}
test
{'precision': 0.6967545638945233, 'recall': 0.4967462039045553, 'f0.5': 0.6448282335273138}
| saved checkpoint /raid/zhang/RST/bert-RST/GEC_T5_checkpoints/550/checkpoint_best.pt (epoch 7 @ 2000 updates) (writing took 300.5636522769928 seconds)
| done training in 46334.1 seconds

ZHANG45 avatar Sep 02 '21 08:09 ZHANG45

Hi @ZHANG45, I had a quick look at your commands and it looks like you are using T5 1.0, whereas the paper used T5 1.1. Could you try again with a T5 1.1 checkpoint?

Other differences:

The use of float32 vs. float16.

Adam vs. Adafactor.

Jmallins avatar Sep 06 '21 09:09 Jmallins

I think you have already started overfitting: your training loss is 0.159 but your validation loss is 0.609. Our run stopped with a loss of 0.356.

Here is an exhaustive list of hparams. I hope this helps.

# Macros:
# ==============================================================================
d_ff = 2816
d_kv = 64
d_model = 1024
dropout_rate = 0.1
MIXTURE_NAME = 'clang8.en'
num_heads = 16
num_layers = 24

# Parameters for adafactor_decay_rate_pow:
# ==============================================================================
adafactor_decay_rate_pow.exponent = 0.8
adafactor_decay_rate_pow.offset = 0

# Parameters for AdafactorOptimizer:
# ==============================================================================
AdafactorOptimizer.beta1 = 0.0
AdafactorOptimizer.clipping_threshold = 1.0
AdafactorOptimizer.decay_rate = None
AdafactorOptimizer.epsilon1 = 1e-30
AdafactorOptimizer.epsilon2 = 0.001
AdafactorOptimizer.exclude_from_parameter_scale = None
AdafactorOptimizer.factored = True
AdafactorOptimizer.min_dim_size_to_factor = 128
AdafactorOptimizer.multiply_by_parameter_scale = True
AdafactorOptimizer.stacked_dim_names = None

# Parameters for Bitransformer:
# ==============================================================================
Bitransformer.shared_embedding = True

# Parameters for constant_learning_rate:
# ==============================================================================
constant_learning_rate.learning_rate = 0.001

# Parameters for decoder/DenseReluDense:
# ==============================================================================
decoder/DenseReluDense.activation = ['gelu', 'linear']
decoder/DenseReluDense.dropout_rate = %dropout_rate
decoder/DenseReluDense.hidden_size = %d_ff
decoder/DenseReluDense.use_bias = False

# Parameters for encoder/DenseReluDense:
# ==============================================================================
encoder/DenseReluDense.activation = ['gelu', 'linear']
encoder/DenseReluDense.dropout_rate = %dropout_rate
encoder/DenseReluDense.hidden_size = %d_ff
encoder/DenseReluDense.use_bias = False

# Parameters for decoder/EncDecAttention:
# ==============================================================================
decoder/EncDecAttention.relative_attention_type = None

# Parameters for get_variable_dtype:
# ==============================================================================
get_variable_dtype.activation_dtype = 'bfloat16'

# Parameters for run:
# ==============================================================================
run.batch_size = ('tokens_per_batch', 1048576)
run.iterations_per_loop = 100
run.learning_rate_schedule = @learning_rate_schedules.constant_learning_rate
run.mode = 'train'
run.model_type = 'bitransformer'
run.optimizer = @optimize.AdafactorOptimizer
run.output_eval_examples = True
run.perplexity_eval_steps = 10
run.predict_fn = None
run.save_checkpoints_steps = 2400
run.seen_data_init_step = 0
run.sequence_length = {'inputs': 128, 'targets': 128}
run.skip_seen_data = False
run.train_steps = 1002000
run.variable_filter = None
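
For readers reproducing this in PyTorch, these Adafactor settings map roughly onto the Adafactor class in Hugging Face transformers. This is a sketch only, not the setup used for the paper, and `model` is a placeholder for a loaded T5 model:

from transformers.optimization import Adafactor

def build_adafactor(model):
    """Construct Adafactor with settings close to the gin config above (a sketch)."""
    return Adafactor(
        model.parameters(),
        lr=1e-3,                 # constant_learning_rate.learning_rate = 0.001
        clip_threshold=1.0,      # AdafactorOptimizer.clipping_threshold = 1.0
        decay_rate=-0.8,         # adafactor_decay_rate_pow.exponent = 0.8
        beta1=None,              # AdafactorOptimizer.beta1 = 0.0 (no first-moment accumulation)
        scale_parameter=True,    # multiply_by_parameter_scale = True
        relative_step=False,     # use the explicit constant LR instead of the built-in schedule
        warmup_init=False,
    )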

casaro avatar Sep 06 '21 14:09 casaro

Hi Jmallins and casaro, thank you for the valuable information. I tried the settings you mentioned, but my validation loss was still larger than 0.6.

So I checked the cLang-8, CoNLL-2013, and CoNLL-2014 datasets, and I found that, for cLang-8, tokenizing punctuation is treated as a kind of grammatical error, for example:

Input sentence: oral test~
Target sentence: oral test ~

But for the CoNLL-2013 and CoNLL-2014 datasets, this tokenization of punctuation is not treated as a grammatical error. (For CoNLL-2013 I use the file /revised/data/official-preprocessed.m2, and for CoNLL-2014 I use the file noalt/official-2014.combined.m2.)

CoNLL-2013
S Misha ( 2008 ) explains in his book 'McMafia ' that the shadow market accounts for 20 % of the global economy .


CoNLL-2014
S Depending on whom the genetic risk was discovered from : the son or daughter for example , the parents who knows about this would decide who gets told and whom takes the precedence in being 'allowed ' to tell .
A 18 19|||Nn|||parentparent|||REQUIRED|||-NONE-|||0
A 2 3|||Pform|||who|||REQUIRED|||-NONE-|||1
A 8 9|||Prep|||in|||REQUIRED|||-NONE-|||1
A 14 14|||Mec|||,|||REQUIRED|||-NONE-|||1
A 20 21|||SVA|||know|||REQUIRED|||-NONE-|||1
A 29 30|||Pform|||who|||REQUIRED|||-NONE-|||1
A 31 32|||ArtOrDet||||||REQUIRED|||-NONE-|||1

My reproduced T5 model tends to perform this punctuation tokenization, and this causes a high validation loss. I want to know whether you ignored the punctuation tokenization for T5 during prediction. (If your T5 also does this tokenization during prediction, I think the high validation loss of my model is caused by the device I used.)

ZHANG45 avatar Sep 19 '21 02:09 ZHANG45

Are you using the default flag value of --tokenize_text=True? This should ensure that the tokenization is consistent between sources and targets (although I haven't checked if the spaCy tokenizer will add a space before ~).

ekQ avatar Sep 20 '21 13:09 ekQ

ekQ, thank you for your reply! I used the default flag value --tokenize_text=True, as set in run.sh.

I also checked the file targets/clang8_en.detokenized.tsv and found that some sentences are still tokenized, for example:

722081 214850 0 False oral test ~

But their corresponding sources in the file lang-8-20111007-L1-v2.dat are detokenized.

I think the inconsistent tokenization between sources and targets in my reproduced cLang-8 may be caused by this problem rather than by the spaCy tokenizer.

ZHANG45 avatar Sep 21 '21 02:09 ZHANG45

I see, it looks like the detokenizer we used hasn't removed spaces before ~ characters, causing a small tokenization inconsistency between our sources and targets. Your performance could go up if you fixed this.
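
(An illustrative fix only, not the project's detokenizer or retokenize.py: removing the stray space before ~ in the targets could look like this.)

import re

def drop_space_before_tilde(text: str) -> str:
    # e.g. "oral test ~" -> "oral test~"
    return re.sub(r"\s+~", "~", text)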

However, this is the data we used for the models evaluated in the paper so the discrepancy between your numbers and the ones reported in the paper is likely caused by some other difference in our T5 training setups and/or the overfitting issue that casaro mentioned.

ekQ avatar Sep 21 '21 09:09 ekQ

Maybe you could try evaluating on BEA, to ensure the evaluation procedure is the same?

Jmallins avatar Sep 21 '21 14:09 Jmallins

ekQ and Jmallins, thank you for your replies and suggestions.

I can achieve a 72.31 F0.5 score on the BEA test data with my reproduced T5-large, so I think my high validation loss is caused by the tokenization problem in the CoNLL data.

ZHANG45 avatar Sep 24 '21 06:09 ZHANG45

I also face this issue. My F0.5 score on the BEA-2019 test set is 72.26, while the score on CoNLL-2014 is 65.07. However, I don't think tokenization is the problem; I don't see any tokenization errors when I inspect my CoNLL-2014 evaluation. Are there any special steps needed to reproduce the CoNLL-2014 score? Thank you!

mrqorib avatar Nov 28 '21 17:11 mrqorib

We've looked more closely into the issue and found out that a likely cause for the differences in CoNLL-2014 scores is the selection of the target file: we used alt/official-2014.combined-withalt.m2 from https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz

EDIT (2022-07-19): We've looked into the issue even more closely and found out that we used noalt/official-2014.combined.m2 (which is the correct eval set used in most existing works) after all. The reason for the confusion is that we used a wrapper built around the M2 scorer, and the wrapper includes some post-processing steps that yield an improvement comparable to using the alt file without post-processing.

I've uploaded a simplified version of the post-processing steps that fix tokenization discrepancies as retokenize.py. Running this script on the model outputs improves the F0.5 scores by about 2.5 points (for T5 xxl).

ekQ avatar Dec 10 '21 17:12 ekQ

hi, @ekQ

I am fine-tuning my T5 model (Hugging Face) on the cLang-8 dataset.

How can I set my hyperparameters to the values below?

tokens_per_batch: 1048576
finetune_steps: 2000

lukliz avatar Jan 12 '22 02:01 lukliz

Hi, in case there isn't an option to directly specify tokens_per_batch, you can alternatively set the batch size to 1048576 tokens / 128 tokens per sequence = 8192 sequences.
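
For example, with the Hugging Face Trainer (a sketch assuming a single GPU; the 16 x 512 factorization and the output directory are just placeholders), the effective batch of 8192 sequences can be reached via gradient accumulation:

from transformers import Seq2SeqTrainingArguments

# Effective batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
#                 = 16 * 512 * 1 = 8192 sequences (~1,048,576 tokens at sequence length 128)
training_args = Seq2SeqTrainingArguments(
    output_dir="t5_large_clang8",     # placeholder output directory
    per_device_train_batch_size=16,   # adjust to what fits in GPU memory
    gradient_accumulation_steps=512,  # 16 * 512 = 8192 sequences per update
    max_steps=2000,                   # finetune_steps
)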

ekQ avatar Jan 12 '22 10:01 ekQ

Hi @ZHANG45, would you mind sharing the training script and setup of your Hugging Face reproduction? Even better if you can also share the trained weights. I was able to reproduce this work using the original T5 code, but on Hugging Face my experiment converged to a lower score.

mrqorib avatar Mar 01 '22 22:03 mrqorib

Hi @mrqorib, I am sorry, but I reproduced this work with fairseq.

ZHANG45 avatar Mar 13 '22 08:03 ZHANG45

Hi @ZHANG45, fairseq is fine too. Would you mind sharing your code? It would be very helpful to me.

mrqorib avatar Mar 14 '22 14:03 mrqorib

@mrqorib This is the setting I used, but you need to edit SAVE_PATH, DATA_DIR, and the arch yourself. Additionally, you need to add a new hf_T5 arch following https://github.com/pytorch/fairseq/blob/main/fairseq/models/huggingface/hf_gpt2.py, and don't forget to load the pre-trained T5 weights before training. Because I used the "translation" task, you also need to preprocess the cLang-8 dataset into the corresponding src and tgt data.

#!/usr/bin/env bash
TOTAL_UPDATES=1002000    # Total number of training steps (1002000)
PEAK_LR=0.001            # Peak learning rate, adjust as needed
MAX_TOKENS=8192          # Max tokens per GPU batch
UPDATE_FREQ=128          # Accumulate gradients over 128 batches to increase the effective batch size 128x

for SEED in 4321; do
SAVE_PATH=/Model/

mkdir -p $SAVE_PATH

DATA_DIR=/data-clang8/

CUDA_VISIBLE_DEVICES=0 python fairseq/train.py $DATA_DIR \
    --seed $SEED \
    --update-ordered-indices-seed \
    --update-epoch-batch-itr True \
    --no-epoch-checkpoints \
    --reset-optimizer --reset-dataloader --reset-meters \
    --task translation -s src -t tgt --criterion label_smoothed_cross_entropy \
    --arch hf_T5 \
    --max-source-positions 128 \
    --max-target-positions 128 \
    --optimizer adafactor \
    --lr $PEAK_LR \
    --update-freq $UPDATE_FREQ \
    --max-tokens $MAX_TOKENS \
    --save-dir $SAVE_PATH \
    --max-tokens-valid 512 \
    --save-interval-updates 2400 \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 #> $SAVE_PATH/log.txt

done

ZHANG45 avatar Mar 25 '22 02:03 ZHANG45

Hi @ZHANG45, thanks for sharing the script and the detailed information.

May I know how you loaded the pre-trained T5 model before training?

Also, may I know whether the data in $DATA_DIR is the binarized data generated by the fairseq-preprocess command? If so, how did you get the vocabulary file that needs to be passed to fairseq-preprocess?

MichaelCaohn avatar Apr 04 '22 14:04 MichaelCaohn

Hi @ZHANG45, thanks for sharing the script and the detailed information.

May I know how you loaded the pre-trained T5 model before training?

Also, may I know whether the data in $DATA_DIR is the binarized data generated by the fairseq-preprocess command? If so, how did you get the vocabulary file that needs to be passed to fairseq-preprocess?

@MichaelCaohn Hi, I also reproduced this work in fairseq, so maybe I can answer your questions. For your first question, you need to create a wrapper for the T5 arch in fairseq and then reload the Hugging Face-format weights into fairseq as follows (this is my example for BART):

# in fairseq/trainer.py (requires: from transformers import BartForConditionalGeneration)
if getattr(self.args, "bart_model_file_from_transformers", None) is not None:
      bart_model_file_from_transformers = self.args.bart_model_file_from_transformers
      model = BartForConditionalGeneration.from_pretrained(bart_model_file_from_transformers)
      self.get_model().load_bart_state_dict_from_transformers(
          model, strict=True, args=self.args
      )
      logger.info(
          "loaded bart parameters from " + bart_model_file_from_transformers
      )

# in your wrapper file
  def load_bart_state_dict_from_transformers(self, model, strict=True, args=None):
      new_state_dict = {}
      for k, v in model.named_parameters():
          new_state_dict[k.replace("model.", "")] = v
      # Share all embeddings
      shared_weight = self._get_resized_embeddings(new_state_dict["shared.weight"])
      new_state_dict["encoder.embed_tokens.weight"] = new_state_dict["decoder.embed_tokens.weight"] = new_state_dict["decoder.output_projection.weight"] = shared_weight
      del new_state_dict["shared.weight"]
      del model
      return super().load_state_dict(new_state_dict, True)

See https://github.com/pytorch/fairseq/issues/2666 for more details.

The data in $DATA_DIR is the binarized data generated by the fairseq-preprocess command. You can get the vocabulary file of T5 from https://huggingface.co/t5-base/resolve/main/spiece.model, which is a serialized SentencePiece model. You should load it with transformers and re-save it in the standard fairseq vocabulary format, i.e., each line consists of a token and its frequency (you can use a dummy value for the frequency).
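
A rough sketch of that conversion (untested; it assumes the Hugging Face transformers package, and the handling of special tokens such as <s>/<pad>/</s>/<unk> may need adjustment since fairseq prepends its own):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# order tokens by their SentencePiece id so the fairseq dictionary indices line up
vocab = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])

with open("dict.txt", "w", encoding="utf-8") as f:
    for token, _idx in vocab:
        f.write(f"{token} 1\n")  # "<token> <frequency>" with a dummy frequency of 1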

HillZhang1999 avatar Apr 05 '22 04:04 HillZhang1999

Hi @HillZhang1999, Thank you very much for the suggestion, I will try it out.

MichaelCaohn avatar Apr 08 '22 15:04 MichaelCaohn

Hi @ekQ, when I fine-tune T5 on cLang-8, should I ensure that the training set and validation set both come from cLang-8? For example, I split the cLang-8 dataset into a training set containing 98% of the sentence pairs and a validation set containing the rest.

DarlingJOJO avatar Apr 30 '22 08:04 DarlingJOJO

Those of you who've had difficulties reproducing the CoNLL-14 results, please note the updated https://github.com/google-research-datasets/clang8/issues/3#issuecomment-991151706. In short, you should post-process your model outputs with retokenize.py to fix some tokenization discrepancies and only then compute the F0.5 score, using the noalt targets.

ekQ avatar Jul 19 '22 18:07 ekQ

Hi, currently, we're not planning to release the model checkpoints, which would require an approval process for us.

Regarding the hyperparameter settings: I'd need to double check with my co-author who is currently OOO, but I think that for T5-large we used

tokens_per_batch: 1048576 
finetune_steps: 2000

Hope this helps.

Hi, thanks for your work. I want to fine-tune the mT5 small/base/large/3B models on German, and I have tried many times but never succeeded. I think it is because of the hyperparameters. I already know the hyperparameters from the paper, but I am confused about the batch size and learning rate. Could you please check them for me? Thanks for your help!

BinLiang2021 avatar Jan 12 '23 02:01 BinLiang2021