UnifiedSKG The different results between eval mode and test mode.

Open eyuansu62 opened this issue 2 years ago • 14 comments

Why do I get different results between eval mode and test mode?

May 12 '22 14:05 eyuansu62

Hi,

Could you share the command you ran for this experiment?

May 12 '22 14:05 ChenWu98

The command is as follows:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value  --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

May 13 '22 02:05 eyuansu62

Is the highest eval score the same as the test score?

May 13 '22 02:05 ChenWu98

The checkpoint I chose is the one with the highest eval score during training. As you can see, its eval score is different from the test score.

May 13 '22 02:05 eyuansu62

Can you run the following command on the same machine (which means that the previous checkpoints are still there) and see if the results are different?

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 0 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

May 13 '22 03:05 ChenWu98

@eyuansu62 Hi, any new progress over there? We double-checked our earlier experiment logs and did not find the case you showed. We also looked through the PICARD issues and saw that you opened a similar issue there. It is very likely that both are the same problem, caused by the same factor on your machine.

Hope we can figure that out together!

May 13 '22 14:05 Timothyxxx

They are still a little different.

May 15 '22 15:05 eyuansu62

Could you double-check the evaluation and prediction JSON files? That could help us find where the problem lies.
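One way to do that comparison is sketched below. It is only a rough sketch: it assumes both files are JSON lists of records in the same order, each with a `prediction` field. The field name and the file names in the commented usage are hypothetical and should be replaced with whatever the run actually wrote to the output directory.

```python
import json
from pathlib import Path


def load_predictions(path, field="prediction"):
    """Load a JSON list of records and return one field from each record.

    The default field name "prediction" is an assumption about the file layout.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [str(record[field]).strip() for record in records]


def diff_predictions(eval_preds, test_preds):
    """Return (index, eval_pred, test_pred) for every position that differs."""
    if len(eval_preds) != len(test_preds):
        raise ValueError(
            f"length mismatch: {len(eval_preds)} eval vs {len(test_preds)} test"
        )
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(eval_preds, test_preds))
        if a != b
    ]


# Hypothetical usage; both file names are placeholders:
# eval_preds = load_predictions("output/eval_predictions.json")
# test_preds = load_predictions("output/test_predictions.json")
# for i, a, b in diff_predictions(eval_preds, test_preds):
#     print(i, a, b, sep="\n  ")
```

If both modes run on the same split with the same checkpoint, the list of differences should be empty, so any entry it returns is an example worth looking at.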

May 16 '22 07:05 Timothyxxx

I checked the evaluation and prediction JSON files and found that they are indeed different, whether I set do_train=False or num_train_epochs=0.

The differing SQL queries look like the following; only a few conditions are wrong:

select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014

select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014
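To pinpoint where two such queries diverge, a token-level diff with Python's standard `difflib` can help. This is a small sketch and not part of UnifiedSKG; the helper name `token_diff` is made up for illustration, and the thread does not say which of the two queries came from eval mode and which from test mode.

```python
import difflib


def token_diff(query_a, query_b):
    """Return (tokens_a, tokens_b) pairs for the whitespace-separated spans that differ."""
    a, b = query_a.split(), query_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


first = (
    "select singer.name from concert join singer_in_concert "
    "on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014"
)
second = (
    "select singer.name from concert join singer_in_concert "
    "on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014"
)

for left, right in token_diff(first, second):
    print(f"{left!r} vs {right!r}")
```

For the pair above, the only differing token is the right-hand side of the join condition: `singer_in_concert.concert_id` in one query and `singer_in_concert.singer_id` in the other.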

May 18 '22 14:05 eyuansu62

Okay, I will keep this issue open and see if anyone finds a similar problem!

May 18 '22 15:05 Timothyxxx

I just realized that the command you provided is for T5-3b without using deepspeed. I remember that we didn't manage to run without deepspeed even on an A100. What kind of GPU are you using, if you remember?

May 18 '22 16:05 ChenWu98

Well, it is actually t5-large in this cfg file. I forgot to change the file name.

May 19 '22 03:05 eyuansu62

Hey, we asked someone else to test it on his side, and he did not get different results between eval mode and test mode (which is consistent with ours). We therefore think the difference may be caused by the machine on your side. Could you provide more information about your hardware and system?

May 19 '22 03:05 Timothyxxx

May 19 '22 09:05 eyuansu62
