DeepLearningExamples
[BERT/PyTorch] How to get accuracy in prediction mode?
Related to BERT/PyTorch
Describe the bug
I want to get the exact_match and F1 scores when doing prediction. I changed some code in scripts/run_squad.sh to use evaluate-v1.1.py, but each score is followed by None in the output.
To Reproduce
Steps to reproduce the behavior:
- Run 'bash scripts/docker/build.sh'
- Run 'bash scripts/docker/launch.sh'
- Comment out some of the download commands in /workspace/bert/data/create_datasets_from_start.sh:

```bash
#Download
#download_wikipedia --outdir ${BERT_PREP_WORKING_DIR}/wikipedia/

python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights  # Includes vocab

python3 /workspace/bert/data/bertPrep.py --action download --dataset squad

#python3 /workspace/bert/data/bertPrep.py --action download --dataset mrpc

#python3 /workspace/bert/data/bertPrep.py --action download --dataset sst-2
```
- Run '/workspace/bert/data/create_datasets_from_start.sh'
- Download the bert-base-uncased-qa checkpoint referenced in README.md
- Change scripts/run_squad.sh as follows (for comparison, a sketch of an eval-mode branch follows these steps):

```bash
init_checkpoint=${1:-"/workspace/bert/checkpoints/bert_base_qa.pt"}  # changed code
epochs=${2:-"2.0"}
batch_size=${3:-"4"}
learning_rate=${4:-"3e-5"}
warmup_proportion=${5:-"0.1"}
precision=${6:-"fp32"}  # changed code
num_gpu=${7:-"1"}  # changed code
seed=${8:-"1"}
squad_dir=${9:-"$BERT_PREP_WORKING_DIR/download/squad/v1.1"}
vocab_file=${10:-"$BERT_PREP_WORKING_DIR/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt"}  # changed code for inference
OUT_DIR=${10:-"/workspace/bert/results/SQuAD"}  # changed code for inference
mode=${11:-"prediction"}  # changed code for inference
CONFIG_FILE=${13:-"/workspace/bert/bert_configs/base.json"}  # changed code
max_steps=${14:-"-1"}
```

```bash
elif [ "$mode" = "prediction" ] ; then
  CMD+="--do_predict "
  CMD+="--predict_file=$squad_dir/dev-v1.1.json "
  CMD+="--predict_batch_size=$batch_size "
  CMD+="--eval_script=$squad_dir/evaluate-v1.1.py "  # additional code
  CMD+="--do_eval "  # additional code
```
- Run 'bash scripts/run_squad.sh checkpoints/bert_base_qa.pt'
- Get the results:

```
Container nvidia build = 29224839
out dir is /workspace/bert/results/SQuAD
python run_squad.py --init_checkpoint=checkpoints/bert_base_qa.pt --do_predict --predict_file=/workspace/bert/data/download/squad/v1.1/dev-v1.1.json --predict_batch_size=4 --eval_script=/workspace/bert/data/download/squad/v1.1/evaluate-v1.1.py --do_eval --do_lower_case --bert_model=bert-base-uncased --learning_rate=3e-5 --warmup_proportion=0.1 --seed=1 --num_train_epochs=2.0 --max_seq_length=384 --doc_stride=128 --output_dir=/workspace/bert/results/SQuAD --vocab_file=/workspace/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt --config_file=/workspace/bert/bert_configs/base.json --max_steps=-1 |& tee /workspace/bert/results/SQuAD/logfile.txt
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
DLL 2022-07-22 07:13:49.447320 - PARAMETER Config : ["Namespace(amp=False, bert_model='bert-base-uncased', cache_dir=None, config_file='/workspace/bert/bert_configs/base.json', disable_progress_bar=False, do_eval=True, do_lower_case=True, do_predict=True, do_train=False, doc_stride=128, eval_script='/workspace/bert/data/download/squad/v1.1/evaluate-v1.1.py', fp16=False, gradient_accumulation_steps=1, init_checkpoint='checkpoints/bert_base_qa.pt', json_summary='results/dllogger.json', learning_rate=3e-05, local_rank=-1, log_freq=50, loss_scale=0, max_answer_length=30, max_query_length=64, max_seq_length=384, max_steps=-1.0, n_best_size=20, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=2.0, output_dir='/workspace/bert/results/SQuAD', predict_batch_size=4, predict_file='/workspace/bert/data/download/squad/v1.1/dev-v1.1.json', profile=False, seed=1, skip_cache=False, skip_checkpoint=False, train_batch_size=32, train_file=None, use_env=False, verbose_logging=False, version_2_with_negative=False, vocab_file='/workspace/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt', warmup_proportion=0.1)"]
DLL 2022-07-22 07:13:49.447650 - PARAMETER SEED : 1
DLL 2022-07-22 07:13:51.179013 - PARAMETER loading_checkpoint : True
DLL 2022-07-22 07:13:51.404388 - PARAMETER loaded_checkpoint : True
DLL 2022-07-22 07:13:52.896981 - PARAMETER model_weights_num : 109488386
DLL 2022-07-22 07:14:37.599428 - PARAMETER infer_start : True
DLL 2022-07-22 07:14:37.599495 - PARAMETER eval_samples : 10570
DLL 2022-07-22 07:14:37.599525 - PARAMETER eval_features : 10833
DLL 2022-07-22 07:14:37.599550 - PARAMETER predict_batch_size : 4
DLL 2022-07-22 07:14:38.026544 - PARAMETER eval_start : True
Evaluating: 100%|██████████| 2709/2709 [14:38<00:00, 3.08it/s]
DLL 2022-07-22 07:14:38.028848 - PARAMETER sample_number : 0
DLL 2022-07-22 07:15:58.197601 - PARAMETER sample_number : 1000
DLL 2022-07-22 07:17:18.666884 - PARAMETER sample_number : 2000
DLL 2022-07-22 07:18:42.103769 - PARAMETER sample_number : 3000
DLL 2022-07-22 07:20:02.731350 - PARAMETER sample_number : 4000
DLL 2022-07-22 07:21:27.272142 - PARAMETER sample_number : 5000
DLL 2022-07-22 07:22:48.466350 - PARAMETER sample_number : 6000
DLL 2022-07-22 07:24:10.555928 - PARAMETER sample_number : 7000
DLL 2022-07-22 07:25:30.555780 - PARAMETER sample_number : 8000
DLL 2022-07-22 07:26:51.113897 - PARAMETER sample_number : 9000
DLL 2022-07-22 07:28:09.423703 - PARAMETER sample_number : 10000
DLL 2022-07-22 07:30:13.794959 - e2e_inference_time : 878.3164570331573 s inference_sequences_per_second : 12.333823319891458 sequences/s
DLL 2022-07-22 07:30:13.795107 - exact_match : 81.46641438032167
None
F1 : 88.68670097471168
None

real    16m26.373s
user    16m24.836s
sys     0m2.439s
```
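For comparison, the unmodified scripts/run_squad.sh presumably already contains an eval branch equivalent to the flags added to the prediction branch above. This is a hedged sketch, not copied from the repository; check the actual script:

```bash
# Hedged sketch of what an eval-mode branch in run_squad.sh would look like;
# $CMD, $mode, $squad_dir and $batch_size are defined earlier in that script.
if [ "$mode" = "eval" ] ; then
  CMD+="--do_predict "
  CMD+="--predict_file=$squad_dir/dev-v1.1.json "
  CMD+="--predict_batch_size=$batch_size "
  CMD+="--eval_script=$squad_dir/evaluate-v1.1.py "
  CMD+="--do_eval "
fi
```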
Expected behavior
I cannot understand why exact_match and F1 are each followed by None. You can see None in the results above (DLL 2022-07-22 07:30:13.795107 - exact_match : 81.46641438032167 None F1 : 88.68670097471168 None). I want to get exact_match and F1 without the None, and I don't understand what the None means. I also want to know how to get accuracy in prediction mode.
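One way to get exact_match and F1 for a prediction-only run is to call the SQuAD evaluation script directly on the prediction file that run_squad.py writes. This is a sketch that assumes the predictions land in a file named predictions.json under OUT_DIR; that filename is not shown in the log above, so check what is actually written to the output directory:

```bash
# Run the official SQuAD v1.1 evaluation script on an existing prediction file.
# evaluate-v1.1.py prints a JSON object with the exact_match and f1 scores.
python3 /workspace/bert/data/download/squad/v1.1/evaluate-v1.1.py \
    /workspace/bert/data/download/squad/v1.1/dev-v1.1.json \
    /workspace/bert/results/SQuAD/predictions.json
```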
Environment
Please provide at least:
- Container version (e.g. pytorch:19.05-py3):
- GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): GTX 1650
- CUDA driver version (e.g. 418.67):
You don't have to modify the script. When mode="prediction", only predictions are output. When mode="train eval" or mode="eval", metrics are computed and output.
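For example, based on the positional-argument layout quoted in the reproduction steps (a hedged example; the argument positions should be double-checked against the unmodified script), an eval-mode run might look like:

```bash
# Same setup as the prediction run above, but with mode set to "eval" so the
# script itself passes --do_eval and the evaluation script to run_squad.py.
bash scripts/run_squad.sh \
    /workspace/bert/checkpoints/bert_base_qa.pt \
    2.0 4 3e-5 0.1 fp32 1 1 \
    /workspace/bert/data/download/squad/v1.1 \
    /workspace/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
    /workspace/bert/results/SQuAD \
    eval \
    /workspace/bert/bert_configs/base.json
```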
Okay, then you mean that even though I got an F1 score with the modified script, the score is wrong? And I'm still confused about the difference between prediction and evaluation. Is it okay to use evaluation instead of inference?