
[BERT/PyTorch] Exception pop-out when trying to distillation on SQuAD dataset


Related to BERT/PyTorch

Describe the bug
When I try to reproduce the BERT distillation training results on the SQuAD dataset as described in https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/distillation, using the script https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/distillation/run_e2e_distillation.sh, I hit the following two exceptions:

  1. In the stage '#Distillation SQUAD, backbone loss', I hit a "RuntimeError: Socket Timeout" while executing 'load_and_cache_examples', as follows (a possible workaround sketch follows the traceback):
...
convert squad examples to features:   4%|▍         | 57505/1397275 [14:21<6:50:58, 54.33it/s]Traceback (most recent call last):
  File "task_distill.py", line 1351, in <module>
    main()
  File "task_distill.py", line 917, in main
    train_data = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "/ngc_0/Chong_dxxz_Projects/Gitlab/Efficient_Transformers_Exploration/DeepLearningExamples/LanguageModeling/BERT/distillation/utils/squad/squad_utils.py", line 877, in load_and_cache_examples
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2735, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
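A possible mitigation, sketched below and untested: converting the ~1.4M SQuAD examples to features on the rank-0 process is estimated at close to 7 hours, far longer than the default 30-minute process-group timeout, so the barrier in 'load_and_cache_examples' times out. Raising the timeout when the process group is initialized might avoid this; the init location and arguments below are my assumption, not the actual code in task_distill.py.

```python
# Sketch only: enlarge the collective timeout so torch.distributed.barrier()
# survives the long feature-conversion step on rank 0. Where and how
# task_distill.py initializes the process group is an assumption here.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",               # backend used for the multi-GPU run (assumption)
    init_method="env://",
    timeout=timedelta(hours=8),   # default is 30 minutes, well below the ~7 h conversion
)
```

(For NCCL, older PyTorch versions only honor this timeout when blocking wait / async error handling is enabled via the corresponding NCCL_* environment variables.)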
  2. In the stage '#Distillation SQUAD, prediction loss', I hit an "OSError" while executing 'task_distill.py', as follows (a possible workaround sketch follows the error message):
...
OSError: Model name 'checkpoints/nv_distill_squad' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). We assumed 'checkpoints/nv_distill_squad' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
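For the second error, a possible and equally untested workaround sketch: the prediction-loss stage loads the tokenizer from 'checkpoints/nv_distill_squad', which apparently does not contain a vocab.txt, so copying a compatible vocabulary file into that directory before the stage runs might unblock it. The source path below is hypothetical; the actual location of the vocab file produced by the earlier stages may differ.

```python
# Sketch only: place a vocab.txt where the tokenizer loader expects it.
# The source checkpoint directory below is a hypothetical placeholder.
import shutil

shutil.copy(
    "checkpoints/general_distill/vocab.txt",   # hypothetical source of a compatible vocab
    "checkpoints/nv_distill_squad/vocab.txt",  # path named in the OSError above
)
```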

To Reproduce
Steps to reproduce the behavior:

  1. Git clone 'https://github.com/NVIDIA/DeepLearningExamples', cd DeepLearningExamples/PyTorch/LanguageModeling/BERT/distillation
  2. Build BERT on top of the NGC container: 'bash scripts/docker/build.sh', then start an interactive session in the NGC container to run training/inference: 'bash scripts/docker/launch.sh'
  3. Setup Knowledge Distillation on BERT: 'bash utils/perform_distillation_prereqs.sh'
  4. Run the following to produce fully distilled BERT models for SQuADv1.1 and SST-2: 'bash run_e2e_distillation.sh'
  5. Both the '#Distillation SQUAD, backbone loss' and '#Distillation SQUAD, prediction loss' stages hit the exceptions described above.

Expected behavior
Distillation phase 1 & phase 2 and the downstream distillation on SQuAD v1.1 and SST-2 should complete without errors.

Environment

  • Container version: pytorch:21.11-py3
  • GPUs in the system: 8x Tesla V100-32GB
  • CUDA version: 11.5
  • CUDA driver version: 450.119.04
