
GPT-J evaluation with multiple GPUs crashes

Open · manuelciosici opened this issue on Aug 01 '22 · 3 comments

System Info

  • transformers version: 4.21.0
  • Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.4
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (2+ RTX A6000)
  • Using distributed or parallel set-up in script?: Yes

The issue appears both when parallelizing with python -m torch.distributed.launch --nproc_per_node=2 and when parallelizing with DeepSpeed.

Who can help?

I hope @patil-suraj, @stas00, or @sgugger can help.

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Run the run_clm.py script from the examples directory: python -m torch.distributed.launch --nproc_per_node=4 /path/to/transformers/examples/pytorch/language-modeling/run_clm.py --model_name_or_path "EleutherAI/gpt-j-6B" --do_eval --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --output_dir "${output_dir}/output_fine_tune" --eval_steps 1 --evaluation_strategy steps --per_device_eval_batch_size 4 --block_size 2048
  2. The script crashes, with each rank printing the following error:
08/01/2022 08:51:08 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:2891] 2022-08-01 08:51:22,867 >> ***** Running Evaluation *****
[INFO|trainer.py:2893] 2022-08-01 08:51:22,868 >>   Num examples = 119
[INFO|trainer.py:2896] 2022-08-01 08:51:22,868 >>   Batch size = 4
Traceback (most recent call last):
  File "/path/to/transformers/examples/pytorch/language-modeling/run_clm.py", line 579, in <module>
    main()
  File "/path/to/transformers/examples/pytorch/language-modeling/run_clm.py", line 545, in main
    metrics = trainer.evaluate()
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in evaluate
    output = eval_loop(
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 2960, in evaluation_loop
    logits = self._nested_gather(logits)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 3072, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 181, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2068, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be contiguous

Some debugging

  • The crash only appears when the compute_metrics argument to Trainer is not None. In other words, replacing the line compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None, with compute_metrics=None prevents the script from crashing.
  • It looks like the logits on Trainer line 3181 https://github.com/huggingface/transformers/blob/a9eee2ffecc874df7dd635b2c6abb246fdb318cc/src/transformers/trainer.py#L3181 are not contiguous.
  • If I force the tensors to be contiguous with the patch below, run_clm no longer crashes (see the standalone sketch right after this list). I do not think the issue is in Trainer, so the patch below is not a fix; I include it only to help with debugging.
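
For reference, here is a minimal standalone sketch of the constraint the gather runs into. It is illustrative only: the file name, the torchrun launch line, and the use of the gloo backend are assumptions made for a small CPU-only demo, not part of the original report (the crash above happened on the NCCL path).

# repro_contiguous_gather.py (hypothetical file name)
# Launch with, e.g.: torchrun --nproc_per_node=2 repro_contiguous_gather.py
import torch
import torch.distributed as dist

def main():
    # gloo keeps the sketch CPU-only; whether gloo enforces exactly the same
    # contiguity check as NCCL may vary by PyTorch version.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # transpose() returns a view with non-contiguous strides, the same kind of
    # memory layout the gathered logits apparently have.
    logits = torch.arange(12.0).reshape(3, 4).transpose(0, 1)
    print(f"rank {rank}: logits.is_contiguous() = {logits.is_contiguous()}")  # False

    buffers = [torch.empty(logits.shape) for _ in range(world_size)]
    # Passing `logits` directly is what fails in the traceback above
    # ("Tensors must be contiguous" from default_pg.allgather);
    # a dense copy made with .contiguous() goes through.
    dist.all_gather(buffers, logits.contiguous())
    print(f"rank {rank}: gathered {len(buffers)} tensors of shape {tuple(buffers[0].shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()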

Patch to make tensors contiguous

2850,2875d2849
<     def check_contiguous(self, tensor) -> Tuple[int, int]:
<         if tensor is None:
<             return 0, 0
<         if isinstance(tensor, (list, tuple)):
<             first = 0
<             total = 0
<             for t in tensor:
<                 f, n = self.check_contiguous(t)
<                 first += f
<                 total += n
<             return first, total
<         else:
<             f = 0
<             t = 1
<             if tensor.is_contiguous():
<                 f = 1
<             return f, t
<
<     def make_contiguous(self, tensor):
<         if tensor is None:
<             return None
<         if isinstance(tensor, (list, tuple)):
<             return tuple(self.make_contiguous(t) for t in tensor)
<         else:
<             return tensor.contiguous()
<
3208,3216d3181
<                         cont, total = self.check_contiguous(logits)
<                         if cont != total:
<                             print(
<                                 f"[DebugTrainer] prediction_step, no sm, outputs dict logits (cont, total)"
<                                 f"{(cont, total)}")
<                             logits = self.make_contiguous(logits)
<                             print(
<                                 f"[DebugTrainer] prediction_step, no sm, outputs dict, after contiguous, logits (cont, total)"
<                                 f"{self.check_contiguous(logits)}")
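
In words: check_contiguous recursively counts how many tensors in the (possibly nested) logits structure are already contiguous, make_contiguous rebuilds the structure as tuples of .contiguous() copies, and, judging by the debug strings, the second hunk applies this inside prediction_step right before the logits are handed back, which is why the later all_gather no longer fails.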

Expected behavior

The script should finish running and report the evaluation results (loss and accuracy).

manuelciosici commented on Aug 01 '22

The issue is probably that the modeling code is missing some .contiguous() calls.
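
As a rough illustration only (this is not the actual GPT-J code, nor the eventual fix), the pattern in question is a view-producing op such as a transpose, permute, or slice whose result keeps non-contiguous strides; .contiguous() materializes a dense copy with the same values, which the gather then accepts:

import torch

logits = torch.randn(2, 5, 16)[:, :-1, :]  # slicing yields a strided view
print(logits.is_contiguous())               # False
dense = logits.contiguous()                 # dense copy, safe to all_gather
print(dense.is_contiguous())                # True
print(torch.equal(dense, logits))           # same values, different memory layout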

sgugger commented on Aug 01 '22

I can reproduce the error with GPT-J. This also happens with Salesforce/codegen-16B-nl and EleutherAI/gpt-neox-20b. In all cases the error is RuntimeError: Tensors must be contiguous.

The problem doesn't occur with gpt2-xl and facebook/opt-13b.

This is on Transformers 4.21.1, also using 2x RTX A6000 GPUs.

The problem was also reproduced by another dev training gpt-neox-20b on 2x A6000s.

Could this be A6000-related?

timohear commented on Aug 24 '22

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] commented on Sep 17 '22