GPT-J evaluation with multiple GPUs crashes
System Info
- `transformers` version: 4.21.0
- Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.4
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes (2+ RTX A6000)
- Using distributed or parallel set-up in script?: Yes
The issue appears when parallelizing with `python -m torch.distributed.launch --nproc_per_node=2` and also when parallelizing with `deepspeed`.
Who can help?
I hope @patil-suraj, @stas00, or @sgugger can help.
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Run the `run_clm.py` script from the examples directory:

```bash
python -m torch.distributed.launch --nproc_per_node=4 \
    /path/to/transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path "EleutherAI/gpt-j-6B" \
    --do_eval \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --output_dir "${output_dir}/output_fine_tune" \
    --eval_steps 1 \
    --evaluation_strategy steps \
    --per_device_eval_batch_size 4 \
    --block_size 2048
```
- The script crashes with the following error (both ranks print the same traceback; a single copy is shown here):

```
08/01/2022 08:51:08 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:2891] 2022-08-01 08:51:22,867 >> ***** Running Evaluation *****
[INFO|trainer.py:2893] 2022-08-01 08:51:22,868 >> Num examples = 119
[INFO|trainer.py:2896] 2022-08-01 08:51:22,868 >> Batch size = 4
Traceback (most recent call last):
  File "/path/to/transformers/examples/pytorch/language-modeling/run_clm.py", line 579, in <module>
    main()
  File "/path/to/transformers/examples/pytorch/language-modeling/run_clm.py", line 545, in main
    metrics = trainer.evaluate()
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in evaluate
    output = eval_loop(
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 2960, in evaluation_loop
    logits = self._nested_gather(logits)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer.py", line 3072, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in distributed_concat
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 178, in <genexpr>
    return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 181, in distributed_concat
    dist.all_gather(output_tensors, tensor)
  File "/nas/minlp/users/cwc/manuelc/miniconda3/envs/dsaiodocs/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2068, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be contiguous
```
Some debugging
- The crash only appears when the `compute_metrics` argument to `Trainer` is not `None`. In other words, replacing the line `compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,` with `compute_metrics=None` prevents the script from crashing.
- It looks like the logits on `Trainer` line 3181 (https://github.com/huggingface/transformers/blob/a9eee2ffecc874df7dd635b2c6abb246fdb318cc/src/transformers/trainer.py#L3181) are not contiguous.
- If I force the tensors to be contiguous with the patch below, `run_clm` no longer crashes. I do not think the issue is in `Trainer`, so the patch below is not a fix; I include it only to help with debugging.
Patch to make tensors contiguous
```diff
2850,2875d2849
<     def check_contiguous(self, tensor) -> Tuple[int, int]:
<         if tensor is None:
<             return 0, 0
<         if isinstance(tensor, (list, tuple)):
<             first = 0
<             total = 0
<             for t in tensor:
<                 f, t = self.check_contiguous(t)
<                 first += f
<                 total += t
<             return first, total
<         else:
<             f = 0
<             t = 1
<             if tensor.is_contiguous():
<                 f = 1
<             return f, t
<
<     def make_contiguous(self, tensor):
<         if tensor is None:
<             return None
<         if isinstance(tensor, (list, tuple)):
<             return tuple(self.make_contiguous(t) for t in tensor)
<         else:
<             return tensor.contiguous()
<
3208,3216d3181
<                 cont, total = self.check_contiguous(logits)
<                 if cont != total:
<                     print(
<                         f"[DebugTrainer] prediction_step, no sm, outputs dict logits (cont, total)"
<                         f"{(cont, total)}")
<                     logits = self.make_contiguous(logits)
<                     print(
<                         f"[DebugTrainer] prediction_step, no sm, outputs dict, after contiguous, logits (cont, total)"
<                         f"{self.check_contiguous(logits)}")
```
Expected behavior
The script should finish running and report the evaluation results (loss and accuracy).
The issue is probably in the modeling code missing some `.contiguous()` calls.
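For context, non-contiguous tensors typically come from stride-changing views such as `transpose`/`permute`, which attention implementations use heavily: the view shares storage with the original tensor but has non-standard strides, and the distributed gather above rejects it until `.contiguous()` materializes a compact copy. A minimal PyTorch illustration (a sketch for clarity, not from the original report):

```python
import torch

x = torch.randn(2, 8, 4)   # freshly allocated, contiguous
y = x.transpose(1, 2)      # same storage, swapped strides -> non-contiguous view
print(y.is_contiguous())   # False: the kind of tensor that triggers the error above
z = y.contiguous()         # copies data into a compact, row-major layout
print(z.is_contiguous())   # True
```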
I can reproduce the error with GPT-J. This also happens with Salesforce/codegen-16B-nl and EleutherAI/gpt-neox-20b; in all cases the error is `RuntimeError: Tensors must be contiguous`.
The problem doesn't occur with gpt2-xl or facebook/opt-13b.
This is on Transformers 4.21.1, also using 2x RTX A6000 GPUs.
The problem was also reproduced by another dev training gpt-neox-20b on 2x A6000.
Could this be A6000-related?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1 for this issue, still having problems with the `Tensors must be contiguous` error in evaluation.
I have the same problem.