
Error in training on systems with only one GPU

ChidanandKumarKS opened this issue 2 years ago • 2 comments

Describe the bug

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutXLM (microsoft/layoutxlm-base)

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

On a machine with a single GPU (no distributed launch), training completes, but evaluation crashes: `xfun_trainer.py` calls `torch.distributed.get_rank()` even though no default process group has been initialized.

To Reproduce
Steps to reproduce the behavior:

  1. Run the XFUN relation-extraction example on a single-GPU machine:

```
python examples/run_xfun_re.py --model_name_or_path microsoft/layoutxlm-base --output_dir /tmp/test-ner --do_train --do_eval --lang zh --max_steps 2500 --per_device_train_batch_size 2 --warmup_ratio 0.1 --fp16
```

Expected behavior
Training and evaluation both run to completion on a single GPU, without requiring `torch.distributed` to be initialized.

  • Platform: RTX 3090, Ubuntu 20.04
  • Python version: 3.7
  • PyTorch version: 1.10 (GPU: yes)

Logs:

```
Traceback (most recent call last):
  File "examples/run_xfun_re.py", line 245, in <module>
    main()
  File "examples/run_xfun_re.py", line 230, in main
    metrics = trainer.evaluate()
  File "/home/chowkam/chowkamWkspc/unilm-master/layoutlmft/layoutlmft/trainers/xfun_trainer.py", line 178, in evaluate
    self.args.local_rank = torch.distributed.get_rank()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
```

— ChidanandKumarKS, Nov 26 '22

I'm facing the same issue. Training works on 1 GPU, but the error occurs during evaluation.

— PritikaRamu, Feb 09 '23

I got the same problem and fixed it by changing `self.args.local_rank = torch.distributed.get_rank()` to `self.args.local_rank = -1` (xfun_trainer.py, line 178).

— hiijar, Feb 24 '23