Error in training on systems with only one GPU
Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutXLM
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)

Training fails on a system with only one GPU: the XFUN trainer assumes a distributed run and calls torch.distributed.get_rank() even though no distributed process group has been initialized (see the fix discussed below).
To Reproduce
Steps to reproduce the behavior:
- python examples/run_xfun_re.py --model_name_or_path microsoft/layoutxlm-base --output_dir /tmp/test-ner --do_train --do_eval --lang zh --max_steps 2500 --per_device_train_batch_size 2 --warmup_ratio 0.1 --fp16
Expected behavior
Training completes on a single GPU without requiring a distributed (multi-GPU) setup.
- Platform: RTX 3090, Ubuntu 20.04
- Python version: 3.7
- PyTorch version (GPU?): 1.10
Logs:
File "examples/run_xfun_re.py", line 245, in
I'm facing the same issue. Training works on one GPU, but the error appears during evaluation.
I hit the same problem and fixed it by changing "self.args.local_rank = torch.distributed.get_rank()" to "self.args.local_rank = -1" (xfun_trainer.py, line 178).
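A more defensive variant of that workaround is to only query the rank when a process group actually exists. This is a sketch, not the repo's code: the `resolve_local_rank` helper is hypothetical, and the exact context around xfun_trainer.py line 178 is assumed from this thread.

```python
import torch.distributed as dist

def resolve_local_rank() -> int:
    """Return the distributed rank, or -1 when running without
    torch.distributed (e.g. on a single-GPU machine).

    -1 is the value Hugging Face training arguments use to signal
    a non-distributed run."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return -1

# In xfun_trainer.py (around line 178, per the comment above),
# the hard-coded call would then become:
#     self.args.local_rank = resolve_local_rank()
```

This keeps multi-GPU distributed training working while letting the same code path run unchanged on a single GPU.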