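My resources are limited: I only have one machine with a single GPU. After finishing the configuration, I cannot get past the following NCCL error: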
Traceback (most recent call last):
File "finetune_text_generation_src.py", line 324, in <module>
main()
File "finetune_text_generation_src.py", line 208, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "/root/workspace/CPM-1-Finetune-Text-Generation/utils.py", line 493, in setup_model_and_optimizer
model = get_model(args, model_cls)
File "/root/workspace/CPM-1-Finetune-Text-Generation/utils.py", line 419, in get_model
model = DDP(model)
File "/root/workspace/CPM-1-Finetune-Text-Generation/model/distributed.py", line 35, in __init__
dist.broadcast(p, src_rank, group=self.data_parallel_group)
File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 846, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
Traceback (most recent call last):
File "/root/anaconda3/envs/cpm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/anaconda3/envs/cpm/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/cpm/bin/python3', '-u', 'finetune_text_generation_src.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed/', '--model-parallel-size', '1', '--num-layers', '5', '--hidden-size', '2560', '--load', 'checkpoints/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/root/workspace/CPM-1-Finetune-Text-Generation/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations']' returned non-zero exit status 1.
Could anyone advise how to train on a single GPU while avoiding NCCL?
For a single GPU, the script parameters nproc_per_node and model-parallel-size both need to be changed to 1.
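
As a rough illustration of the reply above, here is a minimal sketch that assembles a single-GPU launch command with both values set to 1. `build_launch_cmd` is a hypothetical helper, not part of the repository; `--nproc_per_node` is the `torch.distributed.launch` option and `--model-parallel-size` is taken from the command shown in the log. The divisibility check is an assumption about how Megatron-style code usually splits ranks, and the remaining script arguments from the log would still have to be passed through `extra_args`.

```python
import sys


def build_launch_cmd(script, nproc_per_node=1, model_parallel_size=1, extra_args=()):
    """Build a torch.distributed.launch command line for one machine.

    Hypothetical helper for illustration only; it does not run anything.
    """
    if nproc_per_node < 1 or model_parallel_size < 1:
        raise ValueError("nproc_per_node and model_parallel_size must be >= 1")
    # Assumption: the number of processes must be divisible by the
    # model-parallel size, as is typical for Megatron-style training code.
    if nproc_per_node % model_parallel_size != 0:
        raise ValueError("nproc_per_node must be divisible by model_parallel_size")
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc_per_node),
        script,
        "--model-parallel-size", str(model_parallel_size),
        *extra_args,
    ]


# Single GPU: both values are 1, as suggested in the reply.
cmd = build_launch_cmd(
    "finetune_text_generation_src.py",
    extra_args=["--do_train", "--batch-size", "1"],
)
print(" ".join(cmd))
```

Printing the list first makes it easy to compare against the launch script in the repository before actually running it, for example with `subprocess.run(cmd, check=True)`.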