
[Chatllama] Errors when training actor model based on LLaMA-7B

Open young-chao opened this issue 1 year ago • 4 comments

```
root@b787722dc2e1:/workspace/workfile/Projects/chatllama# python artifacts/main.py artifacts/config/config.yaml --type ACTOR
Current device used :cuda
local_rank: -1
world_size: -1
Traceback (most recent call last):
  File "/workspace/workfile/Projects/chatllama/artifacts/main.py", line 50, in <module>
    actor_trainer = ActorTrainer(config.actor)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 292, in __init__
    self.model = ActorModel(config)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 52, in __init__
    local_rank, world_size = setup_model_parallel()
  File "/usr/local/lib/python3.9/site-packages/chatllama/llama_model.py", line 551, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
```

young-chao avatar Mar 12 '23 08:03 young-chao

Do I need to use torchrun to start?
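For context on why `torchrun` would help: torchrun exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for each worker it spawns, which is exactly what the `env://` rendezvous in the traceback is looking for. A small sketch of that lookup (the helper `read_rendezvous_env` is hypothetical, written to mirror the error message, not torch's actual code):

```python
def read_rendezvous_env(env):
    """Hypothetical sketch of what the env:// rendezvous expects.

    torchrun exports these variables for every worker it launches;
    a plain `python` launch leaves them unset, which produces the
    ValueError shown in the traceback above.
    """
    required = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        raise ValueError(
            f"environment variable {missing[0]} expected, but not set"
        )
    return int(env["RANK"]), int(env["WORLD_SIZE"])

# Environment as torchrun would populate it for a single worker:
worker_env = {"RANK": "0", "WORLD_SIZE": "1",
              "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}
print(read_rendezvous_env(worker_env))  # (0, 1)
```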

young-chao avatar Mar 12 '23 08:03 young-chao

I got the error too! `WORLD_SIZE` cannot be read either.

leonselina avatar Mar 12 '23 15:03 leonselina

> ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

You can set the environment variables:

```shell
export WORLD_SIZE=1
export LOCAL_RANK=0
export RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=23456

python artifacts/main.py artifacts/config/config.yaml --type ACTOR
```
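The exports above effectively declare a single-process "cluster". The same thing can be done programmatically before the training script calls `torch.distributed.init_process_group("nccl")`; this is a sketch with a hypothetical helper (not part of chatllama), using `setdefault` so any values already exported (e.g. by torchrun) are left untouched:

```python
import os

def ensure_single_process_env():
    """Hypothetical helper: fill in single-process defaults for the
    env:// rendezvous. Only missing variables are set, so a real
    multi-process launch keeps its own values."""
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "23456",
    }
    for name, value in defaults.items():
        os.environ.setdefault(name, value)
    return int(os.environ["LOCAL_RANK"]), int(os.environ["WORLD_SIZE"])

# Call this before torch.distributed.init_process_group("nccl"):
local_rank, world_size = ensure_single_process_env()
```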

jyGuan avatar Mar 14 '23 05:03 jyGuan

@jyGuan Great work! Would you mind opening a PR adding a Distributed section with instructions for setting these environment variables? It would be a great help for everyone facing the same issue. Thank you!

PierpaoloSorbellini avatar Mar 14 '23 09:03 PierpaoloSorbellini