
[Chatllama] Errors when training actor model based on LLaMA-7B

Open young-chao opened this issue 1 year ago • 4 comments

```
root@b787722dc2e1:/workspace/workfile/Projects/chatllama# python artifacts/main.py artifacts/config/config.yaml --type ACTOR
Current device used :cuda
local_rank: -1
world_size: -1
Traceback (most recent call last):
  File "/workspace/workfile/Projects/chatllama/artifacts/main.py", line 50, in <module>
    actor_trainer = ActorTrainer(config.actor)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 292, in __init__
    self.model = ActorModel(config)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 52, in __init__
    local_rank, world_size = setup_model_parallel()
  File "/usr/local/lib/python3.9/site-packages/chatllama/llama_model.py", line 551, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
```

young-chao avatar Mar 12 '23 08:03 young-chao

Do I need to use torchrun to start?
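For context on why `torchrun` would help: torchrun exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for each worker it spawns, which is exactly what the `env://` rendezvous in the traceback is looking for. A small sketch of that lookup (the helper `read_rendezvous_env` is hypothetical, written to mirror the error message, not torch's actual code):

```python
def read_rendezvous_env(env):
    """Hypothetical sketch of what the env:// rendezvous expects.

    torchrun exports these variables for every worker it launches;
    a plain `python` launch leaves them unset, which produces the
    ValueError shown in the traceback above.
    """
    required = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        raise ValueError(
            f"environment variable {missing[0]} expected, but not set"
        )
    return int(env["RANK"]), int(env["WORLD_SIZE"])

# Environment as torchrun would populate it for a single worker:
worker_env = {"RANK": "0", "WORLD_SIZE": "1",
              "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}
print(read_rendezvous_env(worker_env))  # (0, 1)
```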

young-chao avatar Mar 12 '23 08:03 young-chao

I got the error too! `WORLD_SIZE` cannot be read either.

leonselina avatar Mar 12 '23 15:03 leonselina

> ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

You can set the environment variables:

```shell
export WORLD_SIZE=1
export LOCAL_RANK=0
export RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=23456

python artifacts/main.py artifacts/config/config.yaml --type ACTOR
```
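The exports above effectively declare a single-process "cluster". The same thing can be done programmatically before the training script calls `torch.distributed.init_process_group("nccl")`; this is a sketch with a hypothetical helper (not part of chatllama), using `setdefault` so any values already exported (e.g. by torchrun) are left untouched:

```python
import os

def ensure_single_process_env():
    """Hypothetical helper: fill in single-process defaults for the
    env:// rendezvous. Only missing variables are set, so a real
    multi-process launch keeps its own values."""
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "23456",
    }
    for name, value in defaults.items():
        os.environ.setdefault(name, value)
    return int(os.environ["LOCAL_RANK"]), int(os.environ["WORLD_SIZE"])

# Call this before torch.distributed.init_process_group("nccl"):
local_rank, world_size = ensure_single_process_env()
```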

jyGuan avatar Mar 14 '23 05:03 jyGuan

@jyGuan Great work! Would you mind opening a PR adding a Distributed section with instructions for setting these environment variables? It would be a great help for everyone facing the same issue. Thank you!

PierpaoloSorbellini avatar Mar 14 '23 09:03 PierpaoloSorbellini