nebuly
[Chatllama] Errors when training actor model based on LLaMA-7B
root@b787722dc2e1:/workspace/workfile/Projects/chatllama# python artifacts/main.py artifacts/config/config.yaml --type ACTOR
Current device used :cuda
local_rank: -1 world_size: -1
Traceback (most recent call last):
  File "/workspace/workfile/Projects/chatllama/artifacts/main.py", line 50, in <module>
Do I need to use torchrun to start?
I got this error too! The WORLD_SIZE environment variable cannot be read.
root@b787722dc2e1:/workspace/workfile/Projects/chatllama# python artifacts/main.py artifacts/config/config.yaml --type ACTOR
Current device used :cuda
local_rank: -1 world_size: -1
Traceback (most recent call last):
  File "/workspace/workfile/Projects/chatllama/artifacts/main.py", line 50, in <module>
    actor_trainer = ActorTrainer(config.actor)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 292, in __init__
    self.model = ActorModel(config)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line 52, in __init__
    local_rank, world_size = setup_model_parallel()
  File "/usr/local/lib/python3.9/site-packages/chatllama/llama_model.py", line 551, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 236, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 221, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
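The traceback stops at the first variable the env:// rendezvous fails to find (RANK), so other variables may be missing as well. A minimal sketch for listing everything that is unset before launching, assuming the rendezvous needs RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT; the helper name `missing_rendezvous_vars` is hypothetical and not part of chatllama or PyTorch:

```python
import os

# Variables that torch.distributed's env:// rendezvous is assumed to read.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the required variables absent from env, in the order listed above.

    Hypothetical helper for illustration only; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All rendezvous variables are set.")
```

Running this in the same shell as the training command shows which exports are still needed.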
You can set the environment variables:
export WORLD_SIZE=1
export LOCAL_RANK=0
export RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=23456
python artifacts/main.py artifacts/config/config.yaml --type ACTOR
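The same single-process values could also be applied from Python instead of the shell. This is only a sketch: the helper name `apply_single_process_defaults` is hypothetical, the values are copied from the exports above, and it would have to run before anything calls `torch.distributed.init_process_group` (for example at the top of a launcher script).

```python
import os

# Single-process values copied from the export commands above.
SINGLE_PROCESS_DEFAULTS = {
    "WORLD_SIZE": "1",
    "LOCAL_RANK": "0",
    "RANK": "0",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "23456",
}


def apply_single_process_defaults(env=None):
    """Fill in any missing distributed variables without overwriting existing ones.

    Hypothetical helper for illustration only; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    for name, value in SINGLE_PROCESS_DEFAULTS.items():
        # setdefault keeps a value that is already present.
        env.setdefault(name, value)
    return env


if __name__ == "__main__":
    apply_single_process_defaults()
    for name in SINGLE_PROCESS_DEFAULTS:
        print(f"{name}={os.environ[name]}")
```

Because existing values are left alone, the helper should not interfere with a launcher such as torchrun, which sets these variables for each worker itself.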
@jyGuan Great work! Would you mind opening a PR that adds a Distributed section with instructions for setting the environment variables? It would be a great help to everyone facing the same issue. Thank you!