ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
Tried setting RANK to 1 with set RANK=1; it then asked for WORLD_SIZE, which I set to 1 as well, then MASTER_ADDR=localhost and finally MASTER_PORT=12345.
Now it's stuck in a loop, printing the same failure message over and over: [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:12345 (system error: 10049 - unknown error).
Here is how I ran it: I set everything up as instructed, edited example_chat_completion.py with the proper paths, and then ran it in the right conda env. I'm on Windows, so I had to use bash to run download.sh; apart from that, everything else was run in an admin cmd.
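One note on the values above: ranks are numbered 0 to WORLD_SIZE-1, so with WORLD_SIZE=1 the rank has to be 0, not 1. Also, the 10049 error suggests the rendezvous resolved the master host to kubernetes.docker.internal (an entry Docker Desktop adds to the Windows hosts file); pointing MASTER_ADDR at the loopback IP directly may avoid that. A sketch of corrected values, set from Python rather than cmd (the port is just the one used above):
import os

# Ranks run 0..WORLD_SIZE-1, so a single process is rank 0.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
# Explicit loopback IP, so nothing resolves to kubernetes.docker.internal.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12345"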
torchrun example_chat_completion.py ......
!python -m torch.distributed.launch llama/example_chat_completion.py
This worked for me
I have tried it, but it keeps asking for the RANK info:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
By the way, I'm on Windows 11.
Here is when I run it on cmd: https://pastebin.com/JcYcheD2
@bomxacalaka, having the exact same error here. Please, have you been able to resolve this yet?
I get the same error on Windows 2019.
Hi, I am getting this on Linux: ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set
Update: after reading about the launch method and torch.distributed.init_process_group, I found that launch was missing the dist_url parameter, which in my case was set to None. I set it to "auto" and now it works: dist.launch(main, n_gpu_per_machine=args.n_gpu, n_machine=1, machine_rank=0, dist_url="auto", args=(args,)). Note: I am using tensorfn. Hope this helps you.
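The same knob exists in plain PyTorch as the init_method argument of init_process_group: a tcp:// init method carries the address explicitly, so none of the env:// variables have to be set in the shell. A sketch for a single process (not the tensorfn call above; the port is arbitrary):
import torch.distributed as dist

# tcp:// rendezvous passes the master address explicitly, so RANK,
# MASTER_ADDR etc. are not read from environment variables.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)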
torchrun example_chat_completion.py ......
but how do you run it in Jupyter, please? Putting this .py in Jupyter still hits the same problem at this step:
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
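One way that may work in a notebook (a sketch, assuming a single process and a backend that runs without torchrun) is to set the rendezvous variables yourself in a cell before that init call:
import os
import torch.distributed

# Set the variables torchrun would normally export.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"  # any free port

if not torch.distributed.is_initialized():
    # gloo runs on CPU and on Windows; nccl needs NVIDIA GPUs.
    torch.distributed.init_process_group("gloo")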
For Windows you could try to replace
torch.distributed.init_process_group("nccl")
in generation.py with something like this:
import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
# gloo works on Windows/CPU; rank 0 with world_size 1 means a single process
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)
This allows me to load the 7B model on a single GPU.
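If you want to sanity-check that replacement outside the repo first, the same call works standalone (a sketch; the port is arbitrary, any free one will do):
import os
import torch.distributed

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)
# Should print: 0 1
print(torch.distributed.get_rank(), torch.distributed.get_world_size())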
Has anyone been able to solve this problem? If so, what is the solution?
!python -m torch.distributed.launch llama/example_chat_completion.py
This worked for me
It works, but why?
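Presumably because torch.distributed.launch (like torchrun) starts your script in a subprocess with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT already exported, so the env:// rendezvous inside init_process_group finds them without you setting anything. You can check from inside the launched script:
import os

# These are exported by torch.distributed.launch / torchrun, not by your shell.
for var in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var))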
How do I put torch.distributed.launch into a .py file?
torchrun example_chat_completion.py ......
but how do you run it in Jupyter, please? Putting this .py in Jupyter still hits the same problem at this step:
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
You can use the Python os package to set the env vars, like this:
import os

# Set the environment variables
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '5678'
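One caveat with the snippet above: WORLD_SIZE must match the number of processes you actually start, so for a single process use '1' (with '4', init_process_group will block waiting for three more workers). After that, the env:// rendezvous succeeds without torchrun; a sketch assuming one process on the gloo backend:
import os
import torch.distributed

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'  # one process; '4' would wait for four workers
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '5678'

# env:// (the default) now finds everything it needs in the environment.
torch.distributed.init_process_group(backend='gloo')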