ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

Open bomxacalaka opened this issue 1 year ago • 13 comments

Tried setting RANK to 1 with set RANK=1; it then asked for WORLD_SIZE, which I set to 1 as well, then MASTER_ADDR=localhost, and finally MASTER_PORT=12345.

Now it's stuck in a loop, repeatedly printing this failure message: [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:12345 (system error: 10049 - unknown error).

The way I ran it: I set everything up as instructed, edited example_chat_completion.py with the proper paths, and ran it in the right conda env. I'm on Windows, so I had to use bash to run download.sh; apart from that, everything was run in an admin cmd.

bomxacalaka avatar Jul 19 '23 06:07 bomxacalaka
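
A note on the attempt above: ranks are zero-based, so a single process with WORLD_SIZE=1 must use RANK=0, not 1, and using the literal loopback address may sidestep the kubernetes.docker.internal lookup in that socket error. A minimal single-process setup in cmd would look like this (the port is arbitrary; pick any free one):

    set RANK=0
    set WORLD_SIZE=1
    set MASTER_ADDR=127.0.0.1
    set MASTER_PORT=12345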

torchrun example_chat_completion.py ......

AshutoshDongare avatar Jul 19 '23 18:07 AshutoshDongare
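
For completeness, a plausible full single-GPU invocation in the README's style (the checkpoint dir, tokenizer path, and size limits are placeholders; adjust them to your download):

    torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

torchrun spawns the worker process itself and supplies the rendezvous variables, so nothing needs to be set by hand.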

This worked for me:

    !python -m torch.distributed.launch llama/example_chat_completion.py

jkyalo-go avatar Jul 23 '23 16:07 jkyalo-go

I have tried it, but it keeps asking for RANK info: ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

By the way, I'm on Windows 11.

Here is when I run it on cmd: https://pastebin.com/JcYcheD2

bomxacalaka avatar Jul 23 '23 17:07 bomxacalaka

@bomxacalaka, I'm having the exact same error here. Have you been able to resolve it yet?

Chuukwudi avatar Jul 24 '23 09:07 Chuukwudi

I get the same error on Windows 2019.

IgorSvazincev avatar Jul 25 '23 09:07 IgorSvazincev

Hi, I am getting this on Linux: ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

ClaudiaGiardina90 avatar Jul 26 '23 08:07 ClaudiaGiardina90

Update: after reading about the launch method and torch.distributed.init_process_group, I found that launch was missing the dist_url parameter, which in my case was set to None. I set it to "auto" and now it works:

    dist.launch(main, n_gpu_per_machine=args.n_gpu, n_machine=1, machine_rank=0, dist_url="auto", args=(args,))

Note: I am using tensorfn. Hope this helps you.

ClaudiaGiardina90 avatar Jul 27 '23 07:07 ClaudiaGiardina90

torchrun example_chat_completion.py ......

But how do you run it in Jupyter, please? Putting this .py into Jupyter still hits the same problem at this step:

    if not torch.distributed.is_initialized():
        torch.distributed.init_process_group("nccl")

ShiningorDying avatar Nov 13 '23 09:11 ShiningorDying

For Windows, you could try to replace

torch.distributed.init_process_group("nccl")

in generation.py, with something like this:

 import os

 # Point the env:// rendezvous at the local machine and initialize a
 # single-process group with the Windows-friendly gloo backend.
 os.environ['MASTER_ADDR'] = 'localhost'
 os.environ['MASTER_PORT'] = '12345'
 torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)

This allows me to load the 7B model on a single GPU.

sramshetty avatar Mar 09 '24 11:03 sramshetty

Has anyone been able to solve this problem? If so, what is the solution?

RylanSchaeffer avatar Apr 28 '24 01:04 RylanSchaeffer

This worked for me:

    !python -m torch.distributed.launch llama/example_chat_completion.py

It works, but why?

guotong1988 avatar May 30 '24 02:05 guotong1988
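
For anyone wondering why the launcher fixes the error: both python -m torch.distributed.launch and torchrun spawn the worker processes themselves and export RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker's environment (torchrun also sets LOCAL_RANK), which is exactly what the env:// rendezvous reads. A quick way to confirm from inside the launched script:

    import os

    # Populated by torchrun / torch.distributed.launch, which is why
    # env:// rendezvous no longer complains about missing variables.
    for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
        print(var, "=", os.environ.get(var))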

How do you invoke 'torch.distributed.launch' from inside a .py file?

guotong1988 avatar Jun 04 '24 01:06 guotong1988
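
One option, sketched here under the assumption that a single process is enough: skip the launcher module entirely and spawn the workers yourself with torch.multiprocessing, passing the rank and world size explicitly via a tcp:// init_method so no environment variables are needed. This is illustrative, not the official llama entry point; worker and the port number are made up:

    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # A tcp:// init_method avoids env:// entirely, so RANK,
        # MASTER_ADDR, and friends never have to be set by hand.
        dist.init_process_group(
            backend="gloo",  # gloo also works on Windows and CPU
            init_method="tcp://127.0.0.1:29500",
            rank=rank,
            world_size=world_size,
        )
        # ... load the model and run inference here ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 1  # mp.spawn passes the rank as the first argument
        mp.spawn(worker, args=(world_size,), nprocs=world_size)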

torchrun example_chat_completion.py ......

But how do you run it in Jupyter, please? Putting this .py into Jupyter still hits the same problem at this step:

    if not torch.distributed.is_initialized():
        torch.distributed.init_process_group("nccl")

You can use Python's os package to set the environment variables, like this:

import os

# Set the environment variables
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '5678'

Skylarking avatar Jul 24 '24 03:07 Skylarking
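
One caveat to the snippet above: WORLD_SIZE must match the number of processes that will actually join the group, so in a single notebook kernel it should be '1'; otherwise init_process_group will block waiting for the missing ranks. A minimal sketch of the whole Jupyter cell, assuming a CPU/Windows-friendly gloo backend:

    import os
    import torch.distributed as dist

    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '1'   # must equal the number of participating processes
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '5678'

    # env:// rendezvous now finds everything it needs.
    if not dist.is_initialized():
        dist.init_process_group(backend='gloo')  # "nccl" needs NVIDIA GPUs on Linux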