Open-Sora
inference bug
When I run the mentioned script, the following error occurs. It seems to be related to the call "colossalai.launch_from_torch({})" in inference.py.
How can I solve it? Thanks!
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (nma08-101-c-07-sev-nf5468-04u04, 52925) cannot be retrieved (gai error: -2 - Name or service not known).
Hi, could you reproduce this issue with the following steps?
- write a test.py:
import torch
torch.distributed.init_process_group(backend="nccl")
print("test")
- invoke this script with the following command:
torchrun --standalone --nproc_per_node 1 test.py
I just tried the steps above, and the script runs normally.
I see, that's weird. While I try to debug this, you can replace colossalai.launch() with the following lines first.
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
I ran test.py on a 4090 and still get the same error.
Same problem here. I also get this warning:
[2024-03-18 17:04:07,488] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
I tried adding one line to inference.py to set MASTER_ADDR before the launch call:
os.environ["MASTER_ADDR"] = "127.0.0.1"
colossalai.launch_from_torch({})
But it didn't work.
I ran into the same problem. I don't know how to solve it.
I have already installed torch in my Anaconda environment, but for some reason torch still cannot be found.
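One thing that may be worth checking here is whether the interpreter running the script is the one from the Anaconda environment where torch was installed. The snippet below is only a diagnostic sketch; `describe_module` is a hypothetical helper name, not part of Open-Sora or ColossalAI.

```python
import importlib.util
import sys


def describe_module(name: str) -> dict:
    """Report which interpreter is running and whether it can find `name`.

    Hypothetical helper for illustration only.
    """
    # find_spec looks the module up on sys.path without importing it.
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    info = describe_module("torch")
    print(f"interpreter: {info['interpreter']}")
    print(f"torch found: {info['found']} ({info['location']})")
```

If the printed interpreter path does not point into the environment where torch was installed, the script is probably being run with a different Python.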
I solved the problem by manually adding the local hostname to "/etc/hosts".
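The "gai error: -2 - Name or service not known" in the log above means the system resolver could not look up the machine's hostname. To check whether that is the case before editing "/etc/hosts", a small sketch like the following may help. `hostname_resolves` is a hypothetical helper name, and the suggested hosts entry is an assumption about a typical single-machine setup, not something confirmed in this thread.

```python
import socket


def hostname_resolves(name: str) -> bool:
    """Return True if the system resolver maps `name` to at least one address.

    Hypothetical helper for illustration only.
    """
    try:
        return len(socket.getaddrinfo(name, None)) > 0
    except socket.gaierror:
        # Raised when the lookup fails, e.g. "Name or service not known".
        return False


if __name__ == "__main__":
    host = socket.gethostname()
    if hostname_resolves(host):
        print(f"{host} resolves; the hosts file is probably not the problem")
    else:
        # Assumption: mapping the hostname to the loopback address is enough
        # for a single-machine run.
        print(f"{host} does not resolve; consider adding a line such as")
        print(f"127.0.0.1 {host}")
        print("to /etc/hosts")
```

Editing "/etc/hosts" usually requires root privileges.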
Try this: https://blog.csdn.net/lin_xiao_yi/article/details/132490694