[Ray train][Quick start demo] socket.cpp:[c10d] system error: 10049
What happened + What you expected to happen
I want to run the Ray Train quick start demo on Windows 10 using only the CPU, but it shows the socket.cpp error. On a single host the code keeps going, while on multiple nodes it gets stuck. I wonder whether this error causes the hang and how to resolve it. I have set the environment variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER to 1.
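For reference, a minimal sketch of how the flag can be set from the driver process before Ray initializes a local instance (on a real multi-node cluster it has to be set in the shell before running `ray start` on every node); the resource print at the end is just illustrative:

import os
import ray

# Assumption: setting the flag in the driver process before ray.init() so the
# locally started Ray instance sees it; for `ray start` it must instead be set
# in the shell environment on each node before the command is run.
os.environ["RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER"] = "1"

ray.init()
print(ray.cluster_resources())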
The running logs:
2024-01-30 17:28:56,097 INFO worker.py:1642 -- Started a local Ray instance.
2024-01-30 17:29:00,157 INFO tune.py:229 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-01-30 17:29:00,158 INFO tune.py:655 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
View detailed results here: C:/projects/python_project/code_scan/ray_results/TorchTrainer_2024-01-30_17-28-47
To visualize your results with TensorBoard, run: `tensorboard --logdir C:/Users/taoche/ray_results/TorchTrainer_2024-01-30_17-28-47`
2024-01-30 17:29:00,215 INFO data_parallel_trainer.py:408 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(pid=115980)
Training started without custom configuration.
(TrainTrainable pid=121732) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=121732) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=121732) Starting distributed worker processes: ['48872 (127.0.0.1)', '115976 (127.0.0.1)', '102288 (127.0.0.1)', '115052 (127.0.0.1)', '84140 (127.0.0.1)', '96412 (127.0.0.1)', '99324 (127.0.0.1)', '103072 (127.0.0.1)']
(RayTrainWorker pid=48872) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=48872) [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:65151 (system error: 10049 - 在其上下文中,该请求的地址无效。).
(RayTrainWorker pid=48872) Moving model to device: cpu
Versions / Dependencies
Windows 10 / 11, Ray 2.9.1, Python 3.10.11
Reproduction script
import datetime
import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST
from torchvision.transforms import ToTensor, Normalize, Compose

import ray
# use ray framework
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, Checkpoint, RunConfig


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=1, padding=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256),
            nn.Dropout(p=0.5),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.model1(x)


def train_func(config):
    model = Model()
    # use ray framework to prepare the model (wraps it for distributed training)
    model = ray.train.torch.prepare_model(model)
    criterion = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=0.001)

    # Data
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    # local data directory path
    train_data = FashionMNIST(root="C:/projects/python_project/code_scan/data", train=True, download=False, transform=transform)
    train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
    # ray prepares the dataloader (adds a distributed sampler, moves batches to the right device)
    train_loader = ray.train.torch.prepare_data_loader(train_loader)

    # Training
    start_time = datetime.datetime.now()
    for epoch in range(2):
        print(epoch)
        i = 0
        print(len(train_loader))
        for images, labels in train_loader:
            i += 1
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print("epoch: " + str(epoch) + " " + str(len(train_loader)) + '/' + str(i))
        checkpoint_dir = "C:/projects/python_project/code_scan/model"  # local checkpoint_dir path
        checkpoint_path = checkpoint_dir + "/model.checkpoint"
        torch.save(model.state_dict(), checkpoint_path)
        ray.train.report({"loss": loss.item()}, checkpoint=Checkpoint.from_directory(checkpoint_dir))
    end_time = datetime.datetime.now()
    print("total duration: " + str((end_time - start_time).total_seconds()))


# [4] Configure scaling and resource requirements.
scaling_config = ScalingConfig(num_workers=8, use_gpu=False)
# scaling_config = ScalingConfig(num_workers=2, use_gpu=False, resources_per_worker={"CPU": 6})
run_config = RunConfig(storage_path="C:/projects/python_project/code_scan/ray_results")  # local ray results path

# [5] Launch distributed training job.
trainer = TorchTrainer(train_func, scaling_config=scaling_config, run_config=run_config)
result = trainer.fit()
Issue Severity
High: It blocks me from completing my task.
This seems to be an issue with PyTorch/Windows and not Ray.
@mattip can you repro in the coming week?
In general, Ray does not support clusters on Windows.
I need more information to reproduce. @mct211 what exactly do you mean by
On one host, the code can keep going on, while on multi node, it will stuck.
How are you running the reproducer script? How are you setting up the cluster hinted at by the error line
(RayTrainWorker pid=48872) [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] \
The client socket has failed to connect to [kubernetes.docker.internal]:65151 (system error: 10049 - 在其上下文中,该请求的地址无效。).
System error 10049 is "The requested address is not valid in its context", which would hint at a network configuration problem with the kubernetes.docker.internal node. So as much detail as possible about how you set things up is required in order to help you work through this (unsupported) mode of running Ray.
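A quick way to check what kubernetes.docker.internal (the hostname from the log line above) actually resolves to on the affected machine is a short snippet like the following; a diagnostic sketch only, not a fix:

import socket

# Resolve the hostname that c10d is trying to connect to. On machines with
# Docker Desktop installed this often maps to a host-only address that other
# LAN nodes cannot reach, which would be consistent with error 10049.
print(socket.gethostbyname("kubernetes.docker.internal"))

# For comparison, the address the current machine reports for itself:
print(socket.gethostbyname(socket.gethostname()))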
@mattip hi, mattip. Thank you for your comments.
"On one host, the code can keep going on, while on multi node, it will stuck." means, if i use only one windows host to run the demo script, the 10049 error will also appear, but the training process can continue and the training process was normal. When i use another windows host to join the cluster( ray start --address='xxxxxx') and then run the code on head node, the error will appear and the running process will stuck, the training process doesn't seem to have started.
I set the environment variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 and the head node execute the command " ray start --head --node-ip-address=localhost --port='xxxx' ", another host node execute the command " ray start --address='xxxxxx' to join the cluster. And then the head node execute the python demo script. Then the error occurred. BTW, i only use the CPU to run the demo script.
Thanks!
Sorry, I am a slow learner. Can you provide exact instructions for how to reproduce the problem (only what doesn't work, not what does work)? Something like: on Windows computer A I set these environment variables and run this code; then on computer B I set these other variables and run this other code.
@mattip Hi mattip, the Windows computers A and B are on the same LAN. For example, A's IP is 192.168.1.2 and B's IP is 192.168.1.3. I set computer A as the head node.
On A, I set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 and execute the command "ray start --head --node-ip-address=localhost --port=6666".
On B, I set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 and execute the command "ray start --address='192.168.1.2:6666'" (B has joined the cluster). Then computer A runs the code above.
Tips: I only use the CPU to run the demo script. In my case, running "ray status" shows 8 CPUs on computer A and 8 CPUs on computer B, so the cluster shows 16 CPUs in total. I therefore set ScalingConfig(num_workers=8, use_gpu=False) (8 or more workers) to try to use both computers' CPUs at the same time.
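For what it's worth, a minimal sketch (assuming the cluster started with the commands above is already running) of how the driver on A can confirm that both nodes and all 16 CPUs are visible to Ray before launching the trainer, which separates cluster-formation problems from the later c10d rendezvous failure:

import ray

# Connect to the existing cluster started with `ray start` rather than
# creating a new local instance.
ray.init(address="auto")

# One entry per node; both 192.168.1.x hosts should appear as alive.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"].get("CPU"))

# Aggregate resources; should report 16 CPUs for the two 8-CPU machines.
print(ray.cluster_resources())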
The error is coming from PyTorch here. Searching around for such errors in PyTorch, I see https://github.com/pytorch/pytorch/issues/77523 with no solution, and https://github.com/pytorch/pytorch/issues/80638 which does have a solution, but I am not sure how to apply it to Ray.
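If the PyTorch-side workaround comes down to environment variables (for example pinning the network interface Gloo binds to), one way such variables could be propagated to every Ray worker is through the runtime environment. A hedged sketch; the variable name and interface name are assumptions, not a confirmed fix for this 10049 error:

import ray

# Hypothetical example: forward environment variables that influence the
# c10d/Gloo rendezvous to all Ray workers. GLOO_SOCKET_IFNAME is a standard
# PyTorch/Gloo setting for selecting the network interface; whether it
# resolves this particular problem is untested.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"GLOO_SOCKET_IFNAME": "Ethernet"}},
)

# ... then build and fit the TorchTrainer exactly as in the reproduction
# script above; the workers inherit the env_vars from the runtime_env.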
OK, thanks mattip, I'll look into it. If you find out how to apply it to Ray later, please let me know. Thanks.
Hi, I'm running into the same situation. I also run two Win10 PCs using Docker on a LAN. I did not set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1, so computer B, as a worker node, connects to the head node but dies after about 20 seconds. I will try setting the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1. But did that solve your problem? I think I have the same issue as you.