
[Ray train][Quick start demo] socket.cpp:[c10d] system error: 10049

Open mct2611 opened this issue 1 year ago • 5 comments

What happened + What you expected to happen

I want to run the Ray Train quick start demo on Windows 10 using only the CPU, but it shows the socket.cpp error. On a single host the code can keep going, while on multiple nodes it gets stuck. I wonder whether this error causes the hang and how to resolve it. I have set the environment variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER to 1.

The running logs:

2024-01-30 17:28:56,097	INFO worker.py:1642 -- Started a local Ray instance.
2024-01-30 17:29:00,157	INFO tune.py:229 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-01-30 17:29:00,158	INFO tune.py:655 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: C:/projects/python_project/code_scan/ray_results/TorchTrainer_2024-01-30_17-28-47
To visualize your results with TensorBoard, run: `tensorboard --logdir C:/Users/taoche/ray_results/TorchTrainer_2024-01-30_17-28-47`
2024-01-30 17:29:00,215	INFO data_parallel_trainer.py:408 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(pid=115980) 

Training started without custom configuration.
(TrainTrainable pid=121732) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=121732) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=121732) Starting distributed worker processes: ['48872 (127.0.0.1)', '115976 (127.0.0.1)', '102288 (127.0.0.1)', '115052 (127.0.0.1)', '84140 (127.0.0.1)', '96412 (127.0.0.1)', '99324 (127.0.0.1)', '103072 (127.0.0.1)']
(RayTrainWorker pid=48872) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=48872) [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:65151 (system error: 10049 - The requested address is not valid in its context.).
(RayTrainWorker pid=48872) Moving model to device: cpu

Versions / Dependencies

Windows 10 / 11, Ray 2.9.1, Python 3.10.11

Reproduction script

import datetime

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST
from torchvision.transforms import ToTensor, Normalize, Compose

import ray
# Ray Train imports
from ray.train import ScalingConfig, Checkpoint, RunConfig
from ray.train.torch import TorchTrainer

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=1, padding=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256),
            nn.Dropout(p=0.5),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.model1(x)

def train_func(config):

    model = Model()
    # prepare the model with Ray Train
    model = ray.train.torch.prepare_model(model)
    criterion = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=0.001)

    # Data
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    # local data directory path
    train_data = FashionMNIST(root="C:/projects/python_project/code_scan/data", train=True, download=False, transform=transform)
    train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

    # prepare the data loader with Ray Train
    train_loader = ray.train.torch.prepare_data_loader(train_loader)

    # Training
    start_time = datetime.datetime.now()
    for epoch in range(2):
        print(epoch)
        i = 0
        print(len(train_loader))
        for images, labels in train_loader:
            i += 1
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print("epoch: " + str(epoch) + " " + str(len(train_loader)) + '/' + str(i))

        checkpoint_dir = "C:/projects/python_project/code_scan/model"  # local checkpoint_dir path

        checkpoint_path = checkpoint_dir + "/model.checkpoint"
        torch.save(model.state_dict(), checkpoint_path)

        ray.train.report({"loss": loss.item()}, checkpoint=Checkpoint.from_directory(checkpoint_dir))


    end_time = datetime.datetime.now()
    print("total duration: " + str((end_time - start_time).total_seconds()))


# [4] Configure scaling and resource requirements.
scaling_config = ScalingConfig(num_workers=8, use_gpu=False)
# scaling_config = ScalingConfig(num_workers=2, use_gpu=False, resources_per_worker={"CPU": 6})


run_config = RunConfig(storage_path="C:/projects/python_project/code_scan/ray_results")  # local ray results path

# [5] Launch distributed training job.
trainer = TorchTrainer(train_func, scaling_config=scaling_config, run_config=run_config)
result = trainer.fit()

Issue Severity

High: It blocks me from completing my task.

mct2611 avatar Jan 30 '24 09:01 mct2611

This seems to be an issue with PyTorch/Windows and not Ray.

matthewdeng avatar Feb 07 '24 18:02 matthewdeng

@mattip can you repro in the coming week?

anyscalesam avatar Feb 14 '24 17:02 anyscalesam

In general, clusters are not supported by ray on Windows.

I need more information to reproduce. @mct2611 what exactly do you mean by

On a single host the code can keep going, while on multiple nodes it gets stuck.

How are you running the reproducer script? How are you setting up the cluster hinted at by the error line

(RayTrainWorker pid=48872) [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] \ 
    The client socket has failed to connect to [kubernetes.docker.internal]:65151 (system error: 10049 - The requested address is not valid in its context.).

System error 10049 is "The requested address is not valid in its context.", which hints at a network configuration problem with the kubernetes.docker.internal node, so as much detail as possible about how you set things up is needed in order to help you work through this (unsupported) mode of running ray.
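
For reference, a minimal diagnostic sketch (plain Python; the hostname and port are copied from the log line above and should be adjusted to whatever your run actually advertises) that checks whether the rendezvous address is even resolvable and reachable from the worker machine:

import socket

# Hostname and port taken from the c10d warning above; both are placeholders.
host, port = "kubernetes.docker.internal", 65151

try:
    addr = socket.gethostbyname(host)
    print(f"{host} resolves to {addr}")
    # Try to open a plain TCP connection to the advertised rendezvous port.
    with socket.create_connection((addr, port), timeout=5):
        print(f"TCP connection to {addr}:{port} succeeded")
except OSError as exc:
    # WSAEADDRNOTAVAIL (10049) or a timeout here points at the same
    # network-configuration problem the c10d warning is reporting.
    print(f"Could not reach {host}:{port}: {exc}")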

mattip avatar Feb 15 '24 08:02 mattip

@mattip Hi mattip, thank you for your comments.

"On a single host the code can keep going, while on multiple nodes it gets stuck" means: if I use only one Windows host to run the demo script, the 10049 error also appears, but the training process continues and behaves normally. When I use another Windows host to join the cluster (ray start --address='xxxxxx') and then run the code on the head node, the error appears and the run gets stuck; the training process doesn't seem to start at all.

I set the environment variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1, run "ray start --head --node-ip-address=localhost --port='xxxx'" on the head node, and run "ray start --address='xxxxxx'" on the other host to join the cluster. Then the head node runs the Python demo script, and the error occurs. BTW, I only use the CPU to run the demo script.

Thanks!

mct2611 avatar Feb 19 '24 09:02 mct2611

Sorry, I am a slow learner. Can you provide exact instructions for how to reproduce the problem (only what doesn't work, not what does work)? Something like: on Windows computer A I set these environment variables and run this code; then on computer B I set these other variables and run this other code.

mattip avatar Feb 22 '24 20:02 mattip

@mattip Hi mattip, Windows computer A and Windows computer B are on the same LAN. For example, A's IP is 192.168.1.2 and B's IP is 192.168.1.3. I set computer A as the head node.

On A, I set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 and execute the command "ray start --head --node-ip-address=localhost --port=6666".

On B, I set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 and execute the command "ray start --address='192.168.1.2:6666'" (B then joins the cluster). Then computer A runs the code above.

Tips: only the CPU is used to run the demo script. In my case, running the "ray status" command shows 8 CPUs on computer A and 8 CPUs on computer B, for a cluster total of 16 CPUs. So I set ScalingConfig(num_workers=8, use_gpu=False) (num_workers >= 8) to try to use both computers' CPUs at the same time.
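
(For reference, a hedged sketch of a quick check one could run from the driver on computer A before launching the trainer, to confirm that both nodes' CPUs are actually visible to the cluster; the address below is the example head address from this thread.)

import ray

# Connect the driver to the existing cluster started with `ray start --head ...`.
ray.init(address="192.168.1.2:6666")

# `ray status` reported 8 + 8 CPUs; this queries the same information programmatically.
total_cpus = ray.cluster_resources().get("CPU", 0)
print(f"CPUs visible to the cluster: {total_cpus}")  # expected: 16 with both nodes joined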

mct2611 avatar Feb 26 '24 02:02 mct2611

The error is coming from pytorch here. Searching around for such errors in pytorch, I see https://github.com/pytorch/pytorch/issues/77523 with no solution, and https://github.com/pytorch/pytorch/issues/80638 which does have a solution, but I am not sure how to apply it to ray.
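
For context, a hedged sketch (not a confirmed fix) of where such a workaround could plug in on the Ray side: TorchTrainer accepts a TorchConfig that controls the process-group backend and the c10d init method, which is the piece the failing socket rendezvous belongs to. The backend and init_method values below are assumptions for this CPU-only setup, not something verified to resolve error 10049.

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

# Assumed settings: gloo since this thread is CPU-only, and a TCP rendezvous
# instead of the default env:// initialization.
torch_config = TorchConfig(backend="gloo", init_method="tcp")

trainer = TorchTrainer(
    train_func,  # the train_func defined in the reproduction script above
    scaling_config=ScalingConfig(num_workers=8, use_gpu=False),
    torch_config=torch_config,
)
result = trainer.fit()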

mattip avatar Mar 03 '24 14:03 mattip

OK, thanks mattip, I'll look into it. If you find out how to apply it to ray later, please let me know. Thanks!

mct2611 avatar Mar 04 '24 03:03 mct2611

OK, thanks mattip, I'll look into it. If you find out how to apply it to ray later, please let me know. Thanks!

Hi, I'm running into the same situation. I also run two Win10 PCs using Docker on a LAN. I did not set the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1, so computer B connects to the head node as a worker node, but it dies after about 20 seconds. I will try setting the variable RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1. But did that solve your problem? I think I have the same issue as you.

NemoAir avatar Jul 22 '24 02:07 NemoAir