
[Bug]: DASO Warning `[W socket.cpp:558] [c10d] The client socket cannot be initialized` (Stage 2023)

Open ClaudiaComito opened this issue 2 years ago • 4 comments

What happened?

CC EDIT 9.10.2023: tests do not fail, there are just a ton of warnings (see below)

CC UPDATED SETTINGS 20.8.2022

Setup:

  • branch release/1.2.x
  • on HDFML:
ml Stages/2022
ml GCC OpenMPI mpi4py PyTorch torchvision HDF5 netCDF
export CUDA_VISIBLE_DEVICES=0,1,2,3

Run:

> salloc --account=haf --nodes=2 --time=00:30:00 --gres=gpu:4
> srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python -m unittest -vf heat.optim.tests.test_dp_optimizer.TestDASO

Code snippet triggering the error

heat.optim.tests.test_dp_optimizer.TestDASO

Error message or erroneous outcome


test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[...]
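On Linux, errno 97 is `EAFNOSUPPORT`, and `[::]` is the IPv6 wildcard address, so these warnings suggest that IPv6 sockets are not available on the compute nodes. A minimal sketch for checking this on a node follows; the helper name `ipv6_bind_status` is made up for illustration, and the idea that the run then continues over IPv4 is an assumption that this thread does not confirm.

```python
import errno
import socket


def ipv6_bind_status(port=0):
    """Try to create and bind an IPv6 TCP socket.

    Returns None on success, otherwise the errno of the failure.
    Port 0 lets the OS pick a free port, so this does not collide
    with a running job that already uses 29500.
    """
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::", port))
    except OSError as exc:
        return exc.errno
    return None


status = ipv6_bind_status()
if status is None:
    print("IPv6 sockets are available on this node")
elif status == errno.EAFNOSUPPORT:
    print("IPv6 is not supported here (EAFNOSUPPORT, errno 97 on Linux)")
else:
    print(f"IPv6 bind failed: {errno.errorcode.get(status, status)}")
```

Running this once per node (for example through `srun`) would show whether the warning simply reflects the node configuration rather than a problem in DASO.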

Version

1.2.x

Python version

3.9

PyTorch version

1.11

MPI version

OpenMPI with CUDA-aware settings (default in Stage 2022)

ClaudiaComito, Aug 13 '22 04:08

Still open and relevant.

(Reviewed within #1109)

mrfh92, Aug 17 '23 11:08

I can reproduce this partially. I also get the error message

[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).

but after that, DASO seems to run through:

Finished DASO init
End of Warmup Phase, parameters of next epoch
         Global Skips: 4,  Local Skips 1, Batches to wait: 1
Best loss value: inf Current loss: 0.1439, Worse epochs: 0
        Next Parameters: Global Skips: 4, Local Skips 1,  Batches to wait: 1,
        Current loss: 0.1439,  Worse epochs: 0
Best loss value: 0.0000 Current loss: 0.1275, Worse epochs: 0
        Next Parameters: Global Skips: 4, Local Skips 1,  Batches to wait: 1,
        Current loss: 0.1275,  Worse epochs: 1
Best loss value: 0.0000 Current loss: 0.1239, Worse epochs: 1
        Next Parameters: Global Skips: 4, Local Skips 1,  Batches to wait: 1,
        Current loss: 0.1239,  Worse epochs: 2
Best loss value: 0.0000 Current loss: 0.1226, Worse epochs: 2
dropping skips
        Next Parameters: Global Skips: 2, Local Skips 1,  Batches to wait: 1,
        Current loss: 0.1226,  Worse epochs: 0
[...]

In particular, the test does not fail - at least not officially.

(tested on HDFML, 2 nodes, 4 tasks per node, 4 GPUs per node, 6 CPUs per task, with modules ml Stages/2023 GCC CUDA OpenMPI mpi4py PyTorch torchvision HDF5 h5py netCDF NCCL cuDNN cuTENSOR numba)
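To check that a captured test log contains only warnings of this kind and no other errors, the c10d lines can be tallied. The sketch below assumes the warning format shown in this issue; the regular expression and the function name `summarize_c10d_warnings` are illustrative and not part of heat or PyTorch.

```python
import re
from collections import Counter

# Matches c10d socket warnings in the format quoted in this issue.
C10D_WARNING = re.compile(
    r"\[W socket\.cpp:\d+\] \[c10d\] The (?P<role>server|client) socket "
    r"cannot be initialized .*?\(errno: (?P<errno>\d+) - (?P<reason>[^)]+)\)"
)


def summarize_c10d_warnings(log_text):
    """Count c10d socket warnings by (role, errno, reason)."""
    counts = Counter()
    for match in C10D_WARNING.finditer(log_text):
        key = (match["role"], int(match["errno"]), match["reason"])
        counts[key] += 1
    return counts


# Three warning lines copied from the report above.
sample = (
    "[W socket.cpp:401] [c10d] The server socket cannot be initialized on "
    "[::]:29500 (errno: 97 - Address family not supported by protocol).\n"
    "[W socket.cpp:558] [c10d] The client socket cannot be initialized to "
    "connect to [localhost]:29500 (errno: 97 - Address family not supported "
    "by protocol).\n"
    "[W socket.cpp:558] [c10d] The client socket cannot be initialized to "
    "connect to [localhost]:29500 (errno: 97 - Address family not supported "
    "by protocol).\n"
)

for (role, err, reason), n in summarize_c10d_warnings(sample).items():
    print(f"{n} x {role} socket warning, errno {err}: {reason}")
```

If every entry in the summary carries errno 97, the log is consistent with the behaviour described here: many warnings, but no failing test.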

@ClaudiaComito Is this the same behaviour you observed, or was that a different error?

mrfh92, Sep 04 '23 14:09

@coquelin77 I think you're the expert on this. Is the described behaviour as expected or is something going wrong here?

mrfh92, Oct 11 '23 09:10