[Bug]: DASO Warning `[W socket.cpp:558] [c10d] The client socket cannot be initialized` (Stage 2023)
What happened?
CC EDIT 9.10.2023: tests do not fail, just a ton of warnings (see below)
CC UPDATED SETTINGS 20.8.2022
Setup:
- branch: release/1.2.x
- on HDFML:
ml Stages/2022
ml GCC OpenMPI mpi4py PyTorch torchvision HDF5 netCDF
export CUDA_VISIBLE_DEVICES=0,1,2,3
Run:
> salloc --account=haf --nodes=2 --time=00:30:00 --gres=gpu:4
> srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python -m unittest -vf heat.optim.tests.test_dp_optimizer.TestDASO
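As a sanity check, a small per-rank diagnostic along these lines can be run under the same allocation to confirm what each task actually sees (just a debugging sketch with a hypothetical file name, not part of Heat; it only assumes the standard SLURM environment variables and PyTorch):

# rank_info.py - print the per-task view of node and GPUs (debugging sketch)
import os
import socket
import torch

rank = os.environ.get("SLURM_PROCID", "?")
local_rank = os.environ.get("SLURM_LOCALID", "?")
print(
    f"rank {rank} (local {local_rank}) on {socket.gethostname()}: "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
    f"torch sees {torch.cuda.device_count()} GPU(s)"
)

> srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python rank_info.py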
Code snippet triggering the error
heat.optim.tests.test_dp_optimizer.TestDASO
Error message or erroneous outcome
test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[...]
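For context: errno 97 is EAFNOSUPPORT ("Address family not supported by protocol"), which c10d reports when it attempts an IPv6 socket (the [::] / [localhost] lines above) on a node without IPv6 support; it then apparently falls back to IPv4, which would also explain why the test still runs through (see below). A quick, Heat-independent check on a compute node could look like this (sketch only; errno 97 corresponds to errno.EAFNOSUPPORT on Linux):

# check_ipv6.py - does this node support IPv6 sockets at all? (debugging sketch)
import errno
import socket

try:
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
except OSError as exc:
    if exc.errno == errno.EAFNOSUPPORT:  # errno 97, the value shown in the warning
        print("No IPv6 socket support on this node - matches the c10d warning")
    else:
        raise
else:
    sock.close()
    print("IPv6 sockets can be created here; errno 97 must then come from elsewhere")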
Version
1.2.x
Python version
3.9
PyTorch version
1.11
MPI version
OpenMPI with CUDA-aware settings (default in Stage 2022)
Still open and relevant.
(Reviewed within #1109)
I can reproduce this partially. I also get the error message
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
but after that, DASO seems to run through:
Finished DASO init
End of Warmup Phase, parameters of next epoch
Global Skips: 4, Local Skips 1, Batches to wait: 1
Best loss value: inf Current loss: 0.1439, Worse epochs: 0
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1439, Worse epochs: 0
Best loss value: 0.0000 Current loss: 0.1275, Worse epochs: 0
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1275, Worse epochs: 1
Best loss value: 0.0000 Current loss: 0.1239, Worse epochs: 1
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1239, Worse epochs: 2
Best loss value: 0.0000 Current loss: 0.1226, Worse epochs: 2
dropping skips
Next Parameters: Global Skips: 2, Local Skips 1, Batches to wait: 1,
Current loss: 0.1226, Worse epochs: 0
[...]
In particular, the test does not fail - at least not officially.
(Tested on HDFML, 2 nodes, 4 tasks per node, 4 GPUs per node, 6 CPUs per task, with modules: ml Stages/2023 GCC CUDA OpenMPI mpi4py PyTorch torchvision HDF5 h5py netCDF NCCL cuDNN cuTENSOR numba.)
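If we want to get rid of the warnings rather than just tolerate them, one option might be to point the c10d rendezvous at an explicit IPv4 address before torch.distributed is initialized, so the IPv6 attempt never happens. This is only a sketch under the assumption that the store uses the usual MASTER_ADDR/MASTER_PORT environment variables (the [localhost]:29500 default in the warning suggests so); I have not checked the DASO setup code for this:

# Hypothetical workaround: force an IPv4 rendezvous address, e.g. at the top of the
# test module or in the job script, before any process group is created.
import os

# "127.0.0.1" instead of "localhost" avoids the IPv6 (::1) resolution attempt on
# nodes without IPv6. Since the warning shows [localhost]:29500, the store seems to
# be node-local here; for a truly multi-node store this would have to be the IPv4
# address of the rank-0 node instead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")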
@ClaudiaComito Is this the same behaviour you observed, or was that another error?
@coquelin77 I think you're the expert on this. Is the described behaviour expected, or is something going wrong here?