[Bug]: DASO Warning `[W socket.cpp:558] [c10d] The client socket cannot be initialized` (Stage 2023)
What happened?
CC EDIT 9.10.2023: tests do not fail, just a ton of warnings (see below)
CC UPDATED SETTINGS 20.8.2022
Setup:
- branch: release/1.2.x
- on HDFML:
ml Stages/2022
ml GCC OpenMPI mpi4py PyTorch torchvision HDF5 netCDF
export CUDA_VISIBLE_DEVICES=0,1,2,3
Run:
> salloc --account=haf --nodes=2 --time=00:30:00 --gres=gpu:4
> srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python -m unittest -vf heat.optim.tests.test_dp_optimizer.TestDASO
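As a sanity check, a small per-rank diagnostic along these lines can be run under the same allocation to confirm what each task actually sees (just a debugging sketch with a hypothetical file name, not part of Heat; it only assumes the standard SLURM environment variables and PyTorch):

# rank_info.py - print the per-task view of node and GPUs (debugging sketch)
import os
import socket
import torch

rank = os.environ.get("SLURM_PROCID", "?")
local_rank = os.environ.get("SLURM_LOCALID", "?")
print(
    f"rank {rank} (local {local_rank}) on {socket.gethostname()}: "
    f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
    f"torch sees {torch.cuda.device_count()} GPU(s)"
)

> srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 python rank_info.py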
Code snippet triggering the error
heat.optim.tests.test_dp_optimizer.TestDASO
Error message or erroneous outcome
test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... test_daso (heat.optim.tests.test_dp_optimizer.TestDASO) ... [W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[...]
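For context: errno 97 is EAFNOSUPPORT ("Address family not supported by protocol"), which c10d reports when it attempts an IPv6 socket (the [::] / [localhost] lines above) on a node without IPv6 support; it then apparently falls back to IPv4, which would also explain why the test still runs through (see below). A quick, Heat-independent check on a compute node could look like this (sketch only; errno 97 corresponds to errno.EAFNOSUPPORT on Linux):

# check_ipv6.py - does this node support IPv6 sockets at all? (debugging sketch)
import errno
import socket

try:
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
except OSError as exc:
    if exc.errno == errno.EAFNOSUPPORT:  # errno 97, the value shown in the warning
        print("No IPv6 socket support on this node - matches the c10d warning")
    else:
        raise
else:
    sock.close()
    print("IPv6 sockets can be created here; errno 97 must then come from elsewhere")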
Version
1.2.x
Python version
3.9
PyTorch version
1.11
MPI version
OpenMPI with CUDA-aware settings (default in Stage 2022)
Still open and relevant.
(Reviewed within #1109)
I can reproduce this partially. I also get the error message
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
but after that, DASO seems to run through:
Finished DASO init
End of Warmup Phase, parameters of next epoch
Global Skips: 4, Local Skips 1, Batches to wait: 1
Best loss value: inf Current loss: 0.1439, Worse epochs: 0
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1439, Worse epochs: 0
Best loss value: 0.0000 Current loss: 0.1275, Worse epochs: 0
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1275, Worse epochs: 1
Best loss value: 0.0000 Current loss: 0.1239, Worse epochs: 1
Next Parameters: Global Skips: 4, Local Skips 1, Batches to wait: 1,
Current loss: 0.1239, Worse epochs: 2
Best loss value: 0.0000 Current loss: 0.1226, Worse epochs: 2
dropping skips
Next Parameters: Global Skips: 2, Local Skips 1, Batches to wait: 1,
Current loss: 0.1226, Worse epochs: 0
[...]
In particular, the test does not fail - at least not officially.
(Tested on HDFML, 2 nodes, 4 tasks per node, 4 GPUs per node, 6 CPUs per task, with modules: ml Stages/2023 GCC CUDA OpenMPI mpi4py PyTorch torchvision HDF5 h5py netCDF NCCL cuDNN cuTENSOR numba.)
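If we want to get rid of the warnings rather than just tolerate them, one option might be to point the c10d rendezvous at an explicit IPv4 address before torch.distributed is initialized, so the IPv6 attempt never happens. This is only a sketch under the assumption that the store uses the usual MASTER_ADDR/MASTER_PORT environment variables (the [localhost]:29500 default in the warning suggests so); I have not checked the DASO setup code for this:

# Hypothetical workaround: force an IPv4 rendezvous address, e.g. at the top of the
# test module or in the job script, before any process group is created.
import os

# "127.0.0.1" instead of "localhost" avoids the IPv6 (::1) resolution attempt on
# nodes without IPv6. Since the warning shows [localhost]:29500, the store seems to
# be node-local here; for a truly multi-node store this would have to be the IPv4
# address of the rank-0 node instead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")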
@ClaudiaComito Is this the same behaviour you observed, or was that another error?
@coquelin77 I think you're the expert on this. Is the described behaviour expected, or is something going wrong here?