heat
[Bug]: `test_vmap` fails on multi-node runs on hardware accelerators
What happened?
`test_vmap` fails when run on more than one node with GPUs enabled. This needs further investigation.
Code snippet triggering the error
On HoreKa's accelerated nodes, the test fails when run on 2 nodes with 3 or 4 ranks each.
HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 3/4 pytest heat/core/tests/test_vmap.py
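The `3/4` in the command above appears to be shorthand for two separate runs (3 or 4 ranks per node), not a literal argument. As a minimal sketch, the following prints one command line per failing configuration. The helper name `build_command` is hypothetical, and the script assumes the commands are then run from the repository root inside a 2-node allocation.

```python
import shlex

# Test file and rank counts are taken from the report above.
TEST_FILE = "heat/core/tests/test_vmap.py"
FAILING_RANKS_PER_NODE = (3, 4)


def build_command(ranks_per_node: int) -> str:
    """Return the shell command line for one configuration (hypothetical helper)."""
    args = [
        "mpirun",
        "--report-bindings",
        "-N",
        str(ranks_per_node),  # ranks per node, as used in the report
        "pytest",
        TEST_FILE,
    ]
    # The environment variable selects the GPU test device, as in the report.
    return "HEAT_TEST_USE_DEVICE=gpu " + shlex.join(args)


# Print one command per failing configuration; nothing is executed here.
for ranks in FAILING_RANKS_PER_NODE:
    print(build_command(ranks))
```

The script only prints the commands, so it can be checked on a machine without MPI or GPUs before pasting the output into a job script.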
Error message or erroneous outcome
The result of the test does not match the expected outcome.
FAILED heat/core/tests/test_vmap.py::TestVmap::test_vmap - AssertionError: False is not true
Version
main (development branch)
Python version
3.11.2
PyTorch version
2.2.2
Cuda version
12.2
MPI version
OpenMPI 4.1, 5.0
mpi4py 3.1.6, 4.0.0