
[Bug]: `test_vmap` fails on multi-node runs on hardware accelerators

Open JuanPedroGHM opened this issue 6 months ago • 3 comments

What happened?

`test_vmap` fails when run on more than one node with GPUs enabled. This needs further investigation.

Code snippet triggering the error

On Horeka's accelerated nodes, the test fails when run on 2 nodes with 3 or 4 ranks each.

HEAT_TEST_USE_DEVICE=gpu mpirun --report-bindings -N 3/4 pytest heat/core/tests/test_vmap.py
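The `-N 3/4` above reads as shorthand for two separate runs, one with 3 ranks per node and one with 4. A minimal sketch of how both could be launched from inside a 2-node allocation follows. The `DRY_RUN` switch and the `run_vmap_test` helper are hypothetical names introduced here for illustration; with the default `DRY_RUN=1` the script only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch only: expands "-N 3/4" into one mpirun invocation per value.
# DRY_RUN and run_vmap_test are illustrative names, not part of heat.
DRY_RUN="${DRY_RUN:-1}"

run_vmap_test() {
    ranks_per_node="$1"
    # In dry-run mode, prefix the command with echo so it is printed, not run.
    if [ "$DRY_RUN" = "1" ]; then runner="echo"; else runner=""; fi
    $runner env HEAT_TEST_USE_DEVICE=gpu \
        mpirun --report-bindings -N "$ranks_per_node" \
        pytest heat/core/tests/test_vmap.py
}

# The report states the failure occurs with 3 and with 4 ranks per node.
for ranks in 3 4; do
    run_vmap_test "$ranks"
done
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, assuming `mpirun` and `pytest` are available in the job environment.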

Error message or erroneous outcome

The result of the test does not match the expected outcome.

FAILED heat/core/tests/test_vmap.py::TestVmap::test_vmap - AssertionError: False is not true

Version

main (development branch)

Python version

3.11.2

PyTorch version

2.2.2

Cuda version

12.2

MPI version

OpenMPI 4.1, 5.0
mpi4py 3.1.6, 4.0.0

JuanPedroGHM, Aug 19 '24 07:08