heat
[Bug]: `save_csv` tests do not work on certain multi-node envs.
What happened?
The `csv`-related tests fail on multi-node environments because the temporary directory is not available on both nodes. If that is the case, `save_csv` should either throw a warning to the user or collect the data on a single node.
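As a rough illustration of the "throw a warning" option, here is a minimal sketch of a visibility check that could run after the filename has been broadcast. It assumes the communicator exposes an mpi4py-style `allgather`, as `data.comm.handle` in the snippet below appears to. The names `SingleProcessComm`, `visible_on_all_ranks`, and `check_shared_path` are hypothetical and not part of heat.

```python
import os
import warnings


class SingleProcessComm:
    """Hypothetical stand-in for an mpi4py communicator with a single rank."""

    def allgather(self, value):
        # With one rank, the gathered list holds only the local value.
        return [value]


def visible_on_all_ranks(comm, filename):
    """Return True only if every rank reports that `filename` exists."""
    local_ok = os.path.exists(filename)
    # Each rank contributes its local result; all ranks receive the full list.
    return all(comm.allgather(local_ok))


def check_shared_path(comm, filename):
    """Warn and return False when `filename` is missing on at least one rank."""
    if visible_on_all_ranks(comm, filename):
        return True
    warnings.warn(
        f"{filename!r} is not visible on every rank; "
        "the target directory may be node-local.",
        RuntimeWarning,
    )
    return False
```

Because the check is itself a collective call, every rank gets the same answer and can skip or fall back together instead of some ranks blocking on a missing file. With a real communicator the call would look like `check_shared_path(data.comm.handle, filename)`, under the assumption stated above.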
Code snippet triggering the error
if data.comm.rank == 0:
    tmpfile = tempfile.NamedTemporaryFile(
        prefix="test_io_", suffix=".csv", delete=False
    )
    tmpfile.close()
    filename = tmpfile.name
else:
    filename = None
filename = data.comm.handle.bcast(filename, root=0)
data.save(
    filename,
    header_lines=headers,
    sep=separator,
)
Error message or erroneous outcome
heat never exits because half the ranks are trying to open a file that does not exist.
Version
main (development branch)
Python version
3.11
PyTorch version
2.1
MPI version
OpenMPI 4.1, 5.0