[Bug]: `save_csv` tests do not work on certain multi-node envs.

Open • JuanPedroGHM opened this issue 6 months ago • 1 comment

What happened?

The CSV-related tests fail on multi-node environments because the temporary directory is not available on every node. In that case, save_csv should either raise a warning to the user or collect the data on a single node.

Code snippet triggering the error

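# Create the temporary file on rank 0 only, then share its path with all ranks.
if data.comm.rank == 0:
    tmpfile = tempfile.NamedTemporaryFile(
        prefix="test_io_", suffix=".csv", delete=False
    )
    tmpfile.close()
    filename = tmpfile.name
else:
    filename = None
filename = data.comm.handle.bcast(filename, root=0)

# All ranks use `filename`, but the path may only exist on rank 0's node.
data.save(
    filename,
    header_lines=headers,
    sep=separator,
)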

Error message or erroneous outcome

Heat never exits because half the ranks are trying to open a file that does not exist.
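
One possible way to detect this situation before writing is to let every rank check whether the target directory is visible locally and then agree on the result collectively. The following is only a sketch, not Heat's implementation: `path_visible_on_all_ranks`, `warn_if_not_shared` and `SingleRankComm` are hypothetical names, and `comm` is assumed to be an mpi4py-style communicator that provides `allgather` (possibly what `data.comm.handle` is in the snippet above, but that is an assumption).

```python
import os
import warnings


class SingleRankComm:
    """Hypothetical stand-in for an MPI communicator, to try this without MPI."""

    def allgather(self, value):
        # A real communicator returns one entry per rank; here there is one rank.
        return [value]


def path_visible_on_all_ranks(comm, filename):
    """Return True only if every rank can see the directory holding `filename`."""
    directory = os.path.dirname(os.path.abspath(filename))
    visible_here = os.path.isdir(directory)
    # Collective call: every rank must reach this line, and every rank
    # receives the same list, so all ranks take the same branch afterwards.
    return all(comm.allgather(visible_here))


def warn_if_not_shared(comm, filename):
    """Warn on every rank if the target directory is missing on some rank."""
    shared = path_visible_on_all_ranks(comm, filename)
    if not shared:
        warnings.warn(
            f"Directory of {filename!r} is not visible on all ranks; "
            "gather the data on one rank or use a shared filesystem.",
            RuntimeWarning,
        )
    return shared
```

Because all ranks get the same answer, they can all skip the write (or all fall back to gathering on one rank) instead of some of them hanging. With mpi4py, `MPI.COMM_WORLD` could be passed as `comm` in place of the stand-in.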

Version

main (development branch)

Python version

3.11

PyTorch version

2.1

MPI version

OpenMPI 4.1, 5.0

JuanPedroGHM, Aug 19 '24 07:08