🐛[BUG]: ERA5 DALI datapipe hangs indefinitely in multi-GPU/multi-Node setting if the datapipe size is not selected correctly.

Open ktangsali opened this issue 2 years ago • 0 comments

Version

0.2.0

On which installation method(s) does this occur?

Docker

Describe the issue

In most cases this can be worked around by changing the number of samples in the datapipe (for example here) so that it is divisible by the number of processes/GPUs.

A long-term fix would be to automatically handle the failure cases where the size is not exactly divisible by the number of GPUs.
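As a rough illustration of the workaround, the sketch below rounds a sample count down to the nearest multiple of the number of GPUs so that every rank gets the same number of samples. The helper names `truncate_to_world_size` and `samples_per_rank` are hypothetical and not part of the datapipe's API; where the resulting value is passed to the datapipe depends on the training script in use.

```python
def truncate_to_world_size(num_samples: int, world_size: int) -> int:
    """Round num_samples down to the nearest multiple of world_size.

    Hypothetical helper for illustration only; not a library function.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if num_samples < world_size:
        raise ValueError("need at least one sample per rank")
    # Drop the remainder so the total divides evenly across ranks.
    return num_samples - (num_samples % world_size)


def samples_per_rank(num_samples: int, world_size: int) -> int:
    """Number of samples each rank sees after truncation."""
    return truncate_to_world_size(num_samples, world_size) // world_size


if __name__ == "__main__":
    # Example values chosen for illustration: 1460 samples on 8 GPUs.
    total = truncate_to_world_size(1460, 8)
    print(total, samples_per_rank(1460, 8))
```

Rounding down discards at most `world_size - 1` samples per epoch, which is usually negligible next to the dataset size. Padding up to the next multiple would be the alternative, at the cost of repeating a few samples.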

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

ktangsali · Aug 02 '23 21:08