physicsnemo
🐛[BUG]: ERA5 DALI datapipe hangs indefinitely in multi-GPU/multi-node settings if the datapipe size is not chosen correctly.
Version
0.2.0
On which installation method(s) does this occur?
Docker
Describe the issue
In most cases, this can be worked around by changing the number of samples in the datapipe (for example here) so that it is divisible by the number of processors/GPUs.
A long-term fix would be to handle this case automatically, so that the datapipe does not hang when its size is not exactly divisible by the number of GPUs.
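As a rough illustration of the workaround, the sample count could be rounded down to the nearest multiple of the number of GPUs before it is passed to the datapipe. The helper name `truncate_to_world_size` is hypothetical and is not part of PhysicsNeMo or DALI; how the resulting value is passed to the ERA5 datapipe is left out because it depends on the training script.

```python
def truncate_to_world_size(num_samples: int, world_size: int) -> int:
    """Round num_samples down to the nearest multiple of world_size.

    Hypothetical helper, not part of PhysicsNeMo or DALI. It only does the
    arithmetic for the workaround: every rank then sees the same number of
    samples, so no rank waits for data that never arrives.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if num_samples < world_size:
        raise ValueError("need at least one sample per rank")
    # Drop the remainder so the count divides evenly across ranks.
    return num_samples - (num_samples % world_size)


# Example: 1460 samples on 8 GPUs leaves a remainder of 4, which is dropped.
usable_samples = truncate_to_world_size(1460, 8)
print(usable_samples)  # 1456
```

Rounding down discards at most `world_size - 1` samples per epoch; padding up to the next multiple would be an alternative if no sample may be skipped.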
Minimum reproducible example
No response
Relevant log output
No response
Environment details
No response