fincherjc
Hello @botbw, I do not expect this change to impact the `training run` command. As far as I know, the intended design of the benchmark is to specify the maximum...
I think this is somewhat answered in issue #202, but to clarify here: MPI bind options are not defined by default in v2.0 and fall back to the [openmpi defaults](https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man1/mpirun.1.html#quick-summary). You can add...
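For anyone looking for a concrete starting point, here is a sketch of overriding Open MPI's default binding/mapping directly on `mpirun`. The flags (`--bind-to`, `--map-by`, `--report-bindings`) are standard Open MPI options; the specific values and the trailing command are placeholders to adapt to your own topology and benchmark invocation:

```shell
# Hypothetical example — tune the values for your host topology.
# `--bind-to none` leaves ranks unpinned so per-rank reader threads can
# spread across cores; `--report-bindings` prints the resulting
# placement to stderr so you can confirm what you actually got.
mpirun --bind-to none --map-by slot --report-bindings -np 8 <your-benchmark-command>
```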
I hit this message, along with performance problems, while running mlpstorage training runs in a small environment. As far as I could find, DLIO wants 4 threads per accelerator to drive...
For my purposes, I was using a single client and the v2.0 branch: `nohup mlpstorage training run --model unet3d --client-host-memory-in-gb --exec-type=mpi --num-accelerators --accelerator-type h100 --num-client-hosts --data-dir --param reader.odirect=true reader.read_threads= dataset.num_files_train=...
@xdreamcoder the main branch of this repo has been updated with new default MPI bind and map parameters that should address this. It should mirror what you get...
@xdreamcoder Can you file this as a new issue to keep tracking clean?
@xdreamcoder The traceback you've posted here hints that you've exceeded the maximum number of files your host can open concurrently (`OSError: Too many open files`). This isn't inherently a...
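As a quick check, you can inspect and raise the per-process open-file limit in the shell that launches the run. This is a generic POSIX `ulimit` sketch, not an mlpstorage-specific setting; raising the soft limit above the hard ceiling (or raising the hard ceiling itself) needs root or a change to `/etc/security/limits.conf`:

```shell
# Show the current soft limit on open file descriptors — the value the
# "Too many open files" OSError is hitting:
ulimit -n

# Raise the soft limit for this shell (and anything launched from it)
# up to the hard ceiling reported by `ulimit -Hn`:
ulimit -n "$(ulimit -Hn)"
ulimit -n
```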