What does this mean? [WARNING] Running DLIO with 32 threads for I/O but core available 2 are insufficient and can lead to lower performance.

Open xdreamcoder opened this issue 3 months ago • 7 comments

Why is this warning occurring, and how can I fix it? The number of cores reported by the "nproc" command is 512 (Hyper-Threading on).

xdreamcoder avatar Sep 23 '25 08:09 xdreamcoder

I ran into this message, along with performance problems, while running mlpstorage training runs in a small environment. As far as I could find, DLIO wants 4 threads per accelerator to drive the workload by default (more if you override the thread parameters). However, depending on your mlpstorage parameters and node count, the process that simulates an accelerator may be limited to 1 core (per accelerator). I managed to resolve this by overriding the MPI bind-to and map-by parameters.
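The mismatch described above can be checked from inside a process. The following is a minimal sketch, not DLIO's actual check: it compares a requested thread count against the CPUs the current process is allowed to use, which on Linux can be far fewer than the machine-wide count when a launcher pins the process. The function names usable_cores and thread_warning are hypothetical.

```python
import os


def usable_cores():
    """Return how many CPUs this process is allowed to run on."""
    try:
        # Linux: reflects the affinity mask set by a launcher such as mpirun or taskset.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the machine-wide count.
        return os.cpu_count() or 1


def thread_warning(requested_threads, available_cores=None):
    """Return a warning string if the threads outnumber the usable cores, else None.

    Hypothetical helper for illustration; it is not taken from DLIO.
    """
    if available_cores is None:
        available_cores = usable_cores()
    if requested_threads > available_cores:
        return (f"{requested_threads} I/O threads requested but only "
                f"{available_cores} core(s) usable by this process")
    return None


if __name__ == "__main__":
    print("machine-wide CPUs:", os.cpu_count())
    print("usable by this process:", usable_cores())
    print(thread_warning(32) or "32 threads fit within the usable cores")
```

If the two counts printed differ, the process is bound to a subset of the machine, which would be consistent with seeing "core available 2" on a host where "nproc" reports 512 in an unbound shell.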

fincherjc avatar Sep 25 '25 13:09 fincherjc

@fincherjc, thanks for your answer. Can you share the full command you ran, including how you set the parameters? That would be a great help to me. Thank you.

xdreamcoder avatar Sep 26 '25 03:09 xdreamcoder

For my purposes, I was using a single client and the v2.0 branch:

nohup mlpstorage training run --model unet3d --client-host-memory-in-gb <MEM> --exec-type=mpi --num-accelerators <N> --accelerator-type h100 --num-client-hosts <N> --data-dir <DIR> --param reader.odirect=true reader.read_threads=<NPROC> dataset.num_files_train=<DATASIZE> --oversubscribe &

The current benchmark does not specify any MPI bind-to or map-by parameters, so these fall back to the Open MPI defaults, which vary depending on the number of accelerators and clients you define. Adding --oversubscribe changes the bind-to behavior to "none" (meaning all CPUs can be used at the direction of the OS scheduler).

With this running in the background, you can verify in nohup.out that the warning is not present. You should also see in the mpstat -P ALL 5 output that more CPUs are active than in the run that produced this warning.
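Besides mpstat, the binding each rank actually receives can be printed directly. The sketch below is one possible way to do that and is not part of mlpstorage or DLIO; it assumes Linux and Open MPI, which exports the rank in the OMPI_COMM_WORLD_RANK environment variable. The function name describe_binding and the file name report_binding.py are hypothetical.

```python
import os


def describe_binding(env=None, cpus=None):
    """Format the rank and the CPUs it may run on as one line of text.

    env and cpus can be passed explicitly for testing; by default they are
    read from the current process.
    """
    env = os.environ if env is None else env
    if cpus is None:
        if hasattr(os, "sched_getaffinity"):
            # Linux: the affinity mask reflects the launcher's bind-to policy.
            cpus = os.sched_getaffinity(0)
        else:
            # Fallback for other platforms: assume every CPU is usable.
            cpus = range(os.cpu_count() or 1)
    # Open MPI sets this variable for each launched rank; absent when run directly.
    rank = env.get("OMPI_COMM_WORLD_RANK", "n/a")
    cpus = sorted(cpus)
    return f"rank {rank}: {len(cpus)} usable CPU(s): {cpus}"


if __name__ == "__main__":
    print(describe_binding())
```

Launching the script under Open MPI with different policies, for example mpirun -np 2 --bind-to core python3 report_binding.py versus mpirun -np 2 --bind-to none python3 report_binding.py, should show a short CPU list per rank in the first case and the full set in the second.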

fincherjc avatar Sep 26 '25 13:09 fincherjc

@xdreamcoder the main branch of this repo has been updated with new default MPI bind and map parameters that should address this. The behavior should mirror what you get when adding --oversubscribe. Can you check and confirm you've gotten past this issue?

fincherjc avatar Oct 01 '25 18:10 fincherjc

@fincherjc, thank you for your quick help and fix. I have confirmed that issues #201 and #202, which I reported, are resolved. However, my CPU's thread count is 512 (checked with the nproc command): 512 = 128 cores x 2 sockets x 2 (Hyper-Threading on), yet it seems that read_threads can only be allocated up to 128. If so, I think performance will be the same as with a 1-socket CPU with Hyper-Threading off. Could you look into this?
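The arithmetic behind this comparison can be spelled out as a small sketch. The topology figures come from the comment above; the function name logical_cpus is hypothetical, and the sketch only shows that the observed limit of 128 equals the logical CPU count of one socket without Hyper-Threading. It does not establish why the limit exists.

```python
def logical_cpus(cores_per_socket, sockets, threads_per_core):
    """Logical CPUs exposed by a given topology."""
    return cores_per_socket * sockets * threads_per_core


# Topology reported in the comment: 128 cores x 2 sockets x 2 hardware threads.
full_machine = logical_cpus(128, 2, 2)
# One socket with Hyper-Threading off, the configuration used for comparison.
one_socket_no_ht = logical_cpus(128, 1, 1)

print("logical CPUs on the full machine:", full_machine)
print("logical CPUs on one socket without Hyper-Threading:", one_socket_no_ht)
print("share of the machine covered by 128 read threads:",
      one_socket_no_ht / full_machine)
```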

xdreamcoder avatar Oct 02 '25 08:10 xdreamcoder

@xdreamcoder Can you file this as a new issue to keep tracking clean?

fincherjc avatar Oct 02 '25 13:10 fincherjc

@fincherjc, at your request, I created a new issue (github.com/mlcommons/storage/issues/205).

xdreamcoder avatar Oct 13 '25 01:10 xdreamcoder