
Guidance on Implementing CPU Binding for UNet3D (PyTorch) with MPI

Open · garimajain05 opened this issue 7 months ago · 10 comments

I recently joined the MLCommons group and have been looking into UNet3D's performance. It was mentioned during our discussions that UNet3D does not fully utilize bandwidth, potentially due to being the only model implemented in PyTorch.

To address this, I understand we may need to implement CPU binding via MPI. I wanted to reach out to ask:

  • Has anyone in the group already explored or implemented CPU binding for UNet3D?
  • Are there recommended practices or existing examples that I could refer to?
  • Should this be done within the training script or as part of the MPI launch configuration?

Any guidance, documentation, or pointers would be greatly appreciated!

garimajain05 avatar Jun 09 '25 22:06 garimajain05

You can try --cpu-bind depth -d num_workers where num_workers is the number of read threads.

zhenghh04 avatar Jun 17 '25 17:06 zhenghh04

@zhenghh04 would you mind sharing an example?

rodrigonascimento avatar Jun 17 '25 17:06 rodrigonascimento

For example, if you are using 4 read threads, you can do

mpiexec --cpu-bind depth -d 4
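Spelled out a bit more fully, the launch line might look like the sketch below. This assumes a launcher that accepts the --cpu-bind/-d form shown above, 8 ranks, and 4 read threads per rank; the benchmark command itself is a placeholder.

```bash
# Sketch: 8 ranks, each rank's worker threads spread over a depth of 4 cores
# (match -d to reader.read_threads so the reader threads do not pile onto one core)
mpiexec -n 8 --cpu-bind depth -d 4 <benchmark command and arguments>
```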

zhenghh04 avatar Jun 17 '25 17:06 zhenghh04

https://www.open-mpi.org/doc/v3.0/man1/mpiexec.1.php

For process binding instructions:

--bind-to <foo>
Bind processes to the specified object, defaults to core. Supported options include slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none.
-cpus-per-proc, --cpus-per-proc <#perproc>
Bind each process to the specified number of cpus. (deprecated in favor of --map-by <obj>:PE=n)
-cpus-per-rank, --cpus-per-rank <#perrank>
Alias for -cpus-per-proc. (deprecated in favor of --map-by <obj>:PE=n)
-bind-to-core, --bind-to-core
Bind processes to cores (deprecated in favor of --bind-to core)
-bind-to-socket, --bind-to-socket
Bind processes to processor sockets (deprecated in favor of --bind-to socket)
-report-bindings, --report-bindings
Report any bindings for launched processes.

You can try the following. The main thing to avoid is having all the reader threads on the same core.

mpiexec --bind-to-core --cpus-per-proc 4 -report-bindings
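With the non-deprecated spellings from the man page quoted above, an equivalent launch would look roughly like the sketch below; the rank count and the launched command are placeholders, and PE should be at least the number of read threads.

```bash
# Sketch: give each rank 4 processing elements (cores), bind to cores, and print the bindings
mpiexec -n 8 --map-by slot:PE=4 --bind-to core --report-bindings <benchmark command and arguments>
```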

zhenghh04 avatar Jun 17 '25 17:06 zhenghh04

@zhenghh04 how could we incorporate this into the mlpstorage command? For example, to change the reader threads I am setting reader.read_threads:

mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 100 --num-accelerators 8 --accelerator-type h100 --model unet3d --data-dir /mnt/june12/data/ --results-dir /mnt/june12/result --checkpoint-folder False --param dataset.num_files_train=28000 reader.read_threads=8

garimajain05 avatar Jun 17 '25 17:06 garimajain05

Right now, you can print out the mpiexec command from mlpstorage with --what-is, then manually add the binding flags to that mpiexec command and run the mpiexec command directly instead of mlpstorage. We may be able to modify mlpstorage to allow extra flags to be passed through.
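As a rough illustration of that workaround (the --what-is flag name is taken from the comment above, and the printed command is a placeholder):

```bash
# 1) Print the underlying command instead of executing it
mlpstorage training run --model unet3d ... --what-is
# 2) Re-run the printed command yourself, with the binding flags added, e.g.
mpiexec -n 8 --cpu-bind depth -d 8 <printed benchmark command and arguments>
```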

zhenghh04 avatar Jun 18 '25 14:06 zhenghh04

@rodrigonascimento @garimajain05

The main branch has an option for passing extra MPI flags: https://github.com/mlcommons/storage/blob/30cac25d5dcdac16715f13f48c85e24af4ae8b2b/mlpstorage/cli.py#L256C1-L257C1

Could you try the following setup:

--mpi-params --bind-to-core --cpus-per-proc 8

--cpus-per-proc should not be smaller than the read_threads value.
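If the CLI parser rejects values that start with --, one thing worth trying is the = form so the flags are passed as a single quoted value. This is only a sketch; how mlpstorage splits that value before handing it to mpiexec has not been verified here.

```bash
# Sketch (untested): pass the MPI flags as one quoted value so argparse does not
# mistake --bind-to-core for a new mlpstorage option
mlpstorage training run ... --mpi-params="--bind-to-core --cpus-per-proc 8"
```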

zhenghh04 avatar Jun 19 '25 16:06 zhenghh04

Hi @zhenghh04, the checkpointing datasize subcommand has no "args.mpi_params", but it is used in the function update_args:

cli.py

Line 330:

```python
def add_checkpointing_arguments(checkpointing_parsers):
    ....
    if _parser == run_benchmark:
        _parser.add_argument('--exec-type', '-et', type=EXEC_TYPE, choices=list(EXEC_TYPE),
                             default=EXEC_TYPE.MPI, help=help_messages['exec_type'])
        add_mpi_group(_parser)
```

Line 445:

```python
def update_args(args):
    ....
    if args.mpi_params:
        flattened_mpi_params = [item for sublist in args.mpi_params for item in sublist]
        setattr(args, 'mpi_params', flattened_mpi_params)
```

This will lead to the following error:

```
mlpstorage checkpointing datasize --hosts 10.1.1.1 10.1.1.2 --model llama3-70b --loops 1 --client-host-memory-in-gb 2048 --num-processes 8 --checkpoint-folder /mnt/mlperf/llama3-70b_w10_r10_p8_m2048 --results-dir /mnt/mlperf/result --closed
Traceback (most recent call last):
  File "/root/.venvs/MLPerfStorageV2/bin/mlpstorage", line 8, in <module>
    sys.exit(main())
  File "/root/.venvs/MLPerfStorageV2/lib/python3.10/site-packages/mlpstorage/main.py", line 111, in main
    update_args(args)
  File "/root/.venvs/MLPerfStorageV2/lib/python3.10/site-packages/mlpstorage/cli.py", line 445, in update_args
    if args.mpi_params:
AttributeError: 'Namespace' object has no attribute 'mpi_params'
```

txu2k8 avatar Jun 26 '25 06:06 txu2k8

@garimajain05 you only need to set --mpi-params --bind-to-core --cpus-per-proc 8 when you are doing a training or checkpointing run. For datasize, you don't have to set those.

zhenghh04 avatar Jun 26 '25 16:06 zhenghh04

mlpstorage training run --hosts 10.236.203.208 --num-client-hosts 1 --client-host-memory-in-gb 196 --num-accelerators 16 --accelerator-type h100 --model unet3d --data-dir /mnt/lustrePF/unet3d_data --results-dir /mnt/lustrePF/unet3d_results --param dataset.num_files_train=28000 --mpi-params --bind-to-core --cpus-per-proc 8

Could anyone please give the correct syntax here?

Output: mlpstorage training run: error: argument --mpi-params: expected at least one argument

kailasgoliwadekar avatar Sep 16 '25 14:09 kailasgoliwadekar