Guidance on Implementing CPU Binding for UNet3D (PyTorch) with MPI
I recently joined the MLCommons group and have been looking into UNet3D's performance. It was mentioned during our discussions that UNet3D does not fully utilize the available bandwidth, potentially because it is the only model implemented in PyTorch.
To address this, I understand we may need to implement CPU binding via MPI. I wanted to reach out to ask:
- Has anyone in the group already explored or implemented CPU binding for UNet3D?
- Are there recommended practices or existing examples that I could refer to?
- Should this be done within the training script or as part of the MPI launch configuration?
Any guidance, documentation, or pointers would be greatly appreciated!
You can try `--cpu-bind depth -d num_workers`, where `num_workers` is the number of read threads.
@zhenghh04 would you mind sharing an example?
For example, if you are using 4 read threads, you can do

```bash
mpiexec --cpu-bind depth -d 4
```
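For context, the `--cpu-bind depth -d N` spelling appears to come from the HPE Cray (PALS) `mpiexec` rather than Open MPI; it gives each rank a block of N hardware threads. A minimal sketch, assuming 8 ranks and a placeholder for whatever benchmark command mlpstorage actually launches:

```bash
# Sketch (HPE Cray PALS mpiexec assumed): launch 8 ranks, binding each rank
# to a block of 4 hardware threads so its 4 read threads land on distinct CPUs.
# "your_benchmark_command" is a placeholder, not the real mlpstorage target.
mpiexec -n 8 --cpu-bind depth -d 4 your_benchmark_command
```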
For process binding instructions, see the Open MPI mpiexec man page: https://www.open-mpi.org/doc/v3.0/man1/mpiexec.1.php
`--bind-to <foo>`
Bind processes to the specified object; the default is core. Supported options include slot, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board, and none.

`-cpus-per-proc, --cpus-per-proc <#perproc>`
Bind each process to the specified number of CPUs. (Deprecated in favor of `--map-by <obj>:PE=n`.)

`-cpus-per-rank, --cpus-per-rank <#perrank>`
Alias for `-cpus-per-proc`. (Deprecated in favor of `--map-by <obj>:PE=n`.)

`-bind-to-core, --bind-to-core`
Bind processes to cores. (Deprecated in favor of `--bind-to core`.)

`-bind-to-socket, --bind-to-socket`
Bind processes to processor sockets. (Deprecated in favor of `--bind-to socket`.)

`-report-bindings, --report-bindings`
Report any bindings for launched processes.
You can try the following. The main thing to avoid is having all the reader threads on the same core.

```bash
mpiexec --bind-to-core --cpus-per-proc 4 -report-bindings
```
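Per the man page excerpt above, the deprecated `--cpus-per-proc` form maps to `--map-by <obj>:PE=n`, so the same binding can also be written in the non-deprecated spelling. A sketch, with the rank count and `./app` as placeholders:

```bash
# Equivalent non-deprecated Open MPI form: give each rank 4 processing
# elements (cores), bind to cores, and print the resulting bindings.
mpiexec -n 8 --map-by slot:PE=4 --bind-to core --report-bindings ./app
```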
@zhenghh04 how could we incorporate this into the mlpstorage command? For example, to change the number of reader threads I am using `reader.read_threads`:

```bash
mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 100 --num-accelerators 8 --accelerator-type h100 --model unet3d --data-dir /mnt/june12/data/ --results-dir /mnt/june12/result --checkpoint-folder False --param dataset.num_files_train=28000 reader.read_threads=8
```
Right now, you can print out the mpiexec command from mlpstorage with --what-is, then manually add the binding flags to that mpiexec command and run the mpiexec command directly rather than through mlpstorage. We may be able to modify mlpstorage to allow extra flags to be given.
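As a rough illustration of that manual workflow (hedged: the exact output of `--what-is` and the benchmark command it wraps will differ on your setup):

```bash
# 1. Ask mlpstorage to print the underlying mpiexec command instead of running it.
mlpstorage training run ... --what-is

# 2. Re-run the printed mpiexec line yourself with the binding flags added,
#    e.g. (illustrative placeholder for the printed part):
mpiexec --cpu-bind depth -d 4 <printed benchmark command and arguments>
```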
@rodrigonascimento @garimajain05
The main branch has an option for passing extra MPI flags: https://github.com/mlcommons/storage/blob/30cac25d5dcdac16715f13f48c85e24af4ae8b2b/mlpstorage/cli.py#L256C1-L257C1
Could you try the following setup:

```bash
--mpi-params --bind-to-core --cpus-per-proc 8
```

`--cpus-per-proc` should not be smaller than the `reader.read_threads` value.
Hi @zhenghh04, the checkpointing datasize subcommand has no `args.mpi_params`, but it is used in the function `update_args`:

cli.py, line 330:

```python
def add_checkpointing_arguments(checkpointing_parsers):
    ...
    if _parser == run_benchmark:
        _parser.add_argument('--exec-type', '-et', type=EXEC_TYPE, choices=list(EXEC_TYPE),
                             default=EXEC_TYPE.MPI, help=help_messages['exec_type'])
        add_mpi_group(_parser)
```

cli.py, line 445:

```python
def update_args(args):
    ...
    if args.mpi_params:
        flattened_mpi_params = [item for sublist in args.mpi_params for item in sublist]
        setattr(args, 'mpi_params', flattened_mpi_params)
```
This will lead to the following error:
```bash
mlpstorage checkpointing datasize --hosts 10.1.1.1 10.1.1.2 --model llama3-70b --loops 1 --client-host-memory-in-gb 2048 --num-processes 8 --checkpoint-folder /mnt/mlperf/llama3-70b_w10_r10_p8_m2048 --results-dir /mnt/mlperf/result --closed
```

```
Traceback (most recent call last):
  File "/root/.venvs/MLPerfStorageV2/bin/mlpstorage", line 8, in <module>
```
@garimajain05 you only need to set `--mpi-params --bind-to-core --cpus-per-proc 8` when you are doing a training or checkpointing run. For datasize, you don't have to set those.
```bash
mlpstorage training run --hosts 10.236.203.208 --num-client-hosts 1 --client-host-memory-in-gb 196 --num-accelerators 16 --accelerator-type h100 --model unet3d --data-dir /mnt/lustrePF/unet3d_data --results-dir /mnt/lustrePF/unet3d_results --param dataset.num_files_train=28000 --mpi-params --bind-to-core --cpus-per-proc 8
```
Could anyone please give the correct syntax here?

Output:

```
mlpstorage training run: error: argument --mpi-params: expected at least one argument
```
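That error looks like standard argparse behavior rather than a problem with the flag values: tokens beginning with `-` (such as `--bind-to-core`) are parsed as new options instead of values for `--mpi-params`, so `--mpi-params` appears to receive no argument. One common argparse workaround, untested against mlpstorage and assuming nothing about how it splits the value afterwards, is to deliver the flags as a single token:

```bash
# Untested sketch: use the --opt=value form (or quoting) so the shell and
# argparse hand the MPI flags over as one token rather than as separate
# mlpstorage options.
mlpstorage training run ... --mpi-params="--bind-to-core --cpus-per-proc 8"
```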