
Assigning different accelerator count per host does not work

Open sajikvr opened this issue 6 months ago • 10 comments

Assigning a different accelerator count per host, where the total number of accelerators is not divisible by the host count (e.g., 9 accelerators across 2 hosts, with 5 and 4 accelerators per host), does not work. If the accelerator count is divisible by the host count, it works.

Specifically, the following command fails:

mlpstorage training run --hosts newmlvm1:5,newmlvm2:4 --model unet3d --data-dir /mnt/training/ --params reader.read_threads=20 dataset.num_files_train=25000 dataset.num_subfolders_train=4 reader.odirect=true checkpoint.checkpoint_folder=/mnt/training --client-host-memory-in-gb 192 --num-accelerators 9 --accelerator-type h100 --checkpoint-folder /mnt/training

The error is:

  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 459, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 153, in __init__
    self.stats = StatsCounter()
                 ^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py", line 105, in __init__
    self.MPI.comm().Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
  File "src/mpi4py/MPI.src/Comm.pyx", line 1100, in mpi4py.MPI.Comm.Reduce
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 459, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 153, in __init__
    self.stats = StatsCounter()
                 ^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py", line 105, in __init__
    self.MPI.comm().Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
  File "src/mpi4py/MPI.src/Comm.pyx", line 1100, in mpi4py.MPI.Comm.Reduce
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated

The following command works:

mlpstorage training run --hosts newmlvm1:4,newmlvm2:6 --model unet3d --data-dir /mnt/training/ --params reader.read_threads=20 dataset.num_files_train=25000 dataset.num_subfolders_train=4 reader.odirect=true checkpoint.checkpoint_folder=/mnt/training --client-host-memory-in-gb 192 --num-accelerators 10 --accelerator-type h100 --checkpoint-folder /mnt/training

sajikvr avatar Jun 16 '25 20:06 sajikvr

This is a result of the definition of "host node" that we used in v1.0. You can argue for changing the definition below, but the fact that the code doesn't work for differing accelerator counts isn't relevant to submissions, since submissions aren't supposed to do that.

A host node is defined as the minimum unit by which the load upon the storage system under test can be increased. Every host node must run the same number of simulated accelerators. A host node can be instantiated by running the MLPerf Storage benchmark code within a Container or within a VM guest image or natively within an entire physical system. The number of Containers or VM guest images per physical system and the CPU resources per host node is up to the submitter. Note that the maximum DRAM available to any host node must be used when calculating the dataset size to be generated for the test.

FileSystemGuy avatar Jun 17 '25 00:06 FileSystemGuy

@FileSystemGuy I thought the reason for adding the custom accelerator distribution was to allow heterogeneous hosts and give more flexibility for testing, wasn't it?

sajikvr avatar Jun 17 '25 01:06 sajikvr

It could be; my memory of the history is fuzzy right now (I just got back from ISC in Germany and jet lag is killing me). My comment above came from just having (re)read the rules doc and seeing the passage above. You can make the argument to the peer review group that we need to change the rules for this case.

FileSystemGuy avatar Jun 17 '25 01:06 FileSystemGuy

We hit the same error when running checkpointing for a 70B model with 64 processes. We have a limited number of hardware hosts, and each host has only 256 GB of memory (912 GB total is required). It's very difficult to gather a host count that is a divisor of 64 (1/2/4/8/16/32/64).

txu2k8 avatar Jun 18 '25 07:06 txu2k8

Ok, we talked about this in detail at the rules review meeting this morning. The core problem is that DLIO asks MPI to do a reduce operation across all of the nodes in the MPI session to collect the results of the run. If some of those nodes have different numbers of accelerators, the buffers holding those results on those nodes are a different size than the buffers on the other nodes, which MPI doesn't know how to process.
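
For illustration, here is a minimal sketch of that failure mode, assuming (hypothetically) that each rank sizes its buffer by its local accelerator count; this is not the actual DLIO code.

# Hypothetical sketch of the mismatch; not the actual DLIO code.
# Run with, e.g.: mpirun -np 2 python reduce_mismatch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical per-host accelerator counts: 4 on rank 0, 5 on rank 1.
local_accelerators = 4 if rank == 0 else 5

# Each rank sizes its buffer by its own accelerator count, so the element
# counts passed to the collective disagree across ranks.
host_memory = np.ones(local_accelerators, dtype=np.float64)
host_memory_agg = np.zeros(local_accelerators, dtype=np.float64) if rank == 0 else None

# A collective reduce requires the same count on every rank. Here the root's
# 4-element receive buffer cannot hold rank 1's 5-element contribution, so MPI
# typically raises MPI_ERR_TRUNCATE ("message truncated"); the exact failure
# is implementation-dependent.
comm.Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)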

This code is common to all training and checkpoint workloads, but it is only for reporting the results, not for the simulated workloads themselves. It did not come up before now because the Training workloads require equal numbers of accelerators per host, but the new Checkpoint workloads do not. Since the checkpoint workloads require 64, 128, 512, etc., simulated accelerators, requiring an equal number of simulated accelerators per host is simply too chunky: it doesn't divide well into the typical number of hosts in performance labs at storage vendors.

Huihuo is still investigating the code and will generate a patch you will be able to apply to DLIO. In the short term, the only thing we can suggest is using the same number of accelerators per host, especially for the Training workloads, where doing a few final runs with an additional couple of accelerators once you have the patch is straightforward.
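
As a generic illustration of how such a mismatch can be avoided (a sketch only, not necessarily what the DLIO patch will do), the aggregation can either reduce a fixed-size buffer per rank or use a variable-count collective such as Gatherv.

# Hypothetical sketch; not the actual DLIO patch.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical per-rank payload whose length varies by host.
local_stats = np.arange(4 + rank, dtype=np.float64)

# Option 1: reduce a fixed-size buffer (here a single scalar sum), so every
# rank contributes the same element count.
local_total = np.array([local_stats.sum()])
global_total = np.zeros(1) if rank == 0 else None
comm.Reduce(local_total, global_total, op=MPI.SUM, root=0)

# Option 2: gather variable-length buffers with Gatherv, which is given the
# per-rank counts explicitly, so differing lengths are allowed.
counts = comm.gather(local_stats.size, root=0)
if rank == 0:
    recvbuf = np.empty(sum(counts), dtype=np.float64)
    comm.Gatherv(local_stats, (recvbuf, counts), root=0)
else:
    comm.Gatherv(local_stats, None, root=0)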

FileSystemGuy avatar Jun 18 '25 16:06 FileSystemGuy

@sajikvr @txu2k8, could you please test the branch and let me know?

I fixed this in DLIO. You'll have to use the specific branch, bugfix/inhomogeneous_setup, by running the following command:

pip install --upgrade \
  git+https://github.com/argonne-lcf/dlio_benchmark.git@bugfix/inhomogeneous_setup#egg=dlio_benchmark

Please let me know how it works. Put your comment here: https://github.com/argonne-lcf/dlio_benchmark/pull/299

zhenghh04 avatar Jun 19 '25 14:06 zhenghh04

@zhenghh04 I could not get the patch installed:

(myenv) nutanix@clientvm1:~$ pip install --upgrade \
  git+https://github.com/argonne-lcf/dlio_benchmark.git@bugfix/inhomogeneous_setup#egg=dlio_benchmark
Collecting dlio_benchmark
  Cloning https://github.com/argonne-lcf/dlio_benchmark.git (to revision bugfix/inhomogeneous_setup) to /tmp/pip-install-ll2qfvrp/dlio-benchmark_96241176733f44e697bbc0241ba7fc41
  Running command git clone --filter=blob:none --quiet https://github.com/argonne-lcf/dlio_benchmark.git /tmp/pip-install-ll2qfvrp/dlio-benchmark_96241176733f44e697bbc0241ba7fc41
  WARNING: Did not find branch or tag 'bugfix/inhomogeneous_setup', assuming revision or ref.
  Running command git checkout -q bugfix/inhomogeneous_setup
  error: pathspec 'bugfix/inhomogeneous_setup' did not match any file(s) known to git
  error: subprocess-exited-with-error

  × git checkout -q bugfix/inhomogeneous_setup did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git checkout -q bugfix/inhomogeneous_setup did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

I tried to manually patch the files, but I see this error when running:

mlpstorage training run --hosts 10.57.205.81:51,10.57.205.82:49 --model resnet50 --data-dir /mnt/data --params reader.read_threads=2 dataset.num_files_train=19001 dataset.num_subfolders_train=100 checkpoint.checkpoint_folder=/mnt/data --client-host-memory-in-gb 256 --num-accelerators 100 --accelerator-type h100 --checkpoint-folder /mnt/data --allow-run-as-root

Error executing job with overrides: ['workload=resnet50_h100', '++workload.reader.read_threads=2', '++workload.dataset.num_files_train=19001', '++workload.dataset.num_subfolders_train=100', '++workload.checkpoint.checkpoint_folder=/mnt/data', '++workload.dataset.data_folder=/mnt/data/resnet50']
Traceback (most recent call last):
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 459, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 153, in __init__
    self.stats = StatsCounter()
                 ^^^^^^^^^^^^^^
  File "/home/nutanix/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py", line 105, in __init__
    self.MPI.comm().Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
  File "src/mpi4py/MPI.src/Comm.pyx", line 1100, in mpi4py.MPI.Comm.Reduce
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated

sajikvr avatar Jun 24 '25 19:06 sajikvr

I tried again, and it looks like it is working; I could use 5 clients to host 19 GPUs:

mlpstorage training run --hosts 10.57.205.80:4,10.57.205.82:4,10.57.205.84:4,10.57.205.85:4,10.57.205.86:3 --model unet3d --data-dir /mnt/data --params reader.read_threads=10 dataset.num_files_train=70015 dataset.num_subfolders_train=200 checkpoint.checkpoint_folder=/mnt/data reader.odirect=true --client-host-memory-in-gb 256 --num-accelerators 19 --accelerator-type h100 --checkpoint-folder /mnt/data --allow-run-as-root

sajikvr avatar Jun 25 '25 20:06 sajikvr

@johnugeorge @FileSystemGuy, I plan to merge this into mlperf_storage_v2.0 in DLIO. Let me know if you feel otherwise.

zhenghh04 avatar Jun 26 '25 02:06 zhenghh04

@zhenghh04 As we discussed, we will make this fix optional for submitters. We will not merge it until the end of the release cycle.

johnugeorge avatar Jul 01 '25 12:07 johnugeorge