
Running resnet50 with more than 5 hosts fails

Open sajikvr opened this issue 6 months ago • 4 comments

Command used

mlpstorage training run --hosts 10.57.205.80,10.57.205.86,10.57.205.87,10.57.205.84,10.57.205.82,10.57.205.85 --model resnet50 --data-dir /mnt/data --params reader.read_threads=4 dataset.num_files_train=70000 dataset.num_subfolders_train=200 checkpoint.checkpoint_folder=/mnt/data --client-host-memory-in-gb 256 --num-accelerators 200 --accelerator-type h100 --checkpoint-folder /mnt/data --allow-run-as-root

Error

2025-06-25 20:59:16|STATUS: Running benchmark command:: mpirun -n 200 -host 10.57.205.80:34,10.57.205.86:34,10.57.205.87:33,10.57.205.84:33,10.57.205.82:33,10.57.205.85:33 --allow-run-as-root /home/nutanix/.venvs/myenv/bin/dlio_benchmark workload=resnet50_h100 ++hydra.run.dir=/tmp/mlperf_storage_results/training/resnet50/run/20250625_205916 ++hydra.output_subdir=dlio_config ++workload.reader.read_threads=4 ++workload.dataset.num_files_train=70000 ++workload.dataset.num_subfolders_train=200 ++workload.checkpoint.checkpoint_folder=/mnt/data ++workload.dataset.data_folder=/mnt/data/resnet50 --config-dir=/home/nutanix/storage/configs/dlio
[clientvm1:148878] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1935
[clientvm1:148878] *** Process received signal ***
[clientvm1:148878] Signal: Segmentation fault (11)
[clientvm1:148878] Signal code: Address not mapped (1)
[clientvm1:148878] Failing at address: 0x28
[clientvm1:148878] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x79be18c45330]
[clientvm1:148878] [ 1] /lib/x86_64-linux-gnu/libpmix.so.2(+0x13a8aa)[0x79be15f3a8aa]
[clientvm1:148878] [ 2] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_store_modex+0x8f3)[0x79be15f36013]
[clientvm1:148878] [ 3] /lib/x86_64-linux-gnu/libpmix.so.2(+0x13a423)[0x79be15f3a423]
[clientvm1:148878] [ 4] /lib/x86_64-linux-gnu/libpmix.so.2(+0x7a99b)[0x79be15e7a99b]
[clientvm1:148878] [ 5] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x1f2a8)[0x79be18ecf2a8]
[clientvm1:148878] [ 6] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x4af)[0x79be18ed0faf]
[clientvm1:148878] [ 7] /lib/x86_64-linux-gnu/libpmix.so.2(+0xa8eb1)[0x79be15ea8eb1]
[clientvm1:148878] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x79be18c9caa4]
[clientvm1:148878] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x79be18d29c3c]
[clientvm1:148878] *** End of error message ***
[clientvm6:136893] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1935

Up to 5 hosts works without any issues. The failure is also not specific to any one client; it is the number of clients that triggers it. With 7 clients it still fails, but with a slightly different error:

2025-06-25 21:00:41|STATUS: Running benchmark command:: mpirun -n 200 -host 10.57.205.80:29,10.57.205.86:29,10.57.205.87:29,10.57.205.84:29,10.57.205.82:28,10.57.205.85:28,10.57.205.88:28 --allow-run-as-root /home/nutanix/.venvs/myenv/bin/dlio_benchmark workload=resnet50_h100 ++hydra.run.dir=/tmp/mlperf_storage_results/training/resnet50/run/20250625_210041 ++hydra.output_subdir=dlio_config ++workload.reader.read_threads=4 ++workload.dataset.num_files_train=70000 ++workload.dataset.num_subfolders_train=200 ++workload.checkpoint.checkpoint_folder=/mnt/data ++workload.dataset.data_folder=/mnt/data/resnet50 --config-dir=/home/nutanix/storage/configs/dlio
[clientvm1:150139] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm7:85200] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm8:10748] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm2:272631] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm4:241681] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm5:281061] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
[clientvm6:138170] PMIX ERROR: PMIX_ERR_NOT_SUPPORTED in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 240
2025-06-25 21:00:44|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/resnet50/run/20250625_210041/training_20250625_210041_metadata.json
(myenv) nutanix@clientvm1:/mnt/data/resnet50/train$
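
For reference, the per-host slot counts in the two mpirun lines above (34,34,33,33,33,33 for 6 hosts and 29,29,29,29,28,28,28 for 7 hosts) are consistent with an even split of the 200 processes, with the remainder handed to the first hosts. The sketch below reproduces those numbers. The function name `split_slots` is hypothetical, and this is not necessarily how mlpstorage computes the host list; it only shows that each host ends up with roughly 200 / N ranks.

```python
def split_slots(total_procs, hosts):
    """Spread total_procs over hosts as evenly as possible.

    Hypothetical helper: the first (total_procs % len(hosts)) hosts
    receive one extra process, matching the counts seen in the logs.
    """
    base, extra = divmod(total_procs, len(hosts))
    return [
        f"{host}:{base + (1 if index < extra else 0)}"
        for index, host in enumerate(hosts)
    ]


# Host lists copied from the two mpirun commands in this issue.
six_hosts = [
    "10.57.205.80", "10.57.205.86", "10.57.205.87",
    "10.57.205.84", "10.57.205.82", "10.57.205.85",
]
seven_hosts = six_hosts + ["10.57.205.88"]

print(",".join(split_slots(200, six_hosts)))
print(",".join(split_slots(200, seven_hosts)))
```

Both printed strings match the `-host` arguments in the logged commands, so the failing runs differ from the working ones only in the number of hosts, not in how the ranks are laid out.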

sajikvr avatar Jun 25 '25 21:06 sajikvr

I used this workaround, which seems to work:

export PMIX_MCA_gds=hash
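
The errors above all originate in gds_shmem.c, and this variable asks PMIx to use its hash data-store component instead of the shared-memory one. If the benchmark is launched from a script rather than an interactive shell, the same setting can be applied per process. The following is a minimal sketch: the helper names are hypothetical, and the child command is a stand-in that only echoes the variable. In practice it would be replaced with the mlpstorage command from this issue. Whether the variable must also be set on the remote hosts is not established in this thread.

```python
import os
import subprocess
import sys


def env_with_pmix_hash(base_env=None):
    """Return a copy of the environment with PMIX_MCA_gds=hash set.

    Hypothetical helper; the variable name and value come from the
    workaround reported in this issue.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PMIX_MCA_gds"] = "hash"
    return env


def run_with_pmix_hash(command):
    """Run command with the workaround applied and return its stdout."""
    result = subprocess.run(
        command,
        env=env_with_pmix_hash(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in child process: prints the value it sees for the variable.
echo_command = [
    sys.executable,
    "-c",
    "import os; print(os.environ['PMIX_MCA_gds'])",
]
print(run_with_pmix_hash(echo_command))
```

Passing the environment explicitly keeps the workaround scoped to the benchmark run instead of changing the whole login session.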

sajikvr avatar Jun 25 '25 21:06 sajikvr

I set export PMIX_MCA_gds=hash, but then hit a different error:

[OUTPUT] 2025-07-07T11:21:21.557515 Running DLIO [Training] with 1024 process(es)
Error executing job with overrides: ['workload=resnet50_h100', '++workload.dataset.num_files_train=239542', '++workload.dataset.num_subfolders_train=24', '++workload.reader.read_threads=1', '++workload.dataset.data_folder=/mnt/alluxio/alluxio-fuse/s3/resnet50_239542_31999GB/resnet50']
Traceback (most recent call last):
  File "/home/ubuntu/.venvs/myenv/bin/dlio_benchmark", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 477, in main
    run_benchmark()
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 459, in run_benchmark
    benchmark = DLIOBenchmark(cfg['workload'])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 153, in __init__
    self.stats = StatsCounter()
                 ^^^^^^^^^^^^^^
  File "/home/ubuntu/.venvs/myenv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py", line 105, in __init__
    self.MPI.comm().Reduce(host_memory, host_memory_agg, op=MPI.SUM, root=0)
  File "src/mpi4py/MPI.src/Comm.pyx", line 1100, in mpi4py.MPI.Comm.Reduce
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated

humengyu2012 avatar Jul 07 '25 11:07 humengyu2012

@humengyu2012 How many accelerators were you running when you hit this error?

xanpeng avatar Jul 08 '25 06:07 xanpeng

1024 accelerators on 64 hosts, each host with 92 GB of memory. @xanpeng

humengyu2012 avatar Jul 08 '25 07:07 humengyu2012