
`--exec-type=docker` Launches Only a Single Process, Ignores `--num-accelerators`

Open · garimajain05 opened this issue 5 months ago · 1 comment

When running `mlpstorage training run` with `--exec-type=docker` and `--num-accelerators=6`, only a single process is launched, even though multiple accelerators are specified:

[OUTPUT] Running DLIO [Training & Checkpointing] with 1 process(es)

The benchmark command generated:

/local/.venv/bin/dlio_benchmark workload=unet3d_h100 \
++hydra.run.dir=/mnt/array/j25/result_j16/run7/training/unet3d/run/20250812_115117 \
++hydra.output_subdir=dlio_config \
++workload.dataset.num_files_train=21000 \
++workload.dataset.num_subfolders_train=70 \
++workload.reader.read_threads=26 \
++workload.reader.odirect=True \
++workload.reader.prefetch_size=1 \
++workload.workflow.profiling=iostat \
++workload.workflow.hydra_logging=enabled \
++workload.workflow.job_logging=enabled \
++workload.dataset.data_folder=/mnt/array/j25/data/unet3d \
--config-dir=/local/.venv/lib/python3.10/site-packages/configs/dlio

Works as Expected with --exec-type=mpi

Using MPI does correctly spawn all 6 processes:

mpirun -n 6 -host <ip1>:6 /local/.venv/bin/dlio_benchmark workload=unet3d_h100 \
++hydra.run.dir=/mnt/array/j25/result_j16/run7/training/unet3d/run/20250812_121415 \
...

Output:

[OUTPUT] Running DLIO [Training & Checkpointing] with 6 process(es)

Additional Observation:

In version 2.0 of the code, the logic that wraps the benchmark command with `mpirun` is applied only when `exec_type == mpi` and is bypassed entirely when `exec_type == docker`.

Relevant code snippet:

if self.args.exec_type == EXEC_TYPE.MPI:
    self.logger.debug(f'Generating MPI Command with binary "{self.args.mpi_bin}"')
    mpi_prefix = generate_mpi_prefix_cmd(
        self.args.mpi_bin, self.args.hosts, self.args.num_processes,
        self.args.oversubscribe, self.args.allow_run_as_root,
        self.args.mpi_params, self.logger
    )
    cmd = f"{mpi_prefix} {cmd}"

It seems the assumption was that Docker manages process launching internally, but it does not; as a result, DLIO falls back to a single process.
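One possible direction, sketched here only to illustrate the diagnosis (the helper name `build_command` and the assumption that `mpirun` is available inside the container are mine, not the project's actual API), is to apply the same `mpirun` prefix for the docker exec type as well:

```python
# Hypothetical sketch: treat "docker" like "mpi" when deciding whether to
# wrap the benchmark command with an mpirun prefix.
# Assumption: the docker image ships mpirun, so the same prefix works there.

def build_command(cmd: str, exec_type: str, mpi_prefix: str) -> str:
    """Return the final benchmark command for the given exec type."""
    # Docker does not fan processes out on its own, so it needs the same
    # mpirun wrapping that the mpi exec type already receives.
    if exec_type in ("mpi", "docker"):
        return f"{mpi_prefix} {cmd}"
    return cmd

print(build_command("dlio_benchmark workload=unet3d_h100", "docker",
                    "mpirun -n 6"))
```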

garimajain05 avatar Aug 12 '25 17:08 garimajain05

Command with --exec-type docker and --num-accelerators 4: mlpstorage training run --hosts <ip1> --num-client-hosts 1 --client-host-memory-in-gb 256 --num-accelerators 4 --accelerator-type h100 --exec-type docker --model unet3d --data-dir /mnt/array/j25/data/ --results-dir /mnt/array/j25/result_j16/run7 --param dataset.num_files_train=14000 dataset.num_subfolders_train=70 reader.read_threads=26 reader.odirect=True reader.prefetch_size=1

Output: why does the benchmark run with only 1 accelerator when the accelerator count is greater than 1, i.e. `--num-accelerators 4`?

Setting attr from num_accelerators to 4
Hosts is: ['<ip1>']
Hosts is: ['<ip1>']
2025-08-13 09:28:51|STATUS: Benchmark results directory: /mnt/array/j25/result_j16/run7/training/unet3d/run/20250813_092851
2025-08-13 09:28:51|INFO: Found benchmark run: training_run_unet3d_20250813_092851
2025-08-13 09:28:51|STATUS: Verifying benchmark run for training_run_unet3d_20250813_092851
2025-08-13 09:28:51|RESULT: Minimum file count dictated by 500 step requirement of given accelerator count and batch size.
2025-08-13 09:28:51|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 14000 (Parameter: Overrode Parameters)
2025-08-13 09:28:51|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_subfolders_train = 70 (Parameter: Overrode Parameters)
2025-08-13 09:28:51|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.read_threads = 26 (Parameter: Overrode Parameters)
2025-08-13 09:28:51|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.odirect = True (Parameter: Overrode Parameters)
2025-08-13 09:28:51|STATUS: Closed: [CLOSED] Closed parameter override allowed: reader.prefetch_size = 1 (Parameter: Overrode Parameters)
2025-08-13 09:28:51|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='unet3d', run_datetime='20250813_092851')])
2025-08-13 09:28:51|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2025-08-13 09:28:51|STATUS: Running benchmark command:: /local/.venv/bin/dlio_benchmark workload=unet3d_h100 ++hydra.run.dir=/mnt/array/j25/result_j16/run7/training/unet3d/run/20250813_092851 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=14000 ++workload.dataset.num_subfolders_train=70 ++workload.reader.read_threads=26 ++workload.reader.odirect=True ++workload.reader.prefetch_size=1 ++workload.dataset.data_folder=/mnt/array/j25/data/unet3d --config-dir=/local/.venv/lib/python3.10/site-packages/configs/dlio
[OUTPUT] 2025-08-13T09:28:55.054409 Running DLIO [Training & Checkpointing] with 1 process(es)
[WARNING] Number of files for training in /mnt/array/j25/data/unet3d/train (700000) is more than requested (14000). A subset of files will be used
[OUTPUT] 2025-08-13T09:28:57.338618 Model size: 0.000010 GB
[OUTPUT] 2025-08-13T09:28:57.338712 Total checkpoint size: 0.000010 GB
[OUTPUT] 2025-08-13T09:28:57.387789 Max steps per epoch: 2000 = 1 * 14000 / 7 / 1 (samples per file * num files / batch size / comm size)
[OUTPUT] 2025-08-13T09:28:57.459828 Starting epoch 1: 2000 steps expected
[OUTPUT] 2025-08-13T09:28:57.460133 Starting block 1
[OUTPUT] 2025-08-13T09:39:53.020619 Ending block 1 - 2000 steps completed in 655.56 s
[OUTPUT] 2025-08-13T09:39:53.060324 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 98.9445
[OUTPUT] 2025-08-13T09:39:53.060389 Epoch 1 - Block 1 [Training] Throughput (samples/second): 21.4095
[OUTPUT] 2025-08-13T09:39:53.060414 Epoch 1 - Block 1 [Training] Computation time per step (second): 0.3233+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2025-08-13T09:39:53.065031 Ending epoch 1 - 2000 steps completed in 655.61 s
[OUTPUT] 2025-08-13T09:39:53.151748 Starting epoch 2: 2000 steps expected
[OUTPUT] 2025-08-13T09:39:53.152421 Starting block 1
[OUTPUT] 2025-08-13T09:50:48.199900 Ending block 1 - 2000 steps completed in 655.05 s
[OUTPUT] 2025-08-13T09:50:48.221245 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 98.9690
[OUTPUT] 2025-08-13T09:50:48.221312 Epoch 2 - Block 1 [Training] Throughput (samples/second): 21.4150
[OUTPUT] 2025-08-13T09:50:48.221337 Epoch 2 - Block 1 [Training] Computation time per step (second): 0.3233+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2025-08-13T09:50:48.223769 Ending epoch 2 - 2000 steps completed in 655.07 s
[OUTPUT] 2025-08-13T09:50:48.277373 Starting epoch 3: 2000 steps expected
[OUTPUT] 2025-08-13T09:50:48.277728 Starting block 1
[OUTPUT] 2025-08-13T10:01:43.512375 Ending block 1 - 2000 steps completed in 655.23 s
[OUTPUT] 2025-08-13T10:01:43.526991 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 99.0032
[OUTPUT] 2025-08-13T10:01:43.527047 Epoch 3 - Block 1 [Training] Throughput (samples/second): 21.4225
[OUTPUT] 2025-08-13T10:01:43.527071 Epoch 3 - Block 1 [Training] Computation time per step (second): 0.3233+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2025-08-13T10:01:43.529770 Ending epoch 3 - 2000 steps completed in 655.25 s
[OUTPUT] 2025-08-13T10:01:43.581745 Starting epoch 4: 2000 steps expected
[OUTPUT] 2025-08-13T10:01:43.582063 Starting block 1
[OUTPUT] 2025-08-13T10:12:38.752012 Ending block 1 - 2000 steps completed in 655.17 s
[OUTPUT] 2025-08-13T10:12:38.769227 Epoch 4 - Block 1 [Training] Accelerator Utilization [AU] (%): 99.0038
[OUTPUT] 2025-08-13T10:12:38.769285 Epoch 4 - Block 1 [Training] Throughput (samples/second): 21.4224
[OUTPUT] 2025-08-13T10:12:38.769310 Epoch 4 - Block 1 [Training] Computation time per step (second): 0.3233+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2025-08-13T10:12:38.772062 Ending epoch 4 - 2000 steps completed in 655.19 s
[OUTPUT] 2025-08-13T10:12:38.824156 Starting epoch 5: 2000 steps expected
[OUTPUT] 2025-08-13T10:12:38.824480 Starting block 1
[OUTPUT] 2025-08-13T10:23:35.349395 Ending block 1 - 2000 steps completed in 656.52 s
[OUTPUT] 2025-08-13T10:23:35.371108 Epoch 5 - Block 1 [Training] Accelerator Utilization [AU] (%): 98.8574
[OUTPUT] 2025-08-13T10:23:35.371196 Epoch 5 - Block 1 [Training] Throughput (samples/second): 21.3909
[OUTPUT] 2025-08-13T10:23:35.371224 Epoch 5 - Block 1 [Training] Computation time per step (second): 0.3233+/-0.0000 (set value: {'mean': 0.323})
[OUTPUT] 2025-08-13T10:23:35.372032 Starting saving checkpoint 1 after total step 2000 for epoch 5
[OUTPUT] 2025-08-13T10:23:35.375846 Finished saving checkpoint 1 for epoch 5 in 0.0038 s; Throughput: 0.0025 GB/s
[OUTPUT] 2025-08-13T10:23:35.378632 Ending epoch 5 - 2000 steps completed in 656.55 s
[OUTPUT] 2025-08-13T10:23:35.509268 Saved outputs in /mnt/array/j25/result_j16/run7/training/unet3d/run/20250813_092851
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 98.9549 (0.0552)
[METRIC] Training Throughput (samples/second): 21.4119 (0.0119)
[METRIC] Training I/O Throughput (MB/second): 2993.5823 (1.6691)
[METRIC] train_au_meet_expectation: success
[METRIC] ==========================================================
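The steps-per-epoch formula DLIO prints above ("samples per file * num files / batch size / comm size") makes the symptom easy to check: with a comm size of 1 (a single process) the run gets 2000 steps per epoch, whereas the requested four accelerators would split it into 500 steps each. A quick sketch of that arithmetic (the helper name is mine; the values come from the log):

```python
# Steps-per-epoch formula as reported by DLIO:
#   samples_per_file * num_files / batch_size / comm_size
def steps_per_epoch(samples_per_file, num_files, batch_size, comm_size):
    return samples_per_file * num_files // batch_size // comm_size

# Values from the log above: 1 sample/file, 14000 files, batch size 7.
print(steps_per_epoch(1, 14000, 7, 1))  # observed with a single process: 2000
print(steps_per_epoch(1, 14000, 7, 4))  # expected with 4 accelerators: 500
```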

garimajain05 avatar Aug 13 '25 14:08 garimajain05