mlpstorage incorrectly identifies llama3-8b on 16 GPUs as a qualified CLOSED submission

Open zhenghh04 opened this issue 6 months ago • 1 comments

See the log below. The run is reported as qualifying for the CLOSED category, but it should be classified as OPEN.

2025-06-20 09:55:15|STATUS: Benchmark results directory: ./results/eagle/n2x8/checkpointing/llama3-8b/20250620_095514
2025-06-20 09:55:15|INFO: Found benchmark run: checkpointing_run_llama3-8b_20250620_095514
2025-06-20 09:55:15|STATUS: Verifying benchmark run for checkpointing_run_llama3-8b_20250620_095514
2025-06-20 09:55:15|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='checkpointing', command='run', model='llama3-8b', run_datetime='20250620_095514')])
2025-06-20 09:55:15|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2025-06-20 09:55:15|STATUS: Instantiated the Checkpointing Benchmark...
2025-06-20 09:55:15|STATUS: Running benchmark command:: mpiexec -n 16 --ppn 8 --cpu-bind depth -d 16 /home/hzheng/crux/DLIO/dlio_benchmark/workspace/2025-06-19/pyenvs/2025-06-19/storage/bin/dlio_benchmark workload=llama3_8b ++hydra.run.dir=./results/eagle/n2x8/checkpointing/llama3-8b/20250620_095514 ++hydra.output_subdir=dlio_config ++workload.checkpoint.num_checkpoints_read=0 ++workload.checkpoint.num_checkpoints_write=10 ++workload.checkpoint.checkpoint_folder=.//checkpoints/n2x8/llama3-8b --config-dir=/lus/eagle/projects/PolarisAT/hzheng/crux/DLIO/dlio_benchmark/workspace/2025-06-19/storage/configs/dlio

@wvaske Could you please address this?

Jun 20 '25 14:06 zhenghh04

The same issue occurs with a larger process count (1024 processes):

2025-06-20 10:11:59|STATUS: Benchmark results directory: ./results/eagle/n128x8/checkpointing/llama3-8b/20250620_101159
2025-06-20 10:11:59|INFO: Found benchmark run: checkpointing_run_llama3-8b_20250620_101159
2025-06-20 10:11:59|STATUS: Verifying benchmark run for checkpointing_run_llama3-8b_20250620_101159
2025-06-20 10:11:59|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='checkpointing', command='run', model='llama3-8b', run_datetime='20250620_101159')])
2025-06-20 10:11:59|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2025-06-20 10:11:59|STATUS: Instantiated the Checkpointing Benchmark...
2025-06-20 10:11:59|STATUS: Running benchmark command:: mpiexec -n 1024 --ppn 8 --cpu-bind depth -d 16 /home/hzheng/crux/DLIO/dlio_benchmark/workspace/2025-06-19/pyenvs/2025-06-19/storage/bin/dlio_benchmark workload=llama3_8b ++hydra.run.dir=./results/eagle/n128x8/checkpointing/llama3-8b/20250620_101159 ++hydra.output_subdir=dlio_config ++workload.checkpoint.num_checkpoints_read=0 ++workload.checkpoint.num_checkpoints_write=10 ++workload.checkpoint.checkpoint_folder=.//checkpoints/n128x8/llama3-8b --config-dir=/lus/eagle/projects/PolarisAT/hzheng/crux/DLIO/dlio_benchmark/workspace/2025-06-19/storage/configs/dlio
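
To make the mismatch easier to spot across many runs, here is a minimal sketch of a log check. It is not part of mlpstorage: the function names `summarize_run` and `is_suspect` are hypothetical, and the table of process counts allowed for CLOSED runs is a placeholder assumption that should be filled in from the benchmark rules. The sample log is an abbreviated copy of the lines quoted in this issue.

```python
import re

# Patterns matched against the log lines quoted in this issue.
CATEGORY_RE = re.compile(r"qualifies for (\w+) category")
MODEL_RE = re.compile(r"model='([^']+)'")
NPROCS_RE = re.compile(r"mpiexec\s+-n\s+(\d+)")


def summarize_run(log_text):
    """Extract the reported category, model, and MPI process count from a log."""
    category = CATEGORY_RE.search(log_text)
    model = MODEL_RE.search(log_text)
    nprocs = NPROCS_RE.search(log_text)
    return {
        "category": category.group(1) if category else None,
        # strip() tolerates a model name broken by terminal line wrapping
        "model": model.group(1).strip() if model else None,
        "nprocs": int(nprocs.group(1)) if nprocs else None,
    }


def is_suspect(summary, allowed_closed_procs):
    """Return True when a run is reported as CLOSED but its process count
    is not in the allowed set for its model (per the caller-supplied table)."""
    if summary["category"] != "CLOSED" or summary["nprocs"] is None:
        return False
    allowed = allowed_closed_procs.get(summary["model"])
    if allowed is None:
        return False  # no rule known for this model, so nothing to flag
    return summary["nprocs"] not in allowed


# Abbreviated copy of the first log in this issue (command line shortened).
SAMPLE_LOG = (
    "2025-06-20 09:55:15|STATUS: Benchmark run qualifies for CLOSED category "
    "([RunID(program='checkpointing', command='run', model='llama3-8b', "
    "run_datetime='20250620_095514')])\n"
    "2025-06-20 09:55:15|STATUS: Running benchmark command:: "
    "mpiexec -n 16 --ppn 8 --cpu-bind depth -d 16 dlio_benchmark workload=llama3_8b\n"
)

# ASSUMPTION: placeholder table; replace with the counts the rules actually allow.
ASSUMED_CLOSED_PROCS = {"llama3-8b": {8}}

summary = summarize_run(SAMPLE_LOG)
print(summary, is_suspect(summary, ASSUMED_CLOSED_PROCS))
```

Passing the allowed process counts in as a parameter keeps the check independent of any particular version of the rules; only the table needs to change.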

Jun 20 '25 15:06 zhenghh04