storage
MLPerf™ Storage Benchmark Suite
Running only the read part: `mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102 --model llama3-70b --client-host-memory-in-gb 220 --num-processes 8 --checkpoint-folder /mnt/host_checkpointing --results-dir checkpoint_test_4_hosts_llama3-70b --num-checkpoints-read 1 --num-checkpoints-write 0 --allow-run-as-root` The test seems to be succeeding,...
Using 16 or 32 hosts, each with 256G memory, the test starts and each host has dlio_benchmark processes running, but the test itself makes no progress. 8b and 70b models run...
Assigning a different accelerator count per host, where the total number of accelerators is not divisible by the host count (e.g., 9 accelerators, 2 hosts, 5 & 4 accelerators on each...
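To make the uneven case concrete, here is a minimal Python sketch of how a total that does not divide evenly could be spread across hosts. `split_accelerators` is a hypothetical helper written for illustration only; it is not part of mlpstorage or DLIO, and how those tools actually place processes is not shown in this issue.

```python
def split_accelerators(total: int, hosts: int) -> list[int]:
    """Spread `total` accelerators over `hosts` as evenly as possible.

    Hypothetical helper for illustration; not an mlpstorage or DLIO API.
    The first `total % hosts` hosts each receive one extra accelerator.
    """
    if hosts <= 0:
        raise ValueError("hosts must be a positive integer")
    if total < 0:
        raise ValueError("total must be non-negative")
    base, extra = divmod(total, hosts)
    return [base + 1 if i < extra else base for i in range(hosts)]


# The case from the report: 9 accelerators on 2 hosts.
print(split_accelerators(9, 2))  # [5, 4]
```

The per-host counts always sum to the requested total, so an uneven layout such as 5 & 4 is the expected shape whenever the total is not a multiple of the host count.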
Since we are not merging the fix into v2.0 and agreed to allow using the fix if anyone needs it, we need to document it as an allowed change in...
In the last meeting, we agreed to allow the fix for issue #157 (https://github.com/mlcommons/storage/issues/157) in closed category submissions. We need to add that to the submission rules as...
Following up on the concern raised in issue #177, I noticed that although the Submission Guidelines have been updated, the `mlperf_storage_report.json` file still cannot be generated as expected. When...
I'm getting these warnings when running mlpstorage ```bash WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1750688708.806475 179026 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting...
Seeing DLIO hang during the first epoch when running with certain accelerator counts. Running with 6x a100 or 7x a100 causes the test to hang after printing the summary of...
See the log below. It should be set as OPEN ``` 2025-06-20 09:55:15|STATUS: Benchmark results directory: ./results/eagle/n2x8/checkpointing/llama3-8b/20250620_095514 2025-06-20 09:55:15|INFO: Found benchmark run: checkpointing_run_llama3-8b_20250620_095514 2025-06-20 09:55:15|STATUS: Verifying benchmark run for checkpointing_run_llama3-8b_20250620_095514...
I want to request that, for single-system power measurement, BMC/IPMI-reported power, or power reported in-band via ACPI or other methods, be allowed. This is much more detailed and useful than...