Results 7 issues of sajikvr

Command used `mlpstorage training run --hosts 10.57.205.80,10.57.205.86,10.57.205.87,10.57.205.84,10.57.205.82,10.57.205.85 --model resnet50 --data-dir /mnt/data --params reader.read_threads=4 dataset.num_files_train=70000 dataset.num_subfolders_train=200 checkpoint.checkpoint_folder=/mnt/data --client-host-memory-in-gb 256 --num-accelerators 200 --accelerator-type h100 --checkpoint-folder /mnt/data --allow-run-as-root` Error ``` 2025-06-25 20:59:16|STATUS: Running...

code

Trying to run llama-405b checkpointing with 32 hosts, each with 220G memory. As per the datasize command, the per host (8 ranks per host) memory needed is around 90G ```...

Running only the read part `mlpstorage checkpointing run --hosts 10.57.205.101,10.57.205.102 --model llama3-70b --client-host-memory-in-gb 220 --num-processes 8 --checkpoint-folder /mnt/host_checkpointing --results-dir checkpoint_test_4_hosts_llama3-70b --num-checkpoints-read 1 --num-checkpoints-write 0 --allow-run-as-root` Test seems to be succeeding,...

With 16 or 32 hosts, each with 256G memory, the test starts and each host has dlio_benchmark processes running, but the test itself makes no progress. 8b and 70b models run...

Assigning a different accelerator count per host, where the total number of accelerators is not divisible by the host count (e.g., 9 accelerators, 2 hosts, 5 & 4 accelerators on each...

code
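
The uneven split mentioned in the issue above (9 accelerators over 2 hosts, giving 5 and 4) can be illustrated with a short sketch. This is not how `mlpstorage` assigns ranks; the function name `split_accelerators` and the host names are hypothetical, and the only assumption is that the remainder is handed out one accelerator at a time to the first hosts in the list.

```python
def split_accelerators(total: int, hosts: list[str]) -> dict[str, int]:
    """Spread `total` accelerators over `hosts` as evenly as possible.

    Hypothetical helper for illustration only; not part of mlpstorage.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if total < 0:
        raise ValueError("total must be non-negative")

    # base: accelerators every host gets; extra: leftover to hand out one by one.
    base, extra = divmod(total, len(hosts))

    # The first `extra` hosts each take one additional accelerator.
    return {
        host: base + (1 if index < extra else 0)
        for index, host in enumerate(hosts)
    }


if __name__ == "__main__":
    # Placeholder host names, not taken from the issue.
    print(split_accelerators(9, ["host-a", "host-b"]))
    # → {'host-a': 5, 'host-b': 4}
```

The per-host counts always sum to the requested total, and no two hosts differ by more than one accelerator.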

Since we are not merging the fix into v2.0 but agreed to allow its use by anyone who needs it, we need to document it as an allowed change in...

rule

In the last meeting, we agreed to allow the fix for issue #157 (https://github.com/mlcommons/storage/issues/157) in closed category submissions. We need to add that to the submission rules as...

rule