
Datasize metadata consistency with stderr

Open fincherjc opened this issue 2 months ago • 4 comments

Issue #208 reports that the metadata output files created by datasize are inconsistent with the stderr output produced when executing the command. I verified this: the values that feed the metadata output are never updated after the sizing calculation runs. This commit updates the values that feed the metadata output file with the values computed while executing datasize.

I'm attaching the scripts I ran to test the change, in the hope that they help with review:

view_test_results.sh verify_metadata.sh
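
For reviewers who want the shape of the change without opening the diff: the idea is simply to push the computed values back into the structure that gets serialized to metadata.json before it is written. The function and field names below are hypothetical stand-ins, not the actual patch; it is only a sketch of the pattern.

```python
import json
import sys
from dataclasses import dataclass, asdict

@dataclass
class DatasizeParams:
    # Hypothetical fields standing in for the real benchmark parameters.
    num_files_train: int = 168       # default carried over from the workload YAML
    num_subfolders_train: int = 0

def run_datasize(params: DatasizeParams, required_files: int) -> DatasizeParams:
    """Compute the required dataset size and push the result back into the
    object that later feeds metadata.json -- the write-back is the step the
    original code skipped, which left the metadata stale relative to stderr."""
    computed = max(params.num_files_train, required_files)
    print(f"datasize: dataset.num_files_train={computed}", file=sys.stderr)
    params.num_files_train = computed  # the fix: update before serializing
    return params

def write_metadata(params: DatasizeParams, path: str = "metadata.json") -> None:
    # Serialize the (now updated) parameters so the file agrees with stderr.
    with open(path, "w") as fh:
        json.dump(asdict(params), fh, indent=2)
```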

fincherjc avatar Oct 29 '25 18:10 fincherjc

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

github-actions[bot] avatar Oct 29 '25 18:10 github-actions[bot]

Hi @fincherjc, I also spotted this issue when running MLPerf Storage. After applying your patch on top of v2.0, I can see num_files_train in metadata.json, but the run command still uses 168 examples unless I override it with --param dataset.num_files_train=28000. Did I miss anything?

botbw avatar Nov 06 '25 09:11 botbw

Hello @botbw, I do not expect this change to affect the 'training run' command. As far as I know, the intended design of the benchmark is to specify the maximum size of the test when performing datasize and datagen, which size and seed the dataset. When performing 'training run' you may not always execute against the full dataset. For example, you might start by training with 1 accelerator, then scale up toward the maximum to see how far you can go before failure.

For this reason, I've always specified num_files_train manually when I run the benchmark (minding the minimums laid out in the rules section). If you do not specify num_files_train in your run command, I expect the benchmark to fall back to what is specified in the config file for the specific model/workload, as given in the respective workload YAML (see configs/workload for the specific benchmark configs). Checking those files, that appears to be 168, as you reported here.
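
To make that precedence explicit, here is a hedged sketch of the behavior I'm describing. The loader, the unet3d.yaml filename, and the exact key layout are assumptions for illustration, not the benchmark's actual code; only the --param dataset.num_files_train override and the configs/workload location come from this thread.

```python
from typing import Optional
import yaml  # PyYAML

def resolve_num_files_train(cli_override: Optional[int], workload_yaml: str) -> int:
    """A --param dataset.num_files_train override wins; otherwise fall back to
    the value baked into the workload config under configs/workload."""
    if cli_override is not None:
        return cli_override
    with open(workload_yaml) as fh:
        cfg = yaml.safe_load(fh)
    return cfg["dataset"]["num_files_train"]

# resolve_num_files_train(28000, "configs/workload/unet3d.yaml") -> 28000
# resolve_num_files_train(None, "configs/workload/unet3d.yaml")  -> 168 (the YAML default)
```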

There may be a valid improvement here: having 'training run' detect and execute on the full dataset, or autoscaling num_files based on the number of accelerators. That is a separate issue and a separate work effort from what is covered by this change.
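
If someone does pick up that idea, the scaling rule could be as simple as the sketch below. The helper name and constants are hypothetical placeholders; a real implementation would have to follow the sizing rules in the spec rather than these numbers.

```python
import math

def autoscale_num_files_train(num_accelerators: int,
                              batch_size: int,
                              samples_per_file: int,
                              min_steps_per_epoch: int = 500) -> int:
    """Illustration only: pick a file count large enough that every epoch
    gives each accelerator at least `min_steps_per_epoch` batches."""
    samples_needed = num_accelerators * batch_size * min_steps_per_epoch
    return math.ceil(samples_needed / samples_per_file)

# e.g. 8 accelerators, batch size 4, 1 sample per file -> 16000 files
```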

fincherjc avatar Nov 06 '25 13:11 fincherjc

There may be a valid improvement here: having 'training run' detect and execute on the full dataset, or autoscaling num_files based on the number of accelerators. That is a separate issue and a separate work effort from what is covered by this change.

@fincherjc I see. This now makes a lot more sense to me. Thanks for your great work!

botbw avatar Nov 07 '25 03:11 botbw