Benchmark.py script for v2.0
The benchmark script is being migrated from bash to Python for better integration with the results-checking scripts.
- Update to latest version of DLIO
- Started updating rules document
- Separate config locations for training / checkpoint / vectordb
@wvaske
Since DLIO now publishes to https://pypi.org/project/dlio-benchmark/2.0.0/, we can completely remove the submodule from this repo; it is no longer needed.
Instead of
pip3 install -r dlio_benchmark/requirements.txt
just do
pip3 install dlio-benchmark==2.0.0
WDYT? Much cleaner and more Pythonic.
@zhenghh04, is the version of DLIO on PyPI up to date with your changes? If not, can you rev the version to 2.1 or 2.0.1 and push a new release?
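If it helps while verifying, here is a minimal sketch for checking which dlio-benchmark version is actually installed (it only assumes the distribution name on PyPI is dlio-benchmark, per the link above):

```python
# Sketch: confirm the installed dlio-benchmark distribution matches the pinned version.
from importlib.metadata import version

installed = version("dlio-benchmark")  # e.g. "2.0.0"
assert installed == "2.0.0", f"expected dlio-benchmark 2.0.0, found {installed}"
print(f"dlio-benchmark {installed} is installed")
```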
@FileSystemGuy @wvaske who can merge my #84? I can't even comment on it... sorry for polluting your PR...
Boris,
I don't see any problems from a "product owner" perspective; your change is "approved" in that sense. But I don't know whether there are technical comments that should be addressed before we merge. That's up to Johnu, Huihuo, and Wes.
Thanks,
Curtis
The attributes "per_host_mem_kB" and "total_mem_kB" are in kilo-bytes but the CLI args are in GB and the raw memory capacity pulled from the nodes info is in B. Would standardizing all memory capacity variables on GB (or MB) be less risky for confusion?
My preference is to store information in its "natural" form with the unit in the variable name, so we can do math when we need to. But I'm open to changing it.
EDIT: I changed my mind and went with bytes internally.
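For illustration only, a minimal sketch of the kind of conversion this implies (the helper names and the /proc/meminfo handling are assumptions, not the actual benchmark code):

```python
# Sketch: keep all memory quantities in bytes internally, convert at the edges.
KIB = 1024          # /proc/meminfo reports values labeled "kB" (KiB)
GIB = 1024 ** 3     # CLI arguments are given in GB (treated here as GiB)

def gb_arg_to_bytes(gb: float) -> int:
    """Convert a memory CLI argument given in GB to bytes."""
    return int(gb * GIB)

def meminfo_kb_to_bytes(mem_kb: int) -> int:
    """Convert a MemTotal value from /proc/meminfo (in kB) to bytes."""
    return mem_kb * KIB

# Example: 512 GB per host expressed in bytes
per_host_mem_bytes = gb_arg_to_bytes(512)
```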
As this PR appears to be updating the rules for v2.0: there was a recent discussion in the checkpointing subgroup about model sizes. The table below shows the memory requirement per node when using 32 nodes, and it shows that the 1T parameter model far exceeds our available hardware (above 512 GB/node). We would not be able to run the 1T model, so we recommend excluding it from the v2.0 rules unless the calculations in this table (model size / number of nodes) are incorrect.
While the 1T model may be of interest to customers, it is impractical for us to run. For this reason, we would prefer not to include the 1T parameter model in the v2.0 submission guidelines (checkpointing).
The 405B model may already be at the limit of what we plan to run, as it would likely require 40 nodes instead of 32.
| Model      | Size (GB) | # Nodes | Memory/node (GB) |
|------------|-----------|---------|------------------|
| 8B (Zero3) | 88        | 1       | 88               |
| 70B        | 1100      | 32      | 34.375           |
| 405B       | 6000      | 32      | 187.5            |
| 1T         | 17000     | 32      | 531.25           |
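As a cross-check of the arithmetic above (simple division of model size by node count):

```python
# Cross-check of the per-node memory figures in the table: size_gb / num_nodes.
models = {"8B (Zero3)": (88, 1), "70B": (1100, 32), "405B": (6000, 32), "1T": (17000, 32)}

for name, (size_gb, nodes) in models.items():
    print(f"{name}: {size_gb / nodes:.3f} GB per node")
# The 1T row gives 531.25 GB/node, which exceeds 512 GB/node hosts.
```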
In the previous release Huawei ran with 51 nodes and Nutanix ran with up to 66 nodes. So it's not unreasonable to expect that some submitters will have the systems available to run at the 1T level.
And with the announcement of Llama 4 Behemoth clocking in at 2T parameters, the 1T might be 'small' for the benchmark.
Each of the sizes is optional. If a submitter wants to run only the 70B model, that is perfectly acceptable. The 1T parameter model is difficult to represent.
I created two DLIO patches to reduce memory usage during checkpointing. They haven’t been merged yet, but they would be extremely useful for us to have.
The 405B benchmark will be difficult for DDN to submit without PR #281. I believe this patch is straightforward and non-controversial. If it cannot be merged, I would vote to adjust the 405B configuration to use TP=16, PP=18, DP=2 instead of the current TP=8, PP=18, DP=4 (both total 576 GPUs). This configuration uses the same number of GPUs but should reduce memory usage by lowering DP. However, I'm not sure this alone would suffice, which is why the patch is critical.
We hope to submit the 1T benchmark, and potentially many other submitters could as well, if PR #282 is accepted. It involves deeper changes, but I added metrics showing equivalent performance for DDN in single-node scenarios. The open question may be the impact of optimizer chunking on performance, but I haven't observed any effect so far. In my measurements, this patch reduced RAM usage by approximately 67%.
Again, for 1T we would like 16 (TP) x 32 (PP) x 2 (DP) = 1024 instead of 8 (TP) x 32 (PP) x 4 (DP) = 1024. Even without the patch, this should slightly decrease memory usage in ZeRO-1 mode.
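For reference, the GPU counts for the current and proposed configurations are just the TP x PP x DP products:

```python
# GPU-count check for the configurations discussed above (TP * PP * DP).
configs = {
    "405B current":  (8, 18, 4),
    "405B proposed": (16, 18, 2),
    "1T current":    (8, 32, 4),
    "1T proposed":   (16, 32, 2),
}
for name, (tp, pp, dp) in configs.items():
    print(f"{name}: {tp} x {pp} x {dp} = {tp * pp * dp} GPUs")
# Both 405B variants total 576 GPUs; both 1T variants total 1024 GPUs.
```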
Checkpointing works.
Check out the history function "mlpstorage history show"
Added a report generator. "mlpstorage reports reportgen"
Please test and provide feedback. The generated report includes a lot of extra information that shows how the test was run with respect to input files, args, and params, and how they combine.
I recommend pulling the CSV into Excel via Power Query so it's a data connection to a table, then creating a pivot table from that table for analysis.
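If you'd rather stay in Python, a rough equivalent of the Excel pivot looks like the sketch below (the file name and column names are placeholders, not the actual reportgen schema):

```python
# Sketch: load the reportgen CSV and pivot it, as an alternative to Excel.
# The file name and column names below are illustrative assumptions.
import pandas as pd

df = pd.read_csv("mlpstorage_report.csv")
pivot = df.pivot_table(index="benchmark", columns="num_hosts",
                       values="throughput", aggfunc="mean")
print(pivot)
```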
I currently capture CPU and memory information with passwordless SSH. If that doesn't work for you, please let me know so I can switch to a different methodology.
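For anyone checking whether that collection will work in their environment, the capture amounts to something like this (a sketch that assumes passwordless SSH to each host; not the exact code in the tool):

```python
# Sketch: gather CPU and memory info from a host over passwordless SSH.
import subprocess

def collect_host_info(host: str) -> dict:
    """Return CPU count and MemTotal (kB) for a host reachable without a password."""
    nproc = subprocess.run(["ssh", host, "nproc"],
                           capture_output=True, text=True, check=True).stdout.strip()
    meminfo = subprocess.run(["ssh", host, "grep", "MemTotal", "/proc/meminfo"],
                             capture_output=True, text=True, check=True).stdout
    mem_kb = int(meminfo.split()[1])  # e.g. "MemTotal:  527987908 kB"
    return {"host": host, "cpus": int(nproc), "total_mem_kB": mem_kb}
```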