# Usability Improvements
After the 1.0 submission we found that the benchmark's usability could be greatly improved. This issue tracks the sub-issues we intend to address for the 2.0 release.
Please add any items in the comments and I will update this top-level comment. Feel free to attend the sub-working-group meeting (bi-weekly on Wednesday mornings, starting Nov 20th). Join the MLPerf Storage working group for the invite, or message me.
## Tasks

### Rules Document
- [ ] Define filesystem caching rules in detail
- [ ] Define the system.json schema and its creation process
- [ ] Define the allowed time between runs
- [ ] Define rules for using local SSDs to cache data
- [ ] Define rules for hyperconverged systems and systems with local caches
### benchmark[.py | .sh] script
- [ ] Unique names for files and directories, structured by benchmark, accelerator type, accelerator count, run sequence, and run number
- [ ] Better installer that manages dependencies
- [ ] Containerization
  - [ ] Ease of deployment of the benchmark (just get it working)
  - [ ] Cgroups and resource limits for better cache management (see the cache-management sketch after this list)
- [ ] Flush the cache before a run (also covered in the cache-management sketch after this list)
- [ ] Validate inputs for `--closed` runs (e.g., reject runs against datasets that are too small)
- [ ] Reportgen should run validation against outputs
- [ ] Add better system.json creation to automate the system description for consistency
  - [ ] Add a JSON-schema checker for the system documents that submitters create (see the validation sketch after this list)
- [ ] Automate execution of multiple runs
- [ ] Add support for code changes in closed runs, limited to supported categories (data loader, S3 connector, etc.)
  - [ ] Add a patches directory that gets applied before execution (see the patch sketch after this list)
- [ ] Add runtime estimation behind a `--what-if` or `--dry-run` flag (see the estimation sketch after this list)
- [ ] Automate selection of the minimum required dataset size
- [ ] Determine whether the batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets
- [ ] Split system.json into automatically capturable fields (clients) and manually supplied fields (storage)
- [ ] Define the system.json schema and add a schema checker to reportgen (see the validation sketch after this list)
- [ ] Write a CSV of results to the report directory as tests run
- [ ] Collect the versions of all prerequisite packages for storage and DLIO
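
For the cache-management items above, a minimal sketch, assuming a Linux host with systemd; the `64G` cap and the command list are placeholders:

```python
import subprocess

def drop_page_cache() -> None:
    """Flush dirty pages, then drop the Linux page cache (requires root)."""
    subprocess.run(["sync"], check=True)
    # Equivalent to: echo 3 > /proc/sys/vm/drop_caches
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def run_with_memory_cap(cmd: list[str], mem_max: str = "64G") -> None:
    """Run cmd in a transient systemd cgroup scope with a hard memory cap,
    bounding how much page cache the benchmark process can accumulate."""
    subprocess.run(
        ["systemd-run", "--scope", "-p", f"MemoryMax={mem_max}", *cmd],
        check=True,
    )
```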
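
For the schema-checker items, a sketch using the `jsonschema` package (`pip install jsonschema`); the draft version and file paths are assumptions:

```python
import json

from jsonschema import Draft202012Validator

def check_system_json(doc_path: str, schema_path: str) -> list[str]:
    """Return human-readable validation errors; an empty list means valid."""
    with open(schema_path) as f:
        schema = json.load(f)
    with open(doc_path) as f:
        doc = json.load(f)
    return [
        f"{'/'.join(str(p) for p in err.path)}: {err.message}"
        for err in Draft202012Validator(schema).iter_errors(doc)
    ]
```

Reportgen could then refuse to package a submission whose system.json yields a non-empty error list.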
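
For the patches directory, a sketch; the directory name `patches` comes from the item above, while `git apply` as the mechanism is an assumption:

```python
import pathlib
import subprocess

def apply_patches(repo_dir: str, patch_dir: str = "patches") -> None:
    """Apply every *.patch under patch_dir, in sorted order, before a run."""
    for patch in sorted(pathlib.Path(patch_dir).resolve().glob("*.patch")):
        subprocess.run(["git", "apply", str(patch)], cwd=repo_dir, check=True)
```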
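
For the `--dry-run` estimate, a back-of-the-envelope sketch; the expected-throughput parameter is something the submitter would supply, not a value the tool can know:

```python
def estimate_runtime_s(num_files: int, mean_file_size_b: float,
                       epochs: int, expected_gib_per_s: float) -> float:
    """Rough wall-clock estimate: bytes to be read divided by expected throughput."""
    total_bytes = num_files * mean_file_size_b * epochs
    return total_bytes / (expected_gib_per_s * 2**30)

# Example: 400,000 files of 140 MB over 5 epochs at 20 GiB/s -> ~13,000 s (~3.6 h).
```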
### DLIO Improvements
- [ ] Reduce verbosity of logging
- [ ] Add a callback handler for custom monitoring (see the hook sketch after this list)
  - [ ] SPECstorage uses a `PRIME_MON_SCRIPT` environment variable whose script is executed at several points during a run
  - [ ] Checkpoint_bench uses RPC to trigger execution, which can be wrapped externally
- [ ] Add support for DIRECTIO (see the `O_DIRECT` sketch after this list)
- [ ] Add a seed for dataset creation so that the distribution of file sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc.; see the seeding sketch after this list)
- [ ] Determine whether a global barrier on each batch matches industry behavior
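
For the monitoring-callback item, a minimal sketch of a PRIME_MON_SCRIPT-style hook; the variable name `MLPS_MON_SCRIPT` and the phase strings are hypothetical:

```python
import os
import subprocess

def run_monitor_hook(phase: str) -> None:
    """Invoke an external monitoring script, if configured, at phase boundaries.

    MLPS_MON_SCRIPT is a hypothetical analogue of SPECstorage's
    PRIME_MON_SCRIPT; the script receives the phase name as its argument.
    """
    script = os.environ.get("MLPS_MON_SCRIPT")
    if script:
        subprocess.run([script, phase], check=False)

# e.g. run_monitor_hook("start") before the first batch,
# run_monitor_hook("stop") after the last one.
```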
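
For the DIRECTIO item, a sketch of a page-cache-bypassing read on Linux; the 4 KiB block size is an assumption and must match the filesystem's alignment requirements:

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; O_DIRECT needs aligned buffer, offset, and length

def read_direct(path: str, nbytes: int = BLOCK) -> bytes:
    """Read nbytes (a multiple of BLOCK) from path, bypassing the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, nbytes)  # anonymous mmap buffers are page-aligned
    try:
        with os.fdopen(fd, "rb", buffering=0) as f:  # takes ownership of fd
            f.readinto(buf)
        return bytes(buf)
    finally:
        buf.close()
```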
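
For the seeding item, a sketch; the normal distribution and the default seed are assumptions, but the point is that any fixed seed makes file N's size identical across submitters:

```python
import numpy as np

def generate_file_sizes(num_files: int, mean_b: float, stdev_b: float,
                        seed: int = 1337) -> np.ndarray:
    """Deterministic per-file sizes: file 1 = mean + x bytes, file 2 = mean + y
    bytes, etc., reproducibly for every submitter using the same seed."""
    rng = np.random.default_rng(seed)
    sizes = rng.normal(mean_b, stdev_b, num_files)
    return np.clip(sizes, 1, None).astype(np.int64)
```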
### Results Presentation
- [ ] Better linking and presentation of system diagrams (add working links to system diagrams in the supplementals)
- [ ] Define presentation and rules for hyperconverged systems and systems with local caches