
Usability Improvements

wvaske opened this issue • 0 comments

After the 1.0 submission round, we found that the usability of the benchmark can be greatly improved. This issue will track the sub-issues we intend to address for the 2.0 release.

Please add any items in the comments and I will update this top-level comment. Feel free to attend the sub-working-group meeting (bi-weekly on Wednesday mornings, starting Nov 20th). Join the MLPerf Storage working group for the invite, or message me.

Tasks

Rules Document

  • [ ] Define filesystem caching rules in detail
  • [ ] Define the system.json schema and creation process
  • [ ] Define allowed time between runs
  • [ ] Define rules for using a local SSD to cache data
  • [ ] Define rules for hyperconverged and local cache

benchmark[.py | .sh] script

  • [ ] Unique names for files and directories, with a structure encoding benchmark, accelerator, count, run sequence, and run number
  • [ ] Better installer that manages dependencies
  • [ ] Containerization
    • [ ] Ease of Deployment of Benchmark (just get it working)
    • [ ] Cgroups and resource limits (better cache management)
  • [ ] Flush Cache before a run
  • [ ] Validate inputs for `--closed` runs (e.g., don't allow runs against datasets that are too small)
  • [ ] Reportgen should run validation against outputs
  • [ ] Add better system.json creation to automate the system description for consistency
    • [ ] Add json schema checker for system documents that submitters create
  • [ ] Automate execution of multiple runs
  • [ ] Add support for code changes in CLOSED submissions for supported categories (data loader, S3 connector, etc.)
    • [ ] Add patches directory that gets applied before execution
  • [ ] Add runtime estimation and --what-if or --dry-run flag
  • [ ] Automate selection of minimum required dataset
  • [ ] Determine if batch sizes in MLPerf Training are representative of batch sizes for realistically sized datasets
  • [ ] Split system.json into automatically capturable (clients) and manual (storage)
  • [ ] Define system.json schema and add schema checker to the tool for reportgen
  • [ ] Add a report-dir CSV of results from tests as they are run
  • [ ] Collect versions of all prerequisite packages for the storage benchmark and DLIO
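
As a sketch of what the system.json schema checker for reportgen could look like: the actual schema is still to be defined (see the Rules Document tasks above), so the field names and types below are purely hypothetical placeholders, and a real implementation would likely use a proper JSON Schema library rather than this minimal stdlib-only check.

```python
import json

# Hypothetical minimal schema: required fields and their expected types.
# The real system.json schema is still to be defined by the working group.
SYSTEM_SCHEMA = {
    "system_name": str,
    "num_client_nodes": int,
    "accelerator_model": str,
    "storage_description": str,
}

def validate_system_json(raw: str) -> list[str]:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    for field, expected_type in SYSTEM_SCHEMA.items():
        if field not in doc:
            errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(
                f"field {field!r} should be {expected_type.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    return errors
```

Reportgen could then refuse to package a submission whenever the returned list is non-empty, printing each violation so the submitter can fix their system description before submitting.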

DLIO Improvements

  • [ ] Reduce verbosity of logging
  • [ ] Add callback handler for custom monitoring
    • [ ] SPECStorage uses a "PRIME_MON_SCRIPT" environment variable whose script is executed at different points in a run
    • [ ] Checkpoint_bench uses RPC to trigger execution, which can be wrapped externally
  • [ ] Add support for DIRECTIO
  • [ ] Add a seed for dataset creation so that the distribution of file sizes is the same for all submitters (file 1 = mean + x bytes, file 2 = mean + y bytes, etc.)
  • [ ] Determine if global barrier for each batch matches industry behavior
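
The seeded-dataset item above could be sketched roughly as follows. This assumes a Gaussian spread around the mean file size, which may not match DLIO's actual size distribution; the point is only that a fixed seed makes file i's size identical for every submitter.

```python
import random

def seeded_file_sizes(num_files: int, mean_bytes: int, stdev_bytes: int,
                      seed: int = 42) -> list[int]:
    """Deterministically generate per-file sizes around a mean.

    With a fixed seed, every submitter gets the same size for file i
    (file 1 = mean + x bytes, file 2 = mean + y bytes, ...).
    The Gaussian spread here is an assumption for illustration.
    """
    rng = random.Random(seed)  # private generator; unaffected by global state
    return [
        max(1, round(rng.gauss(mean_bytes, stdev_bytes)))  # clamp to >= 1 byte
        for _ in range(num_files)
    ]
```

Using a dedicated `random.Random(seed)` instance (rather than the module-level generator) keeps the sequence reproducible even if other parts of the benchmark also draw random numbers.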

Results Presentation

  • [ ] Better linking and presentation of system diagrams (add working links to system diagrams in the supplementals)
  • [ ] Define presentation and rules for hyperconverged or systems with local cache

wvaske • Nov 04 '24 17:11