training_policies
Performance measurement for short run times and DVFS
- A run time of more than 5 minutes is needed to get a good estimate of steady-state performance. Otherwise power draw can exceed TDP for short periods, so the measurement is not really fair.
- E.g., scaling across many nodes can produce short run times that stay in the “cold DVFS” regime (related to the temperature of the heat sink or cooling solution).
- The problem is that the at-scale run-time measurement is then no longer representative (assuming the goal is a steady-state measurement).
- Can we mitigate this through rules for how to run these systems?
- Do we all agree with this assumption?
Mitigation
- Testing protocol that requires a warm-up period, which exercises the hardware and burns power, followed by a measurement period.
- Require X warm-up runs to ensure that the total run time exceeds a minimum length.
- Or turn off dynamic clocking (DK believes this is a bad, bad idea!).
- Chip vendors would need to provide guidance about this protocol.
- Minimum run time will vary based on the cooling solution (e.g., liquid- vs. air-cooled).
- The biggest downside is increased complexity for the groups that perform the submission runs (and it’s already complex).
- It is complex how this would interact with the caching policy (which requires a cold start) in the case of “do X runs first”.
DK: Discussion with an expert confirms that a 5-minute warm-up period would work for an air-cooled system. Still need to find details for liquid-cooled systems.
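As a rough illustration of the "require X warm-up runs" idea, the sketch below computes how many back-to-back warm-up runs are needed before measurement starts. The function name `warmup_runs_needed` and its parameters are hypothetical. The 300-second default comes from the 5-minute figure discussed for air-cooled systems; the right value for liquid-cooled systems is still unknown and would have to be supplied once vendors provide guidance.

```python
import math


def warmup_runs_needed(run_time_s: float, min_warmup_s: float = 300.0) -> int:
    """Return the smallest number of back-to-back warm-up runs whose
    combined duration is at least min_warmup_s.

    run_time_s:   duration of a single benchmark run, in seconds.
    min_warmup_s: required warm-up time in seconds. The 300 s default is
                  an assumption based on the air-cooled figure; the value
                  for liquid-cooled systems is not yet known.
    """
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    if min_warmup_s < 0:
        raise ValueError("min_warmup_s must be non-negative")
    return math.ceil(min_warmup_s / run_time_s)


if __name__ == "__main__":
    # Hypothetical run times, not measured results.
    for run_time in (45.0, 100.0, 300.0):
        print(run_time, warmup_runs_needed(run_time))
```

This only covers the run-count arithmetic. It does not address how the warm-up runs interact with the cold-start requirement of the caching policy, which remains an open question.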
AIs:
- DK to talk to liquid cooling people.
- Vendors to talk to internal power management experts; please ask about liquid-cooled systems in particular.
One idea is to start power measurement at the 8- or 16-chip scale, roughly one "box". All benchmarks appear to run for more than 5 minutes at this scale. We can look at other options for larger scales in the future.
Moved to backlog since there is no power measurement in v0.7.