`OperationProfiler` and `PerseusOptimizer` server and client

Open jaywonchung opened this issue 2 years ago • 2 comments

Perseus is an energy scheduler for large model training (although we're looking into applying this for large model inference, too).

To schedule energy with lowtime, Perseus requires time and energy profiling results for each forward and backward computation in each pipeline stage. That's what `OperationProfiler` will do.
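To make the profiling requirement concrete, here is a minimal sketch of what collecting per-stage measurements could look like. Everything in it is an assumption for illustration: `OperationProfilerSketch`, `Measurement`, and the `read_energy_j` callback are hypothetical names, not the eventual `OperationProfiler` API. The energy counter is injected so that the sketch does not depend on any particular GPU library.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Measurement:
    time_s: float    # wall-clock duration of one operation
    energy_j: float  # energy consumed during that operation


class OperationProfilerSketch:
    """Hypothetical profiler: records time and energy per (stage, operation)."""

    def __init__(
        self,
        read_energy_j: Callable[[], float],  # assumed cumulative energy counter
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._read_energy_j = read_energy_j
        self._clock = clock
        self._records: Dict[Tuple[int, str], List[Measurement]] = defaultdict(list)

    def profile(self, stage: int, op: str, fn: Callable[[], object]) -> object:
        """Run `fn` once and record how long it took and how much energy it used."""
        t0, e0 = self._clock(), self._read_energy_j()
        result = fn()
        t1, e1 = self._clock(), self._read_energy_j()
        self._records[(stage, op)].append(Measurement(t1 - t0, e1 - e0))
        return result

    def summary(self) -> Dict[Tuple[int, str], Measurement]:
        """Average the recorded measurements for each (stage, operation) pair."""
        return {
            key: Measurement(
                mean(m.time_s for m in ms),
                mean(m.energy_j for m in ms),
            )
            for key, ms in self._records.items()
        }
```

A training loop would wrap each stage's forward and backward call in `profile(stage, "forward", ...)` or `profile(stage, "backward", ...)` and hand `summary()` to the scheduler. On a real GPU the operation would also need to be synchronized before the second reading, which this sketch leaves out.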

The `PerseusOptimizer` server will, for now, receive a Python file that lists GPU frequencies (produced by lowtime) and instruct the `PerseusOptimizer` client (integrated into the user's training framework) to change GPU frequencies accordingly. Splitting server and client keeps Perseus agnostic to the training framework. Without the split, energy scheduling (the "policy", which requires a holistic view of all computations that happen across all ranks) and the method of realizing the energy schedule in a distributed fashion (the "mechanism") would end up coupled.
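The policy/mechanism split can be sketched as follows. This is an in-process illustration under stated assumptions, not the actual design: the class names, the schedule layout (rank mapped to an ordered list of operation and frequency pairs), and the `set_frequency_mhz` callback are all hypothetical, and the real server and client would talk over the network rather than through a direct method call.

```python
from typing import Callable, Dict, List, Tuple

# Assumed schedule layout: rank -> ordered (operation name, GPU frequency in MHz).
Schedule = Dict[int, List[Tuple[str, int]]]


class ScheduleServerSketch:
    """Policy side (hypothetical): holds the schedule covering all ranks."""

    def __init__(self, schedule: Schedule) -> None:
        self._schedule = schedule

    def get_plan(self, rank: int) -> List[Tuple[str, int]]:
        # Hand each rank only its own slice of the global schedule.
        return list(self._schedule[rank])


class FrequencyClientSketch:
    """Mechanism side (hypothetical): lives in the training framework."""

    def __init__(
        self,
        rank: int,
        server: ScheduleServerSketch,
        set_frequency_mhz: Callable[[int], None],  # e.g. a wrapper around NVML
    ) -> None:
        self._plan = iter(server.get_plan(rank))
        self._set_frequency_mhz = set_frequency_mhz

    def on_instruction_begin(self, op: str) -> None:
        """Apply the next planned frequency just before `op` runs."""
        expected_op, freq_mhz = next(self._plan)
        if expected_op != op:
            raise ValueError(f"expected {expected_op!r}, got {op!r}")
        self._set_frequency_mhz(freq_mhz)
```

The training framework only needs to call `on_instruction_begin` before each forward or backward computation. It never sees the global schedule, which is what keeps the scheduling logic independent of the framework.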

jaywonchung, Oct 08 '23 18:10

`PerseusOptimizer` has been implemented as `PipelineFrequencyOptimizer`, but `OperationProfiler` hasn't been implemented yet.

jaywonchung, Aug 12 '24 00:08