DeepSpeed
DeepSpeed copied to clipboard
Add throughput timer configuration
The new "timers" section describes configuration for different timers.
Specifically, in the "throughput" section, it is possible to disable the throughput timer (enabled by default). This allows to avoid the performance degradation whenever the throughput measurement is not needed, for example in production environment.
No device synchronize() is invoked when "synchronized" is set to False (default is True). This allows to produce approximate throughput measurements with minimal performance penalty.
We've discovered the following issues in the current implementation of the Throughput timer:
-
The timer invokes synchronize() twice on each step at start and stop.
-
Calling synchronize() ensures that all workloads running on the device complete. This makes sure that the measured time includes both the elapsed CPU and the device times.
-
However, being an expensive operation, synchronize() causes the performance drop of up to 20-25% on some accelerators (e.g. HPU). On the other hand, that avoiding the call to synchronize() still produces a meaningful performance estimate without performance drop.
-
Since there is no configuration flag to disable the timer, today a user pays consistent performance penalty, even in cases when the throughput measurement may be unnecessary (e.g. production).
The purpose of this PR is to add config options to control the above behavior.
Hi @loadams, all the requested changes are done. If you can please review and trigger the Ci. Thanks
@loadams @mrwyattii Can you help taking this PR further? @deepcharm handled all review comments.
@loadams @mrwyattii @lekurile can this change be merged?
@loadams @mrwyattii @lekurile can this change be merged?
Sorry for the delay @nelyahu - I will make sure this is merged shortly.