DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

Add throughput timer configuration

Open deepcharm opened this issue 10 months ago • 5 comments

The new "timers" section describes configuration for different timers.

Specifically, in the "throughput" section, it is possible to disable the throughput timer (enabled by default). This allows to avoid the performance degradation whenever the throughput measurement is not needed, for example in production environment.

No device synchronize() is invoked when "synchronized" is set to False (default is True). This allows to produce approximate throughput measurements with minimal performance penalty.

deepcharm avatar Apr 04 '24 17:04 deepcharm

We've discovered the following issues in the current implementation of the Throughput timer:

  • The timer invokes synchronize() twice on each step at start and stop.

  • Calling synchronize() ensures that all workloads running on the device complete. This makes sure that the measured time includes both the elapsed CPU and the device times.

  • However, being an expensive operation, synchronize() causes the performance drop of up to 20-25% on some accelerators (e.g. HPU). On the other hand, that avoiding the call to synchronize() still produces a meaningful performance estimate without performance drop.

  • Since there is no configuration flag to disable the timer, today a user pays consistent performance penalty, even in cases when the throughput measurement may be unnecessary (e.g. production).

The purpose of this PR is to add config options to control the above behavior.

deepcharm avatar Apr 04 '24 17:04 deepcharm

Hi @loadams, all the requested changes are done. If you can please review and trigger the Ci. Thanks

deepcharm avatar Apr 25 '24 12:04 deepcharm

@loadams @mrwyattii Can you help taking this PR further? @deepcharm handled all review comments.

nelyahu avatar May 07 '24 06:05 nelyahu

@loadams @mrwyattii @lekurile can this change be merged?

nelyahu avatar May 12 '24 18:05 nelyahu

@loadams @mrwyattii @lekurile can this change be merged?

Sorry for the delay @nelyahu - I will make sure this is merged shortly.

loadams avatar May 13 '24 16:05 loadams