DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

[profiling]add show_straggler argument to log_summary()

Open delock opened this issue 2 years ago • 0 comments

In tensor parallel inference straggler effect is one of the factor that impacts scaling efficiency. Between any two allreduce of tensor parallel, one worker may run slower than other workers, this worker become the straggler for this duration. Different duration (between two allreduces) might have different straggler.

When do profiling, Straggler effect usually show as part of communication latency. This is because faster worker needs to wait for straggler during communication, this usually cause long communicatio latency. Part of communication latency is waiting for straggler to start, the rest is the time real communication takes.

It would helpful to quantify straggler effect in a workload, this PR extends communciation logging by add an optional argument to log_summary:

log_summary(show_straggler=True)

When show_straggler is set to True, communication logging would show a new table showing communciation latency and straggler effect. Communication latency is defined as the duration between the time when all ranks started communication and the communication finished. Straggler effect is define as the time between current rank started communication and all ranks started communication.

Below is an example:

Comm. Op            Message Size        Count               Total Latency(ms)   Avg Latency(ms)     tput_avg (Gbps)     busbw_avg (Gbps)
broadcast
                    4.94 KB             1                   0.22                0.22                0.18                0.18
all_reduce
                    32.0 KB             3472                1281.64             0.34                1.71                0.86
                    1.0 MB              112                 178.83              1.51                11.17               5.58
log_summary_barrier
                    0B                  1                   1.33                1.33                0.00                0.00
_______________________________
Breakdown with straggler effect
-------------------------------
Comm. Op            Message Size        Count               Total comm lat(ms)  Total straggler(ms) Avg comm lat(ms)    Avg straggler(ms)
broadcast
                    4.94 KB             1                   0.22                0.00                0.22                0.00
all_reduce
                    4.0 B               1                   0.10                0.00                0.10                0.00
                    32.0 KB             3472                966.38              315.26              0.24                0.08
                    1.0 MB              112                 178.52              0.32                1.51                0.00
log_summary_barrier
                    0B                  1                   0.12                1.21                0.12                1.21

The column 'Total straggler(ms)' show the straggler time. In this case, for 32KB allreduce, real communciation latency is 966.38ms and straggler effect is 315.26ms. Which shows significant straggler effect in the workload.

In this PR, straggler effect is defined as:

sum(latency)-sum(allreduce(latency, OP=MIN))

delock avatar May 19 '23 15:05 delock