llm.c

Overlap gradient computation and NCCL AllReduce

Open PeterZhizhin opened this issue 9 months ago • 0 comments

On my setup, I get the following:

Before:

step    2/37: train loss 4.720275 (acc 4.688650) (224.046844 ms, 36563.773438 tok/s)
step    3/37: train loss 3.802741 (acc 3.943135) (224.151611 ms, 36555.007812 tok/s)
step    4/37: train loss 3.698719 (acc 3.800745) (227.287033 ms, 36375.347656 tok/s)
step    5/37: train loss 3.444999 (acc 3.528596) (227.886978 ms, 36260.062500 tok/s)

After:

step    2/37: train loss 4.715888 (acc 4.686493) (199.011169 ms, 41163.503906 tok/s)
step    3/37: train loss 3.798963 (acc 3.942383) (193.084412 ms, 41811.468750 tok/s)
step    4/37: train loss 3.697987 (acc 3.800879) (193.079300 ms, 42027.660156 tok/s)
step    5/37: train loss 3.444056 (acc 3.526504) (193.470459 ms, 42112.496094 tok/s)

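So, a speedup of at least 12%: step time drops from roughly 224-228 ms to roughly 193-199 ms, and throughput rises from about 36.5k to about 42k tok/s.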
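To check the speedup figure against the logs above, here is a quick calculation over steps 2-5. It is only a sketch in Python: the numbers are copied from the log lines, and the helper names (`mean`, `pct_gain`) are mine, not part of llm.c.

```python
# Step time (ms) and throughput (tok/s) copied from the logs above, steps 2-5.
before_ms = [224.046844, 224.151611, 227.287033, 227.886978]
after_ms = [199.011169, 193.084412, 193.079300, 193.470459]
before_tps = [36563.773438, 36555.007812, 36375.347656, 36260.062500]
after_tps = [41163.503906, 41811.468750, 42027.660156, 42112.496094]


def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def pct_gain(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Reduction in mean step time, in percent.
time_saved = (1.0 - mean(after_ms) / mean(before_ms)) * 100.0
# Increase in mean reported throughput, in percent.
tps_gain = pct_gain(mean(before_tps), mean(after_tps))
# Throughput increase at step 2 alone.
step2_gain = pct_gain(before_tps[0], after_tps[0])

print(f"mean step time: {mean(before_ms):.1f} ms -> {mean(after_ms):.1f} ms "
      f"({time_saved:.1f}% less)")
print(f"mean throughput gain: {tps_gain:.1f}%")
print(f"step 2 throughput gain: {step2_gain:.1f}%")
```

The quoted 12% is closest to the step 2 throughput gain. Averaged over the four logged steps, the gain comes out a little higher.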

NSight Systems profiles:

Before: NSight Systems profile before the change: backward and then NCCL run on the same stream

After: NSight Systems profile after the change: backward and NCCL are overlapped
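The difference between the two profiles can be illustrated with a toy timeline model. This is not the code from this change. It is a sketch under stated assumptions: that gradients can be reduced chunk by chunk (for example per layer), that the AllReduce runs on its own stream without slowing the backward kernels, and that the per-chunk timings below are hypothetical rather than measured. The function names are mine.

```python
def serialized_ms(backward_ms, allreduce_ms):
    """One stream: every AllReduce queues behind all other work, so times add up."""
    return sum(backward_ms) + sum(allreduce_ms)


def overlapped_ms(backward_ms, allreduce_ms):
    """Backward on a compute stream, AllReduce on a separate communication stream.

    The AllReduce for a chunk starts once that chunk's gradients are ready
    and the previous AllReduce has finished.
    """
    compute_done = 0.0  # time at which the current chunk's backward finishes
    comm_done = 0.0     # time at which the latest AllReduce finishes
    for b, c in zip(backward_ms, allreduce_ms):
        compute_done += b
        comm_done = max(comm_done, compute_done) + c
    return comm_done


# Hypothetical per-chunk timings in ms (assumed, not taken from the profiles).
backward = [10.0, 10.0, 10.0, 10.0]
allreduce = [4.0, 4.0, 4.0, 4.0]

print("serialized:", serialized_ms(backward, allreduce), "ms")
print("overlapped:", overlapped_ms(backward, allreduce), "ms")
```

In this model only the last chunk's AllReduce stays exposed when communication is cheaper than compute. When communication dominates, the gain shrinks to roughly the first chunk's backward time.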

PeterZhizhin, May 05 '24 15:05