llm.c
Overlap gradient computation and NCCL AllReduce
On my setup, I get the following:
Before:
step 2/37: train loss 4.720275 (acc 4.688650) (224.046844 ms, 36563.773438 tok/s)
step 3/37: train loss 3.802741 (acc 3.943135) (224.151611 ms, 36555.007812 tok/s)
step 4/37: train loss 3.698719 (acc 3.800745) (227.287033 ms, 36375.347656 tok/s)
step 5/37: train loss 3.444999 (acc 3.528596) (227.886978 ms, 36260.062500 tok/s)
After:
step 2/37: train loss 4.715888 (acc 4.686493) (199.011169 ms, 41163.503906 tok/s)
step 3/37: train loss 3.798963 (acc 3.942383) (193.084412 ms, 41811.468750 tok/s)
step 4/37: train loss 3.697987 (acc 3.800879) (193.079300 ms, 42027.660156 tok/s)
step 5/37: train loss 3.444056 (acc 3.526504) (193.470459 ms, 42112.496094 tok/s)
So, a speedup of roughly 14%: step time drops from about 226 ms to about 195 ms over the steps shown.
Nsight Systems profiles (before and after): [profile screenshots not preserved]