llm.c
Async optimizer state and model checkpointing
Adds a feature to checkpoint optimizer state and model parameters using a non-blocking background thread. Device buffers are copied to a pinned host buffer in one shot, and the background thread then handles the I/O operations.
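The pattern can be sketched in plain C roughly as follows. This is a minimal illustration, not the code in this PR: `AsyncCheckpoint`, `checkpoint_begin`, `checkpoint_wait`, and `checkpoint_writer` are hypothetical names, and `malloc`/`memcpy` stand in for the pinned allocation and the device-to-host copy (in a CUDA build these would presumably be `cudaMallocHost` and `cudaMemcpy`).

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical job descriptor: one staged snapshot plus its destination path.
typedef struct {
    pthread_t thread;
    char path[256];
    void *host_buf;  // staging buffer; assumed to be pinned memory in a CUDA build
    size_t nbytes;
    int status;      // 0 on success, -1 on I/O failure; read it only after checkpoint_wait
} AsyncCheckpoint;

// Background thread body: writes the staged snapshot to disk.
static void *checkpoint_writer(void *arg) {
    AsyncCheckpoint *ck = (AsyncCheckpoint *)arg;
    ck->status = -1;
    FILE *f = fopen(ck->path, "wb");
    if (f != NULL) {
        size_t written = fwrite(ck->host_buf, 1, ck->nbytes, f);
        int closed = fclose(f);
        if (written == ck->nbytes && closed == 0) ck->status = 0;
    }
    return NULL;
}

// Snapshot `src` in a single copy, then start the writer thread.
// Returns 0 if the thread was started, -1 otherwise.
int checkpoint_begin(AsyncCheckpoint *ck, const char *path,
                     const void *src, size_t nbytes) {
    assert(ck != NULL && path != NULL && src != NULL);
    if (strlen(path) >= sizeof(ck->path)) return -1;
    strcpy(ck->path, path);
    ck->nbytes = nbytes;
    ck->host_buf = malloc(nbytes > 0 ? nbytes : 1);  // stand-in for a pinned allocation
    if (ck->host_buf == NULL) return -1;
    memcpy(ck->host_buf, src, nbytes);  // stand-in for the one-shot device-to-host copy
    if (pthread_create(&ck->thread, NULL, checkpoint_writer, ck) != 0) {
        free(ck->host_buf);
        ck->host_buf = NULL;
        return -1;
    }
    return 0;  // caller can resume training; `src` may be modified from here on
}

// Block until the write has finished, release the staging buffer,
// and return the writer's status.
int checkpoint_wait(AsyncCheckpoint *ck) {
    pthread_join(ck->thread, NULL);
    free(ck->host_buf);
    ck->host_buf = NULL;
    return ck->status;
}
```

A training loop would call `checkpoint_begin` at a checkpoint step, keep training, and call `checkpoint_wait` before starting the next checkpoint or exiting. Because the snapshot is taken before the thread starts, later updates to the parameters do not affect the file being written.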
On my 8xA100 setup, checkpoint latency drops from roughly 5.4 sec to 2.3 sec, a bit more than a 2x improvement. The savings should be larger still for bigger model sizes.