llm.c
gpt2_forward: add CUDA streams with events for async layered operations, plus cache prefetching for efficient data access with high temporal locality
In the forward pass in gpt2_train.cu:
- add CUDA streams with events for asynchronous, layered operations
- add offset precalculation and cache prefetching for efficient data access with high temporal locality
Changes

CUDA streams and events
- four independent CUDA streams: input copy, target copy, compute, and loss
- non-blocking streams overlap data transfers with computation and loss calculation
- stream priorities minimize interference between streams (a sketch of the stream/event setup follows this list)
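A minimal sketch of what the stream/event setup looks like, assuming four non-blocking streams and timing-disabled events; the stream names and the exact priority assignments are illustrative, not the PR's exact code:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaStream_t input_stream, target_stream, compute_stream, loss_stream;
    cudaEvent_t inputs_ready, targets_ready;

    // Query the legal priority range (lower number = higher priority).
    int least_priority, greatest_priority;
    cudaDeviceGetStreamPriorityRange(&least_priority, &greatest_priority);

    // Non-blocking streams do not implicitly synchronize with the default stream.
    cudaStreamCreateWithPriority(&compute_stream, cudaStreamNonBlocking, greatest_priority);
    cudaStreamCreateWithPriority(&input_stream,  cudaStreamNonBlocking, least_priority);
    cudaStreamCreateWithPriority(&target_stream, cudaStreamNonBlocking, least_priority);
    cudaStreamCreateWithPriority(&loss_stream,   cudaStreamNonBlocking, least_priority);

    // Events with timing disabled are cheaper and sufficient for ordering work.
    cudaEventCreateWithFlags(&inputs_ready,  cudaEventDisableTiming);
    cudaEventCreateWithFlags(&targets_ready, cudaEventDisableTiming);

    printf("stream priority range: %d (least) .. %d (greatest)\n",
           least_priority, greatest_priority);

    cudaEventDestroy(inputs_ready);
    cudaEventDestroy(targets_ready);
    cudaStreamDestroy(input_stream);
    cudaStreamDestroy(target_stream);
    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(loss_stream);
    return 0;
}
```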
Cache prefetching
- offsets are prefetched into cache for more efficient CPU-GPU data handling
- high temporal locality hints keep hot data resident in cache (see the sketch below)
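A minimal sketch of the offset precalculation and prefetching idea, using hypothetical names (NUM_LAYERS, layer_param_size) rather than the PR's actual variables; __builtin_prefetch with locality 3 requests high temporal locality:

```c
#include <stddef.h>

#define NUM_LAYERS 12

void forward_layers(const float *params, size_t layer_param_size) {
    // Offset precalculation: compute all per-layer offsets up front
    // instead of recomputing them inside the hot loop.
    size_t offsets[NUM_LAYERS];
    for (int l = 0; l < NUM_LAYERS; l++) {
        offsets[l] = (size_t)l * layer_param_size;
    }

    for (int l = 0; l < NUM_LAYERS; l++) {
        // Prefetch the next layer's offset and parameter base while the
        // current layer's work is being issued; the third argument (3)
        // hints high temporal locality, i.e. keep the line in cache.
        if (l + 1 < NUM_LAYERS) {
            __builtin_prefetch(&offsets[l + 1], 0 /* read */, 3 /* high locality */);
            __builtin_prefetch(params + offsets[l + 1], 0, 3);
        }
        // ... launch this layer's kernels using params + offsets[l] ...
        (void)(params + offsets[l]);
    }
}
```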
The goal is to improve performance by reducing time spent waiting on memory transfers (via asynchronous operations) and by optimizing data access patterns to lower cache miss rates. Overlapping data transfers with computation and loss calculation, combined with high temporal locality hints, should improve cache efficiency and execution speed.
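A minimal sketch of the overlap pattern described above, with hypothetical buffer and stream names (the PR's actual forward pass differs): copies run on their own streams, events mark completion, and downstream streams wait on those events instead of the whole device synchronizing.

```c
#include <cuda_runtime.h>

void forward_step(int *d_inputs, const int *h_inputs,
                  int *d_targets, const int *h_targets,
                  size_t n_bytes,
                  cudaStream_t input_stream, cudaStream_t target_stream,
                  cudaStream_t compute_stream, cudaStream_t loss_stream,
                  cudaEvent_t inputs_ready, cudaEvent_t targets_ready) {
    // Host-to-device copies run asynchronously on their own streams
    // (the host buffers are assumed to be pinned, e.g. via cudaMallocHost).
    cudaMemcpyAsync(d_inputs, h_inputs, n_bytes, cudaMemcpyHostToDevice, input_stream);
    cudaEventRecord(inputs_ready, input_stream);

    cudaMemcpyAsync(d_targets, h_targets, n_bytes, cudaMemcpyHostToDevice, target_stream);
    cudaEventRecord(targets_ready, target_stream);

    // The compute stream only needs the inputs, so it can start as soon
    // as the input copy finishes, overlapping with the target copy.
    cudaStreamWaitEvent(compute_stream, inputs_ready, 0);
    // ... launch the transformer layer kernels on compute_stream ...

    // The loss calculation needs the targets (and, in the real code, an
    // event recorded after the last compute kernel) before it can run.
    cudaStreamWaitEvent(loss_stream, targets_ready, 0);
    // ... launch the loss kernels on loss_stream ...
}
```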
The impact may be less noticeable for small models.
The modified sections are heavily commented, both to document the changes and to educate.
Testing
On a single sm_86 NVIDIA GPU (A6000), I measured performance improvements across several runs, with reduced iteration times and higher tokens/sec; results may vary. Feedback from others would be great.