
gpt2_forward: add CUDA streams with events for async layered operations, plus cache prefetching for efficient data access with high temporal locality

Open bgorlick opened this issue 8 months ago • 0 comments

In the forward pass in train_gpt2.cu:

  • adding CUDA streams with events for async layered operations
  • adding offset precalculations and cache prefetching for efficient data access with high temporal locality

changes

  • CUDA streams and events (see the sketch after this list)

    • four independent CUDA streams: input copy, target copy, compute, and loss
    • non-blocking streams overlap data transfers with computation and loss calculation
    • prioritized streams to minimize interference
  • cache prefetching (a host-side sketch follows the goal paragraph below)

    • prefetching offsets into cache for enhanced CPU-GPU data handling
    • high temporal locality hints
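
As a rough illustration of the streams-and-events idea, here is a minimal sketch (not the PR's actual code) of how four prioritized, non-blocking streams plus ordering events could overlap the input/target copies with the forward compute and the loss work. The struct and function names (`ForwardStreams`, `forward_streams_init`, `forward_begin`) and the buffer parameters are hypothetical placeholders.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call) do {                                        \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                cudaGetErrorString(err_), __FILE__, __LINE__);       \
        exit(EXIT_FAILURE);                                          \
    }                                                                 \
} while (0)

// Hypothetical container for the four streams and the ordering events.
typedef struct {
    cudaStream_t input_stream;   // H2D copy of the input token ids
    cudaStream_t target_stream;  // H2D copy of the target token ids
    cudaStream_t compute_stream; // forward-pass kernels
    cudaStream_t loss_stream;    // loss / reduction work
    cudaEvent_t  inputs_ready;   // fires when the input copy has finished
    cudaEvent_t  targets_ready;  // fires when the target copy has finished
} ForwardStreams;

void forward_streams_init(ForwardStreams* s) {
    int least_pri, greatest_pri; // numerically lower value = higher priority
    CUDA_CHECK(cudaDeviceGetStreamPriorityRange(&least_pri, &greatest_pri));
    // Give the compute stream the highest priority so copies don't starve it.
    CUDA_CHECK(cudaStreamCreateWithPriority(&s->compute_stream, cudaStreamNonBlocking, greatest_pri));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s->input_stream,   cudaStreamNonBlocking, least_pri));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s->target_stream,  cudaStreamNonBlocking, least_pri));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s->loss_stream,    cudaStreamNonBlocking, least_pri));
    // Timing is disabled; the events are used purely for inter-stream ordering.
    CUDA_CHECK(cudaEventCreateWithFlags(&s->inputs_ready,  cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&s->targets_ready, cudaEventDisableTiming));
}

// Kick off both copies on their own streams, then make the compute stream wait
// only for the inputs and the loss stream wait only for the targets, so the
// target upload overlaps with the start of the forward pass.
// Note: true overlap of cudaMemcpyAsync assumes pinned (cudaMallocHost) host buffers.
void forward_begin(ForwardStreams* s,
                   int* d_inputs,  const int* h_inputs,
                   int* d_targets, const int* h_targets,
                   size_t n_tokens) {
    CUDA_CHECK(cudaMemcpyAsync(d_inputs, h_inputs, n_tokens * sizeof(int),
                               cudaMemcpyHostToDevice, s->input_stream));
    CUDA_CHECK(cudaEventRecord(s->inputs_ready, s->input_stream));

    CUDA_CHECK(cudaMemcpyAsync(d_targets, h_targets, n_tokens * sizeof(int),
                               cudaMemcpyHostToDevice, s->target_stream));
    CUDA_CHECK(cudaEventRecord(s->targets_ready, s->target_stream));

    // Kernels launched on compute_stream after this call won't start until the
    // input copy has completed, without blocking the host thread.
    CUDA_CHECK(cudaStreamWaitEvent(s->compute_stream, s->inputs_ready, 0));
    CUDA_CHECK(cudaStreamWaitEvent(s->loss_stream, s->targets_ready, 0));
}
```

The key design point in this sketch is that the compute stream waits only on the input copy and the loss stream waits only on the target copy, so the two transfers and the compute can proceed concurrently on hardware with separate copy engines.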

The goal is to improve performance by reducing waiting time for memory transfers through asynchronous operations, and by optimizing data access patterns to reduce cache miss rates. Overlapping data transfers with the computation and the loss calculation, together with high temporal locality hints, should improve cache efficiency and execution speed.
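
For the prefetching part, the idea could look roughly like the host-side sketch below. The names (`prefetch_layer_offsets`, `layer_offsets`, `params_memory`) are illustrative, not the PR's actual identifiers; `__builtin_prefetch` is a GCC/Clang builtin, and a locality argument of 3 is the high-temporal-locality hint (keep the data in as many cache levels as possible).

```cuda
#include <stddef.h>

// Hypothetical host-side helper: with the per-layer byte offsets precalculated
// once, issue prefetch hints for the start of each layer's block before the
// loop that touches it (e.g. to stage data for the async H2D copies).
static inline void prefetch_layer_offsets(const float* params_memory,
                                          const size_t* layer_offsets,
                                          int num_layers) {
    for (int l = 0; l < num_layers; l++) {
        // rw = 0: prefetch for a read; locality = 3: high temporal locality,
        // keep the line resident across the upcoming accesses.
        __builtin_prefetch(params_memory + layer_offsets[l], 0, 3);
    }
}
```

These hints are advisory only, so excessive or mistimed prefetching degrades gracefully; whether they help in practice depends on the access pattern and is best confirmed by profiling.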

The impact may be less noticeable for small models.

The modified sections are heavily commented, both to document the changes and to educate.

testing

On a single sm_86 NVIDIA A6000 GPU I measured performance improvements across multiple runs, with reduced iteration times and higher tokens/sec; results may vary. It would be great to get feedback from others.

bgorlick · Jun 18 '24 04:06