David Blincoe

Results 1 comments of David Blincoe

I ran some latency-focused testing on this PR using LLaMA 3.2 1B Instruct with a small batch size (~1-2) in a highly latency-constrained setting where minimizing CUDA graph launches can...