Raja Gond
@serendipity-zk
Thanks for the reply. Additionally, why are you not doing that for prefill? Also, since decoding is memory-bound, wouldn't breaking it into two or more microbatches be inefficient?
That makes sense. DeepSeek-V3/R1 is large, so 256 seems sufficient.
Thanks for the reply. Yeah, it’s an H100 PCIe box. However, I haven’t optimized the Triton kernel yet. One more question: when you’re running prefill and decode in parallel using...