Deep dive into the Megakernel approach, since it's well aligned with OCANNL's design
They implement an interpreter on the GPU. Maybe we can avoid that and still use their solutions for within-kernel synchronization, or maybe we can go the interpreter route. To be decided.
https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles?s=08
https://github.com/mirage-project/mirage/tree/mpk
https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17
https://x.com/JiaZhihao/status/1935767958963314773
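
To make the "interpreter plus within-kernel synchronization" idea concrete, here is a toy CPU-side sketch in Python. It is not taken from the linked code. It assumes a counter-based scheme, which is my reading of the general idea in the posts above: each worker (a stand-in for an SM) walks its own instruction list, waits until a shared counter reaches a target before starting an instruction, and bumps a counter when the instruction finishes. All names here (`Counters`, `run_worker`, `run_megakernel`, `demo`) are hypothetical, and a real GPU version would use atomics and memory fences on global memory rather than a condition variable.

```python
import threading


class Counters:
    """Shared completion counters (stand-in for counters in GPU global memory)."""

    def __init__(self, n):
        self._values = [0] * n
        self._cond = threading.Condition()

    def increment(self, idx):
        # Signal that one more producer of this dependency has finished.
        with self._cond:
            self._values[idx] += 1
            self._cond.notify_all()

    def wait_for(self, idx, target):
        # Block until counter `idx` has reached at least `target`.
        with self._cond:
            self._cond.wait_for(lambda: self._values[idx] >= target)


def run_worker(program, counters, memory):
    """Interpreter loop for one worker: wait, execute, signal, repeat."""
    for wait_idx, wait_target, op, signal_idx in program:
        if wait_idx is not None:
            counters.wait_for(wait_idx, wait_target)
        op(memory)
        if signal_idx is not None:
            counters.increment(signal_idx)


def run_megakernel(programs, n_counters, memory):
    """Run every worker's instruction list concurrently, like one persistent launch."""
    counters = Counters(n_counters)
    threads = [
        threading.Thread(target=run_worker, args=(prog, counters, memory))
        for prog in programs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory


def demo():
    # Two partial sums run on different workers; the combine step waits
    # until counter 0 shows that both partial sums are done.
    def partial_a(m):
        m["a"] = sum(m["x"][:2])

    def partial_b(m):
        m["b"] = sum(m["x"][2:])

    def combine(m):
        m["total"] = m["a"] + m["b"]

    # Instruction format: (wait_counter, wait_target, op, signal_counter)
    worker0 = [(None, 0, partial_a, 0), (0, 2, combine, None)]
    worker1 = [(None, 0, partial_b, 0)]
    return run_megakernel([worker0, worker1], 1, {"x": [1, 2, 3, 4]})
```

Calling `demo()` runs both workers and returns the shared dictionary with `total` set to the sum of `x`. If we avoid the interpreter, the same wait and signal points could in principle be emitted directly into the generated kernel code instead of being read from an instruction list, but that is speculation at this point.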
Out of curiosity, why do you say that the Megakernel approach is aligned with OCANNL's design?
Because splitting megakernels into proper kernels is not implemented yet.
Less tongue-in-cheek: megakernel = routine in OCANNL terminology.