
Is there a parallel between tiled GPU/TPU kernels and Cubed chunks?

Open alxmrs opened this issue 1 year ago • 7 comments

Tile-based operations have been quite a success for creating optimal GPU kernels. The programming model, in my understanding, offers flexibility while taking advantage of cache hierarchies.
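To make the cache-hierarchy point concrete, here is a minimal blocked (tiled) matrix multiply in NumPy. This is only an illustrative sketch of the tiling idea, not a real GPU kernel; real kernels (e.g. those Triton generates) additionally manage shared memory, vectorization, and thread scheduling. The `tile` size and function name are my own choices.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Blocked matmul: each (tile x tile) output block is computed from
    tile-sized panels of a and b, so the working set stays small enough
    to fit in a fast level of the memory hierarchy."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):           # rows of output tiles
        for j in range(0, n, tile):       # cols of output tiles
            for p in range(0, k, tile):   # accumulate over the shared dim
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c
```

The loop nest over `(i, j)` is exactly the grid of program instances a tile-based kernel launches in parallel.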

http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

The Triton language takes advantage of this model by providing a sort of MLIR/LLVM middleware for custom kernel acceleration of specific NN ops. JAX now even offers its own portable version of kernel control with tile semantics via Pallas.

https://jax.readthedocs.io/en/latest/pallas/index.html
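For readers unfamiliar with Pallas, a minimal elementwise kernel might look like the sketch below. The kernel body receives `Ref`s to tiles staged in fast memory; `interpret=True` runs it on CPU for illustration, while on GPU/TPU the same body is lowered via Triton/Mosaic. This is a hedged sketch from my reading of the Pallas docs, not code from Cubed.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Each program instance reads its input tiles and writes one output tile.
    o_ref[...] = x_ref[...] + y_ref[...]

def add(x, y):
    # With no grid/BlockSpec given, the whole array is treated as one tile.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,  # CPU interpreter mode, for portability of the example
    )(x, y)
```

The structural resemblance to a blockwise op over chunks is what prompts this issue.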

I can’t help but think that there are parallels between Cubed’s chunked blockwise op and these tile-based techniques. What could an intersection look like?

  • Maybe, as is, business logic written in Cubed would have affordances for GPU/TPU lowering
  • If not, how can we make that so?
  • More diabolical still, could Cubed do this for users automatically when accelerated arrays are used (#304)? How similar are tiles to chunks, anyway? The array-aware abstractions of Cubed, to me, seem to offer enough information to make optimizations in compute. Where this is limited, I suspect modifications to Spec could make the difference.
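As a way of framing the tiles-vs-chunks question above, here is a toy blockwise in plain NumPy: apply a function independently to matching chunks of two arrays, touching only one chunk pair at a time. This is a hypothetical sketch for discussion, not Cubed's actual implementation; the names `blockwise` and `chunk` are mine.

```python
import numpy as np
from itertools import product

def blockwise(func, x, y, chunk=2):
    """Toy chunked blockwise: func is applied per chunk, so peak memory is
    bounded by the chunk size rather than the array size. Replacing this
    serial loop with a grid of tile programs launched on an accelerator
    is, structurally, the Triton/Pallas model."""
    out = np.empty(x.shape, dtype=np.result_type(x, y))
    starts = product(range(0, x.shape[0], chunk), range(0, x.shape[1], chunk))
    for i, j in starts:
        sl = (slice(i, i + chunk), slice(j, j + chunk))
        out[sl] = func(x[sl], y[sl])  # only one chunk pair resident here
    return out
```

If a Spec could record that chunks fit an accelerator's fast memory, the same loop body looks a lot like a lowerable tile kernel.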

alxmrs Jun 25 '24 13:06