Is there a parallel between tile-based GPU/TPU kernels and Cubed chunks?
Tile-based operations have been quite successful for creating high-performance GPU kernels. The programming model, in my understanding, offers flexibility while taking advantage of cache hierarchies.
http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
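For concreteness, here is the canonical vector-add tile kernel in Triton (adapted from its tutorials; the `BLOCK_SIZE` of 1024 and the `add` wrapper are my illustrative choices, and it needs a CUDA GPU to actually run):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile; the compiler
    # stages the tile through the memory hierarchy (registers/shared memory).
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```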
The Triton language takes advantage of this model by providing a sort of MLIR/LLVM middleware for custom kernel acceleration of specific NN ops. JAX now even offers its own portable version of kernel control with tile semantics via Pallas.
https://jax.readthedocs.io/en/latest/pallas/index.html
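A minimal Pallas sketch of the same tile semantics (assuming a recent JAX where `pl.BlockSpec` takes the block shape before the index map; `interpret=True` is only there so the sketch runs without an accelerator):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_tile_kernel(x_ref, y_ref, o_ref):
    # The refs point at one tile that pallas_call has staged in fast memory.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    tile = 256
    return pl.pallas_call(
        add_tile_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        grid=(x.shape[0] // tile,),  # one program instance per tile
        in_specs=[pl.BlockSpec((tile,), lambda i: (i,)),
                  pl.BlockSpec((tile,), lambda i: (i,))],
        out_specs=pl.BlockSpec((tile,), lambda i: (i,)),
        interpret=True,  # drop this on a real GPU/TPU
    )(x, y)

x = jnp.arange(1024, dtype=jnp.float32)
print(add(x, x)[:4])  # [0. 2. 4. 6.]
```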
I can’t help but think that there are parallels between Cubed’s chunked blockwise ops and these tile-based techniques. What could an intersection look like?
- Maybe, as is, business logic written in Cubed already has affordances for GPU/TPU lowering
- If not, how can we make that so?
- More diabolical still, could Cubed do this for users automatically when accelerated arrays are used (#304)? How similar are tiles to chunks, anyway (see the sketch after this list)? The array-aware abstractions of Cubed seem, to me, to offer enough information to make optimizations in compute. Where this is limited, I suspect modifications to Spec could make the difference.
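To make the parallel concrete, here is a rough sketch of a per-chunk function applied with Cubed's `map_blocks`. The `Spec` arguments are just plausible values, and the comment suggesting the chunk function could be a compiled tile kernel is an assumption about future lowering, not current Cubed behavior:

```python
import numpy as np
import cubed
import cubed.array_api as xp

# allowed_mem is Cubed's per-task memory budget; values here are illustrative.
spec = cubed.Spec(work_dir="tmp", allowed_mem="1GB")

a = xp.asarray(np.arange(16, dtype=np.float32).reshape(4, 4),
               chunks=(2, 2), spec=spec)

def chunk_kernel(block):
    # Runs once per in-memory chunk, much like a tile kernel runs once per
    # tile. Hypothetically, `block` could be a device array and this body a
    # jax.jit- or Pallas-compiled function.
    return block * 2 + 1

b = cubed.map_blocks(chunk_kernel, a, dtype=a.dtype)
print(b.compute())
```

Structurally this mirrors the tile kernels above: one function instance per chunk/tile, with the framework owning the iteration space and the memory budget, which is what makes me think the information needed for lowering is largely already there.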