Is there a parallel between tile-based GPU/TPU kernels and Cubed chunks?
Tile-based operations have been quite successful for creating high-performance GPU kernels. The programming model, in my understanding, offers flexibility while taking advantage of cache hierarchies.
http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
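For concreteness, here is the canonical vector-add tile kernel in Triton (adapted from its tutorials; the `BLOCK_SIZE` of 1024 and the `add` wrapper are my illustrative choices, and it needs a CUDA GPU to actually run):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile; the compiler
    # stages the tile through the memory hierarchy (registers/shared memory).
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```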
The Triton language takes advantage of this model by providing a sort of MLIR/LLVM middleware for custom kernel acceleration of specific NN ops. JAX now even offers its own portable version of kernel control with tile semantics via Pallas.
https://jax.readthedocs.io/en/latest/pallas/index.html
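A minimal Pallas sketch of the same tile semantics (assuming a recent JAX where `pl.BlockSpec` takes the block shape before the index map; `interpret=True` is only there so the sketch runs without an accelerator):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_tile_kernel(x_ref, y_ref, o_ref):
    # The refs point at one tile that pallas_call has staged in fast memory.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    tile = 256
    return pl.pallas_call(
        add_tile_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        grid=(x.shape[0] // tile,),  # one program instance per tile
        in_specs=[pl.BlockSpec((tile,), lambda i: (i,)),
                  pl.BlockSpec((tile,), lambda i: (i,))],
        out_specs=pl.BlockSpec((tile,), lambda i: (i,)),
        interpret=True,  # drop this on a real GPU/TPU
    )(x, y)

x = jnp.arange(1024, dtype=jnp.float32)
print(add(x, x)[:4])  # [0. 2. 4. 6.]
```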
I can’t help but think that there are parallels between Cubed’s chunked blockwise ops and these tile-based techniques. What could an intersection look like?
- Maybe, as is, business logic written in Cubed already has affordances for GPU/TPU lowering
- If not, how can we make that so?
- More diabolical still, could Cubed do this for users automatically when accelerated arrays are used (#304)? How similar are tiles to chunks, anyway (see the sketch after this list)? The array-aware abstractions of Cubed seem, to me, to offer enough information to make optimizations in compute. Where this is limited, I suspect modifications to Spec could make the difference.
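To make the parallel concrete, here is a rough sketch of a per-chunk function applied with Cubed's `map_blocks`. The `Spec` arguments are just plausible values, and the comment suggesting the chunk function could be a compiled tile kernel is an assumption about future lowering, not current Cubed behavior:

```python
import numpy as np
import cubed
import cubed.array_api as xp

# allowed_mem is Cubed's per-task memory budget; values here are illustrative.
spec = cubed.Spec(work_dir="tmp", allowed_mem="1GB")

a = xp.asarray(np.arange(16, dtype=np.float32).reshape(4, 4),
               chunks=(2, 2), spec=spec)

def chunk_kernel(block):
    # Runs once per in-memory chunk, much like a tile kernel runs once per
    # tile. Hypothetically, `block` could be a device array and this body a
    # jax.jit- or Pallas-compiled function.
    return block * 2 + 1

b = cubed.map_blocks(chunk_kernel, a, dtype=a.dtype)
print(b.compute())
```

Structurally this mirrors the tile kernels above: one function instance per chunk/tile, with the framework owning the iteration space and the memory budget, which is what makes me think the information needed for lowering is largely already there.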