[GPU][DT] Tracking issue for data-tiled llama 3.1 405b
To enable data-tiling on Llama 3.1 405b, we need a few new features and fixes. This issue tracks the sub-tasks and their progress, and is the place to discuss performance numbers once we get there.
Some of the initial tasks:
- [ ] The memory footprint of the data-tiled execution needs to be reduced so that the 405b model weights fit on a single GPU: https://github.com/iree-org/iree/issues/21659
- [x] We need support for scaled matmul with encodings, since the main matmuls will operate on the mxfp4 data type: https://github.com/iree-org/iree/issues/21923
- [ ] We need to implement a ukernel with the mxfp4 data type and MFMAs: https://github.com/iree-org/iree/issues/21938
- [x] Get Llama 405b MLIR without asm/wave mxfp4 kernels: https://github.com/iree-org/iree/issues/22002
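
For a rough sense of scale on the memory-footprint item, here is a back-of-the-envelope sketch of the raw weight size under mxfp4. It assumes the OCP MX layout (4-bit elements with one shared 8-bit scale per block of 32), treats the parameter count as a nominal 405e9, and pretends every parameter is stored in mxfp4, which will not hold exactly for the real model. It ignores padding introduced by data-tiling, activations, and the KV cache, and the helper name `mxfp4_weight_bytes` is made up for illustration.

```python
def mxfp4_weight_bytes(num_params: int,
                       block_size: int = 32,
                       elem_bits: int = 4,
                       scale_bits: int = 8) -> int:
    """Bytes needed to store num_params values in a block-scaled format.

    Defaults follow the assumed mxfp4 layout: 4-bit elements plus one
    shared 8-bit scale per block of 32 elements.
    """
    num_blocks = -(-num_params // block_size)  # ceiling division
    total_bits = num_params * elem_bits + num_blocks * scale_bits
    return -(-total_bits // 8)  # round up to whole bytes


# Nominal parameter count (assumption: exactly 405e9, all in mxfp4).
NOMINAL_PARAMS = 405_000_000_000

if __name__ == "__main__":
    size = mxfp4_weight_bytes(NOMINAL_PARAMS)
    print(f"approx. mxfp4 weight size: {size / 1e9:.1f} GB")
```

Under these assumptions the shared scales add about 6% on top of the 4-bit payload, so any extra padding from the data-tiled layout comes on top of that baseline.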