[GPU][DT] Tracking issue for data-tiled llama 3.1 405b

Open jtuyls opened this issue 3 months ago • 0 comments

To enable data-tiling on llama 3.1 405b, we need a couple of new features and fixes. This issue tracks the sub-tasks and their progress, and is the place to discuss performance numbers once we get there.

Some of the initial tasks:

[ ] The memory footprint of the data-tiled execution needs to be reduced so that the 405b model weights fit on a single GPU: https://github.com/iree-org/iree/issues/21659
[x] We need support for scaled matmul with encodings as the main matmuls will operate on the mxfp4 data type: https://github.com/iree-org/iree/issues/21923
[ ] We need to implement a ukernel with the mxfp4 data type and MFMAs: https://github.com/iree-org/iree/issues/21938
[x] Get Llama 405b MLIR without asm/wave mxfp4 kernels: https://github.com/iree-org/iree/issues/22002

Sep 11 '25 07:09 jtuyls