
[CK_TILE] Share partition index across threads and specify offset in load_tile()/async_load_tile()/load_tile_transpose()

Open poyenc opened this issue 2 months ago • 0 comments

Proposed changes

  1. Allow specifying a replacement partition index for the tile_window_with_static_distribution<> template, with the resulting offset applied as an immediate offset in the instructions.
  2. Allow specifying an additional offset for the load_tile(), async_load_tile() and load_tile_transpose() calls to ensure that each thread reads the correct data, even when sharing the same partition index.
  3. Use a warp-uniform LDS offset in tile_window::async_load() to avoid unnecessary VOPs and readfirstlane instructions.
  4. Allow passing a pre-computed partition_index to the tile window utilities and to Default2DEpilogue<>.
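
To illustrate the idea behind items 1 and 2, here is a minimal host-side sketch, not CK_TILE code. It assumes a simple linear layout in which each lane owns a contiguous slice; `PartitionIndex`, `slice_origin()`, `load_elem()` and the constants are hypothetical names invented for this example and do not reflect the actual signatures changed in this PR. It checks that addressing through a shared (lane-0) partition index plus a per-lane extra offset reaches the same element as addressing through each thread's own partition index.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT CK_TILE types.
struct PartitionIndex
{
    int warp;
    int lane;
};

constexpr int kWarps        = 2;  // assumed number of warps in the block
constexpr int kLanesPerWarp = 64; // assumed wavefront size
constexpr int kElemsPerLane = 4;  // assumed contiguous elements owned by each lane

// Element offset at which the slice owned by partition index `p` begins
// (assumed linear layout: warps, then lanes, then elements).
constexpr std::size_t slice_origin(PartitionIndex p)
{
    return (static_cast<std::size_t>(p.warp) * kLanesPerWarp + p.lane) * kElemsPerLane;
}

// Toy "load": reads one element at the slice origin plus an optional extra offset.
inline float load_elem(const float* base, PartitionIndex p, std::size_t extra_offset = 0)
{
    return base[slice_origin(p) + extra_offset];
}

// Returns true if, for every lane, loading through the shared lane-0 partition
// index plus a per-lane offset yields the same element as loading through the
// lane's own partition index.
inline bool shared_index_matches_per_thread_index()
{
    std::vector<float> buffer(static_cast<std::size_t>(kWarps) * kLanesPerWarp * kElemsPerLane);
    std::iota(buffer.begin(), buffer.end(), 0.0f); // buffer[i] == i

    for(int warp = 0; warp < kWarps; ++warp)
    {
        for(int lane = 0; lane < kLanesPerWarp; ++lane)
        {
            const PartitionIndex own{warp, lane};
            const PartitionIndex shared{warp, 0}; // same index for the whole warp
            const std::size_t extra = static_cast<std::size_t>(lane) * kElemsPerLane;

            if(load_elem(buffer.data(), own) != load_elem(buffer.data(), shared, extra))
            {
                return false;
            }
        }
    }
    return true;
}
```

In this model the shared index is identical for every lane of a warp, so the part of the address derived from it is warp-uniform, and only the extra offset differs per lane. Whether a given offset can actually be encoded as an instruction immediate depends on the target and the instruction, which this sketch does not model.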

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • [ ] I have added inline documentation that helps the maintainers understand the motivation
  • [ ] I have removed the stale documentation which is no longer relevant after this pull request
  • [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • [x] I have run clang-format on all changed files
  • [x] Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered.

poyenc, Sep 23 '25 14:09