feat: support kFactor 8 used in mma tensor op tile iterator
support Layout::kFactor with 8 loading data from shared memory:
| 0 | 16 | 32 | 48 | 64 | 80 | 96 | 112 |
|---|---|---|---|---|---|---|---|
| T0 | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
| T9 | T8 | T11 | T10 | T13 | T12 | T15 | T14 |
| T18 | T19 | T16 | T17 | T22 | T23 | T20 | T21 |
| T27 | T26 | T25 | T24 | T31 | T30 | T29 | T28 |
@hwu36 Could you help review my pull request, please
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.