hipBLASLt
Allow architectures with different CU counts to share kernel code objects
This PR reduces the hipBLASLt library size by merging the per-CU-count code objects into one, so that an identical kernel is no longer duplicated across the code objects for different CU counts. The trade-off is that code object loading time may increase.
Library size as of 09/20, without vs. with this PR: without: 12800M, with: 9980M (about 22% smaller).
This can't work, because Henry hardcoded the CU count into the asm kernels via fork parameters.
This PR only merges the code objects for kernels that are exactly the same between 80 CU and 304 CU. It won't impact the kernel selection design, which means 80 CU gfx942 can find a 304 CU solution even without this PR.
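To illustrate what "merging exact same kernels" means, here is a minimal Python sketch. It is not the code in this PR: the function name `merge_identical_kernels`, the input layout (`{cu_count: {kernel_name: kernel_bytes}}`), and the use of SHA-256 to compare kernels are all assumptions made for the example. A kernel goes into the shared set only if every CU count has it with byte-identical content. Everything else stays per CU count.

```python
import hashlib


def merge_identical_kernels(kernels_by_cu):
    """Split kernels into a shared set and per-CU-count leftovers.

    kernels_by_cu: hypothetical layout {cu_count: {kernel_name: kernel_bytes}}.
    Returns (shared, per_cu).
    """
    cu_counts = list(kernels_by_cu)
    shared = {}
    per_cu = {cu: {} for cu in cu_counts}

    # Every kernel name seen for any CU count.
    all_names = set().union(*(k.keys() for k in kernels_by_cu.values()))

    for name in sorted(all_names):
        present_everywhere = all(name in kernels_by_cu[cu] for cu in cu_counts)
        digests = {
            hashlib.sha256(kernels_by_cu[cu][name]).hexdigest()
            for cu in cu_counts
            if name in kernels_by_cu[cu]
        }
        if present_everywhere and len(digests) == 1:
            # Byte-identical for all CU counts: keep a single copy.
            shared[name] = kernels_by_cu[cu_counts[0]][name]
        else:
            # Differs (e.g. CU count baked into the kernel) or missing somewhere.
            for cu in cu_counts:
                if name in kernels_by_cu[cu]:
                    per_cu[cu][name] = kernels_by_cu[cu][name]
    return shared, per_cu


kernels = {
    80: {"gemm_a": b"same", "gemm_b": b"cu80"},
    304: {"gemm_a": b"same", "gemm_b": b"cu304", "gemm_c": b"only304"},
}
shared, per_cu = merge_identical_kernels(kernels)
print(sorted(shared), sorted(per_cu[80]), sorted(per_cu[304]))
```

In this toy input only `gemm_a` is shared. `gemm_b` differs between 80 CU and 304 CU (as it would if the CU count were hardcoded into the kernel), so it stays duplicated.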
We need to wait until ActivationFunCall is enabled for 80 CU HHS/BBS.
Closing the pull request in this repo. Please refer to the migrated pull request for updates.