Chen Zhengda
I have also encountered the same problem. Do you know how to solve it? @zt991211
Besides the error in p1 = p2 = p3 = (int)timestep - mrope_position_delta_, the current branch produces incorrect results during batch inference. @irexyc
Hi @void-main, I've encountered the same issue with high Triton kernel launch overhead. Could you please share any solutions or workarounds that have worked for you? Thank you!
Hi @void-main, first of all, thank you very much for your suggestions! I have a couple of questions. In my scenario, I have dynamically shaped inputs, so I wonder if...
@sleepwalker2017 If you don't need to debug CUDA code, you can remove the -G option from CMAKE_CUDA_FLAGS_DEBUG.
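For reference, a minimal sketch of what removing -G could look like in a CMakeLists.txt. It assumes the project appends -G to CMAKE_CUDA_FLAGS_DEBUG somewhere earlier in its CMake files. The actual location and surrounding flags in the project are not shown in this thread, so treat this as an illustration rather than the project's real configuration:

```cmake
# Assumption: CMAKE_CUDA_FLAGS_DEBUG has already been set (with -G) above this line.
# -G makes nvcc generate device debug info and disables device-code optimizations,
# so Debug builds of the kernels run much slower.
# Strip a standalone -G token while leaving the other debug flags (e.g. -g) intact.
string(REGEX REPLACE "(^| )-G( |$)" " "
       CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG}")

# Print the result so the change can be confirmed at configure time.
message(STATUS "CMAKE_CUDA_FLAGS_DEBUG: ${CMAKE_CUDA_FLAGS_DEBUG}")
```

Host-side debugging with -g still works after this change; only device-side (kernel) debugging is lost.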