menggeliu1205

2 issues from menggeliu1205

You should add a `threadIdx.x == 0` check around the write to `y_warpsize`. Otherwise the kernel will produce the wrong answer.
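The issue does not show the kernel in question, so the following is only a minimal sketch of why such a guard can matter. It assumes a hypothetical kernel `warp_sum` in which each block is exactly one warp and the sum is computed with `__shfl_down_sync`. The kernel name, the launch shape, and the reduction scheme are all assumptions made for illustration; only `threadIdx.x` and `y_warpsize` come from the issue. After a shuffle-down reduction only lane 0 holds the complete sum, so an unguarded store would let other lanes overwrite the result with partial values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Hypothetical kernel (not taken from the issue): one warp per block,
// each block reduces 32 inputs into y_warpsize[blockIdx.x].
__global__ void warp_sum(const float* x, float* y_warpsize, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range lanes contribute 0 but still take part in the shuffles.
    float v = (idx < n) ? x[idx] : 0.0f;

    // Shuffle-down reduction: afterwards only lane 0 holds the full sum,
    // the other lanes hold partial sums.
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }

    // The guard from the issue: only thread 0 writes the result.
    // Without it, every lane stores its own (mostly partial) value
    // to the same address and the final content is not the full sum.
    if (threadIdx.x == 0) {
        y_warpsize[blockIdx.x] = v;
    }
}

int main() {
    const int n = 64;
    const int blocks = n / kWarpSize;
    float hx[n];
    for (int i = 0; i < n; ++i) hx[i] = 1.0f;

    float* dx = nullptr;
    float* dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, blocks * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    warp_sum<<<blocks, kWarpSize>>>(dx, dy, n);

    float hy[blocks];
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hy, dy, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b) {
        printf("y_warpsize[%d] = %f\n", b, hy[b]);
    }

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

If the real kernel uses more than one warp per block, the equivalent guard would typically be on the lane index (`threadIdx.x % warpSize == 0`) rather than on `threadIdx.x == 0` alone; which one applies depends on code the issue does not show.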

**System Info**

- Device: H20
- Driver: 550.90.07

**python env:**

- nvidia-cublas-cu12 12.1.3.1
- nvidia-cuda-nvrtc-cu12 12.1.105
- nvidia-cuda-runtime-cu12 12.1.105
- nvidia-cudnn-cu12 8.9.2.26
- tensorrt 10.0.1
- tensorrt-cu12-bindings 10.0.1
- tensorrt-cu12-libs 10.0.1
- tensorrt-llm 0.12.0.dev2024071600

**model**

Qwen14B: https://huggingface.co/Qwen/Qwen-14B

**expected behavior**

Expect...

functionality issue