Jee Jee Li
@aldettinger Can you test whether #18773 fixes your issue?
> @mgoin @jeejeelee Could you help take a look at this PR which adds TP to bnb.
>
> Wonder whether you can give me a hand in the test...
This might be the same issue as https://github.com/vllm-project/vllm/pull/8329
> It's odd that Qwen2-VL-7B-Instruct-GPTQ-Int4 works while -GPTQ-Int8 does not.

The extra bias was likely added during int8 quantization. I am fixing this bug now.
You can try commenting out or deleting:
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
```
Have you tested triton 3.2.0?
> Thanks. LGTM
>
> Can you also add a `Co-authored-by: Aaron Pham ` to the description.

Done
> Ah we need to gate the copy over in `_is_cuda()` only here.
>
> ```diff
> 27fdbeea7 - chore: only gated in CUDA (HEAD -> fix-flash-att-rotray)
>
> Signed-off-by:...
> ```
I remember you mentioned a similar issue a long time ago. Has it still not been resolved?
@robertgshaw2-neuralmagic @comaniac There is a potential risk of illegal memory access. I have made changes but have not yet submitted them. Please refer to: [add_device_gurad](https://github.com/jeejeelee/vllm/blob/fix-moe-kernel/csrc/moe_align_block_size_kernels.cu#L115)
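For context, here is a rough sketch of the device-guard pattern I mean, written in the style of a PyTorch CUDA extension. The kernel and launcher names below are placeholders, not the actual vLLM code; the real change lives in `csrc/moe_align_block_size_kernels.cu`.

```cpp
// Sketch only: the c10 CUDAGuard pattern that makes the input tensor's device
// current before launching a kernel. Without it, the kernel can be launched on
// whatever device happens to be current, which is a common source of illegal
// memory access in multi-GPU setups.
#include <ATen/ATen.h>
#include <ATen/DeviceGuard.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

// Placeholder kernel, not the real moe_align_block_size kernel.
__global__ void copy_ids_kernel(const int32_t* src, int32_t* dst, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    dst[idx] = src[idx];
  }
}

void copy_ids(at::Tensor src, at::Tensor dst) {
  // The device guard: switch the current CUDA device to src's device for the
  // lifetime of this scope, and restore the previous device on exit.
  const at::cuda::OptionalCUDAGuard device_guard(at::device_of(src));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  const int n = static_cast<int>(src.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  copy_ids_kernel<<<blocks, threads, 0, stream>>>(
      src.data_ptr<int32_t>(), dst.data_ptr<int32_t>(), n);  // assumes int32 tensors
}
```

Roughly that pattern (an `OptionalCUDAGuard` placed before the launch) is what the link above is pointing at.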