[Bug][V1]: TP is broken when torch compile cache is used
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Got the following error message when using tp_size=4:
(VllmWorker rank=2 pid=2307184) ERROR 02-17 14:48:01 multiproc_executor.py:374] ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
Importantly, the bug doesn't happen when the torch.compile cache is not used.
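For reference, a minimal repro sketch along the lines described above (the model name and prompt are placeholders, not taken from the original report; the failure would be expected on a run that reuses cached torch.compile artifacts, not on the first run that populates the cache):

from vllm import LLM, SamplingParams

# First run populates the torch.compile cache; a second, identical run reuses
# the cached artifacts across the TP workers and should trigger the error above.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)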
The error is raised at the first torch.compile-generated op for the embedding layer:
with torch.cuda._DeviceGuard(0):
    torch.cuda.set_device(0)
    buf0 = empty_strided_cuda((s0, 4096), (4096, 1), torch.bfloat16)
    # Topologically Sorted Source Nodes: [ge, lt, and_, ge_1, lt_1, and__1, or_, masked_fill_, mul, mul_1, add, sub, mul_2, embedding], Original ATen: [aten.ge, aten.lt, aten.bitwise_and, aten.bitwise_or, aten.masked_fill, aten.mul, aten.add, aten.sub, aten.embedding]
    triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel = 4096*s0
    stream0 = get_raw_stream(0)
    triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0.run(arg0_1, arg2_1, buf0, triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel, grid=grid(triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel), stream=stream0)
Here, the input arguments (arg0_1 and arg2_1, which correspond to the input activations and weights) live on cuda:{rank}, while the output tensor (buf0) is allocated on cuda:0 regardless of the actual rank.
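A minimal sketch of that device mismatch, assuming at least three visible GPUs (shapes and the rank are illustrative only, not from the original report): the cached generated code pins allocation to device 0, while the worker's inputs live on its own device.

import torch

torch.cuda.set_device(2)                       # e.g. the worker for rank 2
inputs = torch.randn(8, 4096, device="cuda")   # activations live on cuda:2

with torch.cuda._DeviceGuard(0):               # device index hard-coded in the cached code
    torch.cuda.set_device(0)
    buf0 = torch.empty(8, 4096, dtype=torch.bfloat16, device="cuda")  # lands on cuda:0

# The generated Triton kernel then receives pointers from two different devices,
# which is consistent with the "Pointer argument (at 0) cannot be accessed from
# Triton" error shown above.
assert inputs.device != buf0.device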
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.