[Bug][V1]: TP is broken when torch compile cache is used
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Got the following error message when using tp_size=4:
(VllmWorker rank=2 pid=2307184) ERROR 02-17 14:48:01 multiproc_executor.py:374] ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
Importantly, the bug doesn't happen when the torch.compile cache is not used.
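For reference, a minimal repro sketch along the lines described above (the model name and prompt are placeholders, not taken from the original report; the failure would be expected on a run that reuses cached torch.compile artifacts, not on the first run that populates the cache):

from vllm import LLM, SamplingParams

# First run populates the torch.compile cache; a second, identical run reuses
# the cached artifacts across the TP workers and should trigger the error above.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)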
The error is raised at the first torch.compile-generated op for the embedding layer:
with torch.cuda._DeviceGuard(0):
    torch.cuda.set_device(0)
    buf0 = empty_strided_cuda((s0, 4096), (4096, 1), torch.bfloat16)
    # Topologically Sorted Source Nodes: [ge, lt, and_, ge_1, lt_1, and__1, or_, masked_fill_, mul, mul_1, add, sub, mul_2, embedding], Original ATen: [aten.ge, aten.lt, aten.bitwise_and, aten.bitwise_or, aten.masked_fill, aten.mul, aten.add, aten.sub, aten.embedding]
    triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel = 4096*s0
    stream0 = get_raw_stream(0)
    triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0.run(arg0_1, arg2_1, buf0, triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel, grid=grid(triton_poi_fused_add_bitwise_and_bitwise_or_embedding_ge_lt_masked_fill_mul_sub_0_xnumel), stream=stream0)
Here, the input arguments (arg0_1 and arg2_1, which correspond to the input activations and weights) live on cuda:{rank}, while the output tensor (buf0) is allocated on cuda:0 regardless of the actual rank.
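A minimal sketch of that device mismatch, assuming at least three visible GPUs (shapes and the rank are illustrative only, not from the original report): the cached generated code pins allocation to device 0, while the worker's inputs live on its own device.

import torch

torch.cuda.set_device(2)                       # e.g. the worker for rank 2
inputs = torch.randn(8, 4096, device="cuda")   # activations live on cuda:2

with torch.cuda._DeviceGuard(0):               # device index hard-coded in the cached code
    torch.cuda.set_device(0)
    buf0 = torch.empty(8, 4096, dtype=torch.bfloat16, device="cuda")  # lands on cuda:0

# The generated Triton kernel then receives pointers from two different devices,
# which is consistent with the "Pointer argument (at 0) cannot be accessed from
# Triton" error shown above.
assert inputs.device != buf0.device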
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.