[TPU] Enable gemma3-27b with TP>1 on multi-chips.
This PR enables gemma3-27b with TP>1 on multi-chip TPU hosts. Without this change, it fails with the error below.
Call stack:
Traceback (most recent call last):
File "/home/xiowei/vllm/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
output = func(*args, **kwargs)
File "/home/xiowei/vllm/vllm/v1/worker/tpu_worker.py", line 160, in determine_available_memory
self.model_runner.profile_run(self.model_runner.max_num_tokens)
File "/home/xiowei/vllm/vllm/v1/worker/tpu_model_runner.py", line 1166, in profile_run
dummy_encoder_outputs = self.model.get_multimodal_embeddings(
File "/home/xiowei/vllm/vllm/model_executor/models/gemma3_mm.py", line 588, in get_multimodal_embeddings
return self._process_image_input(image_input)
File "/home/xiowei/vllm/vllm/model_executor/models/gemma3_mm.py", line 569, in _process_image_input
image_features = self._image_pixels_to_features(
File "/home/xiowei/vllm/vllm/model_executor/models/gemma3_mm.py", line 557, in _image_pixels_to_features
image_features = vision_tower(pixel_values.to(dtype=target_dtype))
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xiowei/vllm/vllm/model_executor/models/siglip.py", line 477, in forward
return self.vision_model(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xiowei/vllm/vllm/model_executor/models/siglip.py", line 419, in forward
hidden_states = self.embeddings(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xiowei/vllm/vllm/model_executor/models/siglip.py", line 135, in forward
embeddings = embeddings + self.position_embedding(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xiowei/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 406, in forward
masked_input, input_mask = get_masked_input_and_mask(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 671, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 768, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 753, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1357, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1246, in codegen_and_compile
compiled_module = graph.compile_to_module()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2201, in compile_to_module
return self._compile_to_module()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2209, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/graph.py", line 2140, in codegen
self.init_wrapper_code()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1898, in init_wrapper_code
self.device_ops = get_device_op_overrides(self.device_type)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/codegen/common.py", line 490, in get_device_op_overrides
return device_op_overrides_dict[device]
torch._inductor.exc.InductorError: KeyError: 'xla'
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
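Reading the traceback: the vocab-parallel embedding helper `get_masked_input_and_mask` is wrapped in `torch.compile`, Dynamo hands the graph to Inductor, and Inductor's device table has no entry for `'xla'`, hence the `KeyError: 'xla'` on TPU. The sketch below illustrates that general pattern and one way to guard against it (compile only where Inductor is supported, otherwise run eagerly). It is an illustration only, not the diff in this PR, and the platform check is a placeholder.

```python
# Illustration only -- not the change in this PR.
# A torch.compile-decorated helper defaults to the Inductor backend, and Inductor
# has no codegen registered for the 'xla' device, so tracing it on TPU raises
# KeyError: 'xla'. One generic guard: compile only where Inductor is supported.
import torch


def get_masked_input_and_mask_eager(input_: torch.Tensor,
                                     vocab_start: int,
                                     vocab_end: int):
    # Toy stand-in for the vocab-parallel masking helper: keep only the token ids
    # this rank owns and shift them into the local embedding-table range.
    mask = (input_ >= vocab_start) & (input_ < vocab_end)
    masked_input = (input_ - vocab_start) * mask
    return masked_input, mask


def inductor_is_supported() -> bool:
    # Placeholder platform check (assumption); real code would consult the
    # platform abstraction rather than "CUDA or bust".
    return torch.cuda.is_available()


get_masked_input_and_mask = (
    torch.compile(get_masked_input_and_mask_eager, dynamic=True)
    if inductor_is_supported()
    else get_masked_input_and_mask_eager
)
```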
Test plan: pytest -s -vv tests/v1/tpu/test_basic.py -k test_gemma3_with_mm_on_multichip 2>&1 | tee ~/out.txt
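For anyone reproducing this outside the test, something along these lines should hit the same path; the checkpoint name and TP size are assumptions, and the actual coverage is the pytest invocation above. Note that the crash happens during engine initialization (the memory-profiling run), so constructing the engine is enough to exercise it.

```python
# Hypothetical reproduction sketch (the real test is tests/v1/tpu/test_basic.py).
# The original failure happens in profile_run() during engine startup, so simply
# building the engine with TP>1 on a multi-chip TPU host exercises the fix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",  # assumed checkpoint name
    tensor_parallel_size=4,         # TP>1: one rank per TPU chip
    max_model_len=4096,
)

out = llm.generate(["Describe the picture in one sentence."],
                   SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```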
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
cc: @bvrockwell @yarongmu-google
Somehow, I still couldn't see my TPU CI running (is it because all the tests run in sequence and a job ahead of the TPU CI got stuck and blocked it?), nor could I start the TPU CI myself (the "Run TPU V1 Tests" button is grayed out).
The failing CI jobs look like timeouts. I don't see how my PR could cause that.
I retried the failing tests, but I think we can merge and ignore those timeouts.
Thanks @mgoin. I also did some checks on my A100 VM. For the two failing tests:
- VLLM_USE_V1=1 pytest -s -vv tests/mq_llm_engine/test_error_handling.py::test_mp_crash_detection: it also fails on the main branch (commit 4c33d6732148fdaeb9780fa86fca1f87f2a93c19), so it is not caused by this PR.
- VLLM_USE_V1=1 pytest -s -vv tests/v1/engine/test_engine_core_client.py -k test_startup_failure: it succeeds on my branch xiowei/gemma3-27b-multi-chip.
Could you help merge the PR? Thanks!
Nice improvement, and the TPU V1 test is green!