
Auto scheduling runs DeepSeek-V3.2 with TP=7, which fails to start

Open · pengjiang80 opened this issue 1 month ago · 1 comment

GPUStack version

v2.0.0

Operating System & CPU Architecture

Ubuntu 22.04

GPU

H200 x 8

▶️ Steps to reproduce

Deploy DeepSeek-V3.2 on a node with 8x H200 using auto scheduling. The scheduler picks TP=7 and the model instance fails to start; logs are below.
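For context, vLLM requires the model's attention-head count to be divisible by the tensor-parallel size. DeepSeek-V3.2 has 128 attention heads, so on an 8-GPU node the only workable TP sizes are 1, 2, 4, and 8; 7 can never pass the check. A minimal Python sketch of the constraint (illustrative only, not GPUStack or vLLM code):

```python
# Valid tensor-parallel sizes are the divisors of the attention-head count
# that do not exceed the number of GPUs available on the node.
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]

# DeepSeek-V3.2 has num_attention_heads=128; on an 8x H200 node:
print(valid_tp_sizes(128, 8))  # [1, 2, 4, 8] -- TP=7 can never satisfy the check
```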

2025-12-02 13:33:57.757875+08:00 - gpustack.worker.backends.base - ERROR - Failed to get pretrained config: The checkpoint you are trying to load has model type `deepseek_v32` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
2025-12-02 13:33:57.757948+08:00 - gpustack.utils.hub - WARNING - The model's config.json does not contain any of the following keys to determine the original maximum length of the model: ['max_position_embeddings', 'n_positions', 'max_seq_len', 'seq_length', 'model_max_length', 'max_target_positions', 'max_sequence_length', 'max_seq_length', 'seq_len']. Assuming the model's maximum length is 2048.
2025-12-02 13:33:57.858626+08:00 - gpustack.worker.backends.vllm - INFO - Creating vLLM container workload: deepseek-v3.2-PjmZS
2025-12-02 13:33:57.858699+08:00 - gpustack.worker.backends.vllm - INFO - With image: gpustack/runner:cuda12.8-vllm0.11.0, arguments: [vllm serve /var/lib/gpustack/cache/huggingface/deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 7 --host 162.243.56.76 --port 40034 --served-model-name deepseek-v3.2], ports: [40034], envs(inconsistent input items mean unchangeable):
2025-12-02 13:33:57.875286+08:00 - 1341 - gpustack_runtime.deployer.docker - WARNING - Mirrored deployment enabled, but no Container name set, using hostname(pool-4-q6pelzbd) instead
2025-12-02 13:33:57.878417+08:00 - 1341 - gpustack_runtime.deployer.docker - INFO - Mirrored deployment enabled, using self Container 9f500aedc9acbfb2309191e7edd1c5467d2aa77177da148a63f7b0617d99731c for options mirroring
2025-12-02 13:33:59.624458+08:00 - gpustack.worker.backends.vllm - INFO - Created vLLM container workload deepseek-v3.2-PjmZS
2025-12-02 13:33:59.624545+08:00 - gpustack.worker.serve_manager - INFO - Finished provisioning model instance deepseek-v3.2-PjmZS
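Before the container logs below: the "Failed to get pretrained config" error above suggests the Transformers release in the worker environment predates the `deepseek_v32` model type, which is likely why GPUStack falls back to an assumed maximum length of 2048. A quick way to check (a sketch, assuming you can run Python inside the worker environment):

```python
# CONFIG_MAPPING maps a checkpoint's model_type to its config class; if the
# installed Transformers release does not know the type, config loading fails.
from transformers import CONFIG_MAPPING

print("deepseek_v32" in CONFIG_MAPPING)  # False on a release that is too old
```

The container logs from the failed vLLM start follow.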
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 12-02 13:34:03 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=7) INFO 12-02 13:34:09 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=7) INFO 12-02 13:34:09 [utils.py:233] non-default args: {'model_tag': '/var/lib/gpustack/cache/huggingface/deepseek-ai/DeepSeek-V3.2', 'host': '162.243.56.76', 'port': 40034, 'model': '/var/lib/gpustack/cache/huggingface/deepseek-ai/DeepSeek-V3.2', 'served_model_name': ['deepseek-v3.2'], 'tensor_parallel_size': 7}
(APIServer pid=7) You are using a model of type deepseek_v32 to instantiate a model of type deepseek_v3. This is not supported for all configurations of models and can yield errors.
(APIServer pid=7) INFO 12-02 13:34:09 [config.py:617] Detected quantization_config.scale_fmt=ue8m0; enabling Hopper UE8M0.
(APIServer pid=7) INFO 12-02 13:34:09 [config.py:388] Replacing legacy 'type' key with 'rope_type'
(APIServer pid=7) INFO 12-02 13:34:13 [model.py:547] Resolved architecture: DeepseekV32ForCausalLM
(APIServer pid=7) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=7) INFO 12-02 13:34:13 [model.py:1510] Using max model len 163840
(APIServer pid=7) INFO 12-02 13:34:14 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=7) INFO 12-02 13:34:14 [config.py:422] Using custom fp8 kv-cache format for DeepSeekV3.2
(APIServer pid=7) Traceback (most recent call last):
(APIServer pid=7)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=7)     sys.exit(main())
(APIServer pid=7)              ^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=7)     args.dispatch_function(args)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 57, in cmd
(APIServer pid=7)     uvloop.run(run_server(args))
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=7)     return __asyncio.run(
(APIServer pid=7)            ^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=7)     return runner.run(main)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=7)     return self._loop.run_until_complete(task)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=7)     return await main
(APIServer pid=7)            ^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=7)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=7)     async with build_async_engine_client(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=7)     return await anext(self.gen)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=7)     async with build_async_engine_client_from_engine_args(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=7)     return await anext(self.gen)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
(APIServer pid=7)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=7)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1431, in create_engine_config
(APIServer pid=7)     config = VllmConfig(
(APIServer pid=7)              ^^^^^^^^^^^
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=7)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=7) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
(APIServer pid=7)   Value error, Total number of attention heads (128) must be divisible by tensor parallel size (7). [type=value_error, input_value=ArgsKwargs((), {'model_co...additional_config': {}}), input_type=ArgsKwargs]
(APIServer pid=7)     For further information visit https://errors.pydantic.dev/2.12/v/value_error
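The ValidationError is the direct consequence: 128 % 7 != 0, so vLLM rejects the configuration before loading any weights. A plausible scheduler-side fix (a hypothetical sketch, not the actual GPUStack scheduler code) is to clamp the candidate GPU count down to the largest head-count divisor instead of passing it through unchanged:

```python
# Hypothetical helper: given the GPU count the scheduler wants to use, return
# the largest tensor-parallel size that vLLM will actually accept.
def pick_tp_size(num_attention_heads: int, candidate_gpus: int) -> int:
    for tp in range(candidate_gpus, 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    return 1

print(pick_tp_size(128, 7))  # 4 -- the largest divisor of 128 that is <= 7
print(pick_tp_size(128, 8))  # 8 -- all eight GPUs can be used
```

As a workaround, manually setting the tensor-parallel size to 8 (or another divisor of 128) in the model's backend parameters should avoid the auto-selected value.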

❌ Actual result

The vLLM container exits during startup with: Value error, Total number of attention heads (128) must be divisible by tensor parallel size (7).

pengjiang80 · Dec 02 '25

Refer to https://github.com/gpustack/gpustack/issues/814

pengjiang80 · Dec 02 '25