With quantization set to none and a specific GPU index selected, the model fails to start and always reports insufficient memory on GPU 0.
My machine has 8 GPUs; cards 4 and 5 are idle while the others are busy. The model format is pytorch, and according to the official docs the vLLM engine is only used when the quantization type is none. So I set quantization to none and pinned the model to the idle GPUs, but it fails to start.
Steps and results:
[!IMPORTANT] Note: GPU 0 is out of memory, while GPU 4 and GPU 5 are idle. Model: qwen1.5-chat, size: 14B, format: pytorch.
Step 1: If I select 4-bit or 8-bit quantization and assign GPU 4, the model starts normally.
Step 2: If I select quantization none and assign GPU 4, it reports insufficient memory on GPU 0 (a sketch of this launch call follows the list).
Step 3: If I select quantization none and assign both GPU 4 and GPU 5, it starts normally and the vLLM engine loads.
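For reference, a hedged sketch of how the step-2 launch could be issued through the Python client. The parameter names, in particular `gpu_idx`, are assumptions inferred from the worker log below, not verified signatures.

```python
from xinference.client import Client

# Hedged sketch of the step-2 launch; `gpu_idx` is assumed from the worker
# log message "on GPU index: [4]", not a confirmed parameter name.
client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_format="pytorch",
    model_size_in_billions=14,
    quantization="none",
    gpu_idx=[4],  # assumed: pin the model to GPU 4 only
)
print(model_uid)
```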
Error log:
2024-04-25 01:35:03,581 xinference.core.worker 95 INFO You specify to launch the model: qwen1.5-chat on GPU index: [4] of the worker: 0.0.0.0:36158, xinference will automatically ignore the `n_gpu` option.
2024-04-25 01:35:07,395 xinference.model.llm.llm_family 95 INFO Caching from Modelscope: qwen/Qwen1.5-14B-Chat
2024-04-25 01:35:07,395 xinference.model.llm.llm_family 95 INFO Cache /root/.xinference/cache/qwen1.5-chat-pytorch-14b exists
2024-04-25 01:35:07,405 xinference.model.llm.vllm.core 23146 INFO Loading qwen1.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
INFO 04-25 01:35:07 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/root/.xinference/cache/qwen1.5-chat-pytorch-14b', tokenizer='/root/.xinference/cache/qwen1.5-chat-pytorch-14b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-25 01:35:08 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-25 01:35:08 selector.py:25] Using XFormers backend.
2024-04-25 01:35:09,629 xinference.core.worker 95 ERROR Failed to load model qwen1.5-chat-1-0
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 697, in launch_builtin_model
await model_ref.load()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
self._model.load()
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 178, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
engine = cls(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
return engine_class(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
self.model_executor = executor_class(model_config, cache_config,
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 37, in __init__
self._init_worker()
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 66, in _init_worker
self.driver_worker.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 107, in load_model
self.model_runner.load_model()
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 95, in load_model
self.model = get_model(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 81, in get_model
model = model_class(model_config.hf_config, linear_method,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 298, in __init__
self.model = Qwen2Model(config, linear_method)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 237, in __init__
self.layers = nn.ModuleList([
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 238, in <listcomp>
Qwen2DecoderLayer(config, layer_idx, linear_method)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 172, in __init__
self.self_attn = Qwen2Attention(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 116, in __init__
self.qkv_proj = QKVParallelLinear(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 386, in __init__
super().__init__(input_size, output_size, bias, False, skip_bias_add,
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 181, in __init__
self.linear_weights = self.linear_method.create_weights(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 63, in create_weights
weight = Parameter(torch.empty(output_size_per_partition,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: [address=0.0.0.0:37477, pid=23146] CUDA out of memory. Tried to allocate 150.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 22.69 MiB is free. Process 1555924 has 23.67 GiB memory in use. Of the allocated memory 23.21 GiB is allocated by PyTorch, and 12.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
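The OOM is reported for GPU 0 even though index 4 was requested. As a quick diagnostic (a minimal sketch, assuming torch is importable in the same conda environment as the worker), printing per-device free memory from that process shows which physical card each CUDA index actually maps to:

```python
import torch

# Print free/total memory for every CUDA device as this process sees it,
# to check which physical card "GPU 0" really is from its point of view.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)}: "
          f"{free / 1024**3:.2f} GiB free / {total / 1024**3:.2f} GiB total")
```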
According to some user feedback, this problem is most likely caused by the CUDA driver, which makes the actual CUDA device ordering fail to match the expected indices.
One user reported that, after checking their environment, they found nvidia-cuda-toolkit was not installed; after installing it and recreating the conda environment, the problem was resolved.
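If the mismatch really is an ordering problem, a commonly suggested workaround (a sketch under the assumption that the worker process inherits these variables; not a confirmed fix for this issue) is to force CUDA to enumerate devices in the same order as nvidia-smi and to expose only the idle cards before starting the worker:

```python
import os

# Assumed workaround: align CUDA's device numbering with nvidia-smi, then
# expose only the idle cards to the process that launches the worker.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order devices by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5"       # only GPUs 4 and 5 are visible
# Start (or restart) the worker after these variables are set, e.g. by
# exporting the same variables in the shell before running `xinference-worker`.
```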
Why is the vLLM engine only started when quantization is none? What is the logic behind that?
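For illustration only, the behavior described above (a pytorch-format model is routed to vLLM only when quantization is none; a 4-bit or 8-bit choice goes to a different backend) can be pictured as a check like the following. This is not xinference's actual dispatch code, and the fallback backend name is a placeholder:

```python
# Illustrative sketch only -- not xinference's real dispatch logic.
def pick_backend(model_format: str, quantization: str) -> str:
    # vLLM serves the full-precision weights, so the pytorch format is
    # routed to it only when no quantization is requested.
    if model_format == "pytorch" and quantization in (None, "none"):
        return "vllm"
    return "transformers"  # placeholder name for the quantized fallback backend
```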
I ran into the same problem before, except in my case GPU 0 was used no matter whether quantization was selected or not. Without assigning a GPU, the second model I deployed also automatically kept using GPU 0.
nvidia-cuda-toolkit is installed.
Recreating a new conda virtual environment temporarily fixed it, but today, after adding a qwen-110b model configuration to the xinference package in the new environment, the same problem appeared again. The operations I performed, in order:
1. Modified the xinference/llm_family_modelscope.json file inside the conda environment;
2. Shut down the two previously running models;
3. Shut down the xinference worker;
4. Shut down the xinference supervisor;
5. Set the download source to modelscope (see the sketch after this list);
6. Restarted the supervisor and worker;
7. Redeployed the previous two models, and got the same error.
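Regarding step 5, a hedged sketch of setting the download source before restarting the supervisor and worker, assuming XINFERENCE_MODEL_SRC is the environment variable xinference reads to choose the model hub:

```python
import os

# Assumed variable name: XINFERENCE_MODEL_SRC selects the download hub.
# Set it in the environment that launches the supervisor and worker so the
# qwen weights are pulled from ModelScope instead of Hugging Face.
os.environ["XINFERENCE_MODEL_SRC"] = "modelscope"
```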