[Feature] Is deploying InternVL3 on Ascend 310P supported?
Motivation
Deploy InternVL3 on Ascend 310P.
Related resources
No response
Additional context
No response
It will be supported in the version released after the May 1st holiday.
The current version does support InternVL3-8B.
Hello, could you share the approximate speed and memory usage of InternVL3-8B deployed on the 310P? The speed I am seeing in my tests is not ideal.
We have not yet systematically benchmarked the speed and memory usage of InternVL3-8B on the 310P. Could you share your test results?
I am using a two-card deployment and sending requests through the HTTP interface. The overall time exceeds 1 minute; judging from the log output, it takes more than 20 s from start to finish.
curl -X POST http://127.0.0.1:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "InternVL3-8B",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe the image please" },
        { "type": "image_url", "image_url": { "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" } }
      ]
    }],
    "temperature": 0.8,
    "top_p": 0.8
  }'
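(For anyone reproducing this, here is a minimal Python sketch of the same request with client-side timing; the endpoint, port, and model name are taken from the curl call above, and requests is just an illustrative HTTP client.)

import time

import requests  # third-party HTTP client, used here purely for illustration

payload = {
    "model": "InternVL3-8B",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image please"},
            {"type": "image_url", "image_url": {
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
        ],
    }],
    "temperature": 0.8,
    "top_p": 0.8,
}

t0 = time.perf_counter()
resp = requests.post("http://127.0.0.1:23333/v1/chat/completions", json=payload)
# Reports the full client-observed latency, including vision preprocessing and decoding.
print(f"status={resp.status_code}, end-to-end latency={time.perf_counter() - t0:.2f}s")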
Log output:
root@hostname-fgioq:/home# lmdeploy serve api_server InternVL3-8B --backend pytorch --device ascend --tp 2 --dtype float16 --log-level INFO
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:301: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
2025-05-29 03:43:44,583 - lmdeploy - WARNING - __init__.py:10 - Disable DLSlime Backend
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
2025-05-29 03:43:50,793 - lmdeploy - INFO - builder.py:64 - matching vision model: InternVLVisionModel
2025-05-29 03:43:51,412 - lmdeploy - INFO - internvl.py:90 - using InternVL-Chat-V1-5 vision preprocess
2025-05-29 03:43:51,417 - lmdeploy - INFO - async_engine.py:263 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='float16', tp=2, dp=1, dp_rank=0, ep=1, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='ascend', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None, empty_init=False, enable_microbatch=False, enable_eplb=False, role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>)
2025-05-29 03:43:51,417 - lmdeploy - INFO - async_engine.py:264 - input chat_template_config=None
2025-05-29 03:43:51,432 - lmdeploy - INFO - async_engine.py:273 - updated chat_template_onfig=ChatTemplateConfig(model_name='internvl2_5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-05-29 03:43:52,003 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.49.0], but found version: 4.52.3
2025-05-29 03:43:52,033 - lmdeploy - INFO - __init__.py:18 - device=ascend does not support ray. distributed_executor_backend=mp.
2025-05-29 03:43:52,033 - lmdeploy - INFO - __init__.py:87 - Build
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
2025-05-29 03:44:03,156 - lmdeploy - WARNING - __init__.py:10 - Disable DLSlime Backend
2025-05-29 03:44:05,800 - lmdeploy - INFO - base.py:176 - Building Model.
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:301: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
[rank0]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank1]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
Loading weights from safetensors: 100%|██████████| 4/4 [00:06<00:00, 1.58s/it]
2025-05-29 03:44:37,970 - lmdeploy - INFO - base.py:178 - Updating configs.
2025-05-29 03:44:38,049 - lmdeploy - INFO - base.py:180 - Building GraphRunner and warmup ops, please waiting.
/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/ascend/graph_runner.py:65: RuntimeWarning:
Graph mode is an experimental feature. We currently support both dense and Mixture of Experts (MoE) models with bf16 and fp16 data types. If graph mode does not function correctly with your model, please consider using eager mode as an alternative.
warnings.warn(
/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/ascend/graph_runner.py:65: RuntimeWarning:
Graph mode is an experimental feature. We currently support both dense and Mixture of Experts (MoE) models with bf16 and fp16 data types. If graph mode does not function correctly with your model, please consider using eager mode as an alternative.
warnings.warn(
2025-05-29 03:44:38,055 - lmdeploy - INFO - base.py:182 - Building CacheEngine with config:
CacheConfig(max_batches=256, block_size=64, num_cpu_blocks=0, num_gpu_blocks=11385, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=8192, enable_prefix_caching=False, quant_policy=0, device_type='ascend', role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>).
2025-05-29 03:44:38,187 - lmdeploy - INFO - base.py:184 - Warming up model.
2025-05-29 03:44:38,193 - lmdeploy - INFO - async_engine.py:287 - updated backend_config=PytorchEngineConfig(dtype='float16', tp=2, dp=1, dp_rank=0, ep=1, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='ascend', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None, empty_init=False, enable_microbatch=False, enable_eplb=False, role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>)
2025-05-29 03:44:38,733 - lmdeploy - WARNING - tokenizer.py:499 - The token <|action_end|>, its length of indexes [27, 91, 1311, 6213, 91, 29] is over than 1. Currently, it can not be used as stop words
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [24]
INFO: Waiting for application startup.
2025-05-29 03:44:38,770 - lmdeploy - INFO - engine.py:1102 - Starting executor.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1106 - Starting async task MainLoopPreprocessMessage.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1113 - Starting async task MainLoopResponse.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1118 - Starting async task MigrationLoop.
2025-05-29 03:44:38,772 - lmdeploy - INFO - engine.py:1131 - Starting async task MainLoop.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
2025-05-29 03:47:03,027 - lmdeploy - INFO - logger.py:45 - session=1, adapter_name=None, input_tokens=1844, gen_config=GenerationConfig(n=1, max_new_tokens=None, do_sample=True, top_p=0.8, top_k=40, min_p=0.0, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=3188887793734898066, stop_words=None, bad_words=None, stop_token_ids=[151645], bad_token_ids=None, min_new_tokens=None, skip_special_tokens=True, spaces_between_special_tokens=True, logprobs=None, response_format=None, logits_processors=None, output_logits=None, output_last_hidden_state=None, with_cache=False, preserve_cache=False, migration_request=None), prompt='<|im_start|>system\n你是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。<|im_end|>\n<|im_start|>user\n<IMAGE_TOKEN>\nDescribe the image please<|im_end|>\n<|im_start|>assistant\n', prompt_token_id=[151644, 8948, 198, 105043, 90286, 21287, 13935, 116669, 3837, 105205, 13072, 20412, 67916, 30698, 3837, 104625, 100633, 104455, 104800, 5373, 109065, 81217, 104581, 99721, 75317, 101101, 100013, 9370, 42140, 53772, 35243, 26288, 102064, 104949, 1773, 151645, 198, 151644, 872, 198, 151665, 151643, ..., 151643, 151666, 198, 74785, 279, 2168, 4486, 151645, 198, 151644, 77091, 198]
(the long run of repeated 151643 image-placeholder tokens is elided above for readability)
2025-05-29 03:47:03,027 - lmdeploy - INFO - async_engine.py:684 - session=1, history_tokens=0, input_tokens=1844, max_new_tokens=None, seq_start=True, seq_end=True, step=0, prep=True
2025-05-29 03:47:03,028 - lmdeploy - INFO - request.py:296 - Receive ADD_SESSION Request: 1
2025-05-29 03:47:03,028 - lmdeploy - INFO - request.py:296 - Receive ADD_MESSAGE Request: 1
2025-05-29 03:47:03,039 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=True, enable_empty=False
2025-05-29 03:47:03,045 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1844, batch_size=1, is_decoding=False, has_vision=True
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
2025-05-29 03:47:03,094 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1844
2025-05-29 03:47:03,097 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1844
/opt/lmdeploy/lmdeploy/pytorch/engine/logits_process.py:333: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
stop_mask = torch.where(self.ignore_eos[:, None], stop_mask, False)
.('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
2025-05-29 03:47:33,787 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:47:33,789 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:47:33,798 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:47:33,798 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:15,912 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:15,914 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:15,922 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:15,922 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:17,745 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:17,746 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:17,756 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:17,756 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:19,577 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:19,579 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:19,587 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:19,588 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:21,833 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:21,835 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:21,844 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:21,846 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:23,493 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:23,495 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:23,503 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:23,503 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:24,896 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:24,898 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:24,907 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:24,907 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:25,788 - lmdeploy - INFO - async_engine.py:799 - session 1 finished, reason "stop", input_tokens 1844, outupt_tokens 101
INFO: 127.0.0.1:45682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-05-29 03:48:25,791 - lmdeploy - INFO - request.py:296 - Receive END_SESSION Request: 1
The environment is the image built following the documentation; details are as follows:
docker run -e ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -t lmdeploy-aarch64-ascend:latest lmdeploy check_env
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
sys.platform: linux
Python: 3.10.5 (main, May 28 2025, 01:44:16) [GCC 9.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.3.1
PyTorch compiling details: PyTorch built with:
- GCC 10.2
- C++ Version: 201703
- Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: NO AVX
- Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=2.3.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.18.1
LMDeploy: 0.8.0+51e5dbf
transformers: 4.52.3
gradio: Not Found
fastapi: 0.115.12
pydantic: 2.11.5
triton: Not Found
Host machine environment:
lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
L1d: 6 MiB (96 instances)
L1i: 6 MiB (96 instances)
L2: 48 MiB (96 instances)
L3: 96 MiB (4 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
npu-smi info
The 310P runs in graph mode and needs warmup (the first few requests may include a compilation pass). Please try several requests back to back and look at the steady-state speed; see the sketch below. @JackWeiw let's also run a stress test on our side to check the rate.
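(A minimal sketch of such a repeated test, assuming the server from the log above is still listening on 127.0.0.1:23333; a text-only prompt keeps the runs comparable.)

import time

import requests  # illustrative HTTP client only

payload = {"model": "InternVL3-8B",
           "messages": [{"role": "user", "content": "Describe the image please"}]}
# The first run pays the graph compilation/warmup cost; later runs reflect
# steady-state latency, which is the number worth comparing.
for i in range(5):
    t0 = time.perf_counter()
    requests.post("http://127.0.0.1:23333/v1/chat/completions", json=payload)
    print(f"run {i + 1}: {time.perf_counter() - t0:.2f}s")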
Currently, initialization on the 310P calls the Transdata operator on the language model's weights to convert the tensors from ND format to NZ format (the underlying ATB operators on the 310P all require the A and B tensors of a Linear in NZ format, so converting the weights to NZ at init avoids repeated ND-to-NZ conversions during decoding). On top of that, graph mode needs warmup, so the first inference response is a bit slow, but subsequent responses get faster. Sharing the speeds I have measured so far: Qwen2.5-7B on 2 cards
Qwen3-32B on 4 cards
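(To make the ND-to-NZ point above concrete: torch_npu exposes this kind of format cast via npu_format_cast. A minimal sketch; treating the format id 29 as ACL_FORMAT_FRACTAL_NZ and this call pattern are assumptions for illustration, not LMDeploy's exact Transdata path.)

import torch
import torch_npu  # Ascend adapter for PyTorch

# Illustrative ND -> NZ cast of a Linear weight at load time, so that decode
# steps do not pay a repeated transdata. 29 is assumed to be ACL_FORMAT_FRACTAL_NZ.
weight = torch.randn(4096, 4096, dtype=torch.float16).npu()
weight_nz = torch_npu.npu_format_cast(weight, 29)
print(torch_npu.get_npu_format(weight_nz))  # should report the NZ format id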
We are currently working on using Ray for multi-card inference on the 310P, which should effectively resolve the hangs that can occur while serving inference on the 310P.
BTW, for 310P devices it is recommended to set block_size to 128!
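(A minimal sketch of what that looks like with the Python API; the field names mirror the PytorchEngineConfig shown in the server log above, so take this as an illustration rather than the exact serving setup.)

from lmdeploy import pipeline, PytorchEngineConfig

# Same engine options as the log above, but with the 310P-recommended
# block_size of 128 instead of the default 64.
backend_config = PytorchEngineConfig(
    device_type='ascend',
    tp=2,
    dtype='float16',
    block_size=128,
)
pipe = pipeline('InternVL3-8B', backend_config=backend_config)
print(pipe('Hello'))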