[Feature] Is deploying InternVL3 on Ascend 310P supported?
Motivation
Deploy InternVL3 on Ascend 310P.
Related resources
No response
Additional context
No response
It will be supported in the version released after the May 1st holiday.
The current version does support InternVL3-8B.
Hello, could you share the approximate speed and memory usage of InternVL3-8B deployed on the 310P? The speed I am seeing in my tests is not ideal.
We have not yet systematically benchmarked the speed and memory usage of InternVL3-8B on the 310P. Could you share your test results?
I am using a two-card deployment and sending requests through the HTTP interface. The overall time exceeds 1 minute; judging from the log output, it takes more than 20 s from start to finish.
curl -X POST http://127.0.0.1:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "InternVL3-8B",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe the image please" },
        { "type": "image_url", "image_url": { "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" } }
      ]
    }],
    "temperature": 0.8,
    "top_p": 0.8
  }'
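(For anyone reproducing this, here is a minimal Python sketch of the same request with client-side timing; the endpoint, port, and model name are taken from the curl call above, and requests is just an illustrative HTTP client.)

import time

import requests  # third-party HTTP client, used here purely for illustration

payload = {
    "model": "InternVL3-8B",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image please"},
            {"type": "image_url", "image_url": {
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
        ],
    }],
    "temperature": 0.8,
    "top_p": 0.8,
}

t0 = time.perf_counter()
resp = requests.post("http://127.0.0.1:23333/v1/chat/completions", json=payload)
# Reports the full client-observed latency, including vision preprocessing and decoding.
print(f"status={resp.status_code}, end-to-end latency={time.perf_counter() - t0:.2f}s")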
Log output:
root@hostname-fgioq:/home# lmdeploy serve api_server InternVL3-8B --backend pytorch --device ascend --tp 2 --dtype float16 --log-level INFO
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:301: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
2025-05-29 03:43:44,583 - lmdeploy - WARNING - __init__.py:10 - Disable DLSlime Backend
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
2025-05-29 03:43:50,793 - lmdeploy - INFO - builder.py:64 - matching vision model: InternVLVisionModel
2025-05-29 03:43:51,412 - lmdeploy - INFO - internvl.py:90 - using InternVL-Chat-V1-5 vision preprocess
2025-05-29 03:43:51,417 - lmdeploy - INFO - async_engine.py:263 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='float16', tp=2, dp=1, dp_rank=0, ep=1, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='ascend', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None, empty_init=False, enable_microbatch=False, enable_eplb=False, role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>)
2025-05-29 03:43:51,417 - lmdeploy - INFO - async_engine.py:264 - input chat_template_config=None
2025-05-29 03:43:51,432 - lmdeploy - INFO - async_engine.py:273 - updated chat_template_onfig=ChatTemplateConfig(model_name='internvl2_5', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-05-29 03:43:52,003 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.49.0], but found version: 4.52.3
2025-05-29 03:43:52,033 - lmdeploy - INFO - __init__.py:18 - device=ascend does not support ray. distributed_executor_backend=mp.
2025-05-29 03:43:52,033 - lmdeploy - INFO - __init__.py:87 - Build
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
2025-05-29 03:44:03,156 - lmdeploy - WARNING - __init__.py:10 - Disable DLSlime Backend
2025-05-29 03:44:05,800 - lmdeploy - INFO - base.py:176 - Building Model.
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:301: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:260: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
[rank0]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank1]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
Loading weights from safetensors: 100%|██████████| 4/4 [00:06<00:00, 1.58s/it]
2025-05-29 03:44:37,970 - lmdeploy - INFO - base.py:178 - Updating configs.
2025-05-29 03:44:38,049 - lmdeploy - INFO - base.py:180 - Building GraphRunner and warmup ops, please waiting.
/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/ascend/graph_runner.py:65: RuntimeWarning:
Graph mode is an experimental feature. We currently support both dense and Mixture of Experts (MoE) models with bf16 and fp16 data types. If graph mode does not function correctly with your model, please consider using eager mode as an alternative.
warnings.warn(
/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/ascend/graph_runner.py:65: RuntimeWarning:
Graph mode is an experimental feature. We currently support both dense and Mixture of Experts (MoE) models with bf16 and fp16 data types. If graph mode does not function correctly with your model, please consider using eager mode as an alternative.
warnings.warn(
2025-05-29 03:44:38,055 - lmdeploy - INFO - base.py:182 - Building CacheEngine with config:
CacheConfig(max_batches=256, block_size=64, num_cpu_blocks=0, num_gpu_blocks=11385, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=8192, enable_prefix_caching=False, quant_policy=0, device_type='ascend', role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>).
2025-05-29 03:44:38,187 - lmdeploy - INFO - base.py:184 - Warming up model.
2025-05-29 03:44:38,193 - lmdeploy - INFO - async_engine.py:287 - updated backend_config=PytorchEngineConfig(dtype='float16', tp=2, dp=1, dp_rank=0, ep=1, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='ascend', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None, empty_init=False, enable_microbatch=False, enable_eplb=False, role=<EngineRole.Hybrid: 1>, migration_backend=<MigrationBackend.DLSlime: 1>)
2025-05-29 03:44:38,733 - lmdeploy - WARNING - tokenizer.py:499 - The token <|action_end|>, its length of indexes [27, 91, 1311, 6213, 91, 29] is over than 1. Currently, it can not be used as stop words
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [24]
INFO: Waiting for application startup.
2025-05-29 03:44:38,770 - lmdeploy - INFO - engine.py:1102 - Starting executor.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1106 - Starting async task MainLoopPreprocessMessage.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1113 - Starting async task MainLoopResponse.
2025-05-29 03:44:38,771 - lmdeploy - INFO - engine.py:1118 - Starting async task MigrationLoop.
2025-05-29 03:44:38,772 - lmdeploy - INFO - engine.py:1131 - Starting async task MainLoop.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
2025-05-29 03:47:03,027 - lmdeploy - INFO - logger.py:45 - session=1, adapter_name=None, input_tokens=1844, gen_config=GenerationConfig(n=1, max_new_tokens=None, do_sample=True, top_p=0.8, top_k=40, min_p=0.0, temperature=0.8, repetition_penalty=1.0, ignore_eos=False, random_seed=3188887793734898066, stop_words=None, bad_words=None, stop_token_ids=[151645], bad_token_ids=None, min_new_tokens=None, skip_special_tokens=True, spaces_between_special_tokens=True, logprobs=None, response_format=None, logits_processors=None, output_logits=None, output_last_hidden_state=None, with_cache=False, preserve_cache=False, migration_request=None), prompt='<|im_start|>system\n你是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。<|im_end|>\n<|im_start|>user\n<IMAGE_TOKEN>\nDescribe the image please<|im_end|>\n<|im_start|>assistant\n', prompt_token_id=[151644, 8948, 198, 105043, 90286, 21287, 13935, 116669, 3837, 105205, 13072, 20412, 67916, 30698, 3837, 104625, 100633, 104455, 104800, 5373, 109065, 81217, 104581, 99721, 75317, 101101, 100013, 9370, 42140, 53772, 35243, 26288, 102064, 104949, 1773, 151645, 198, 151644, 872, 198, 151665, 151643, ..., 151643, 151666, 198, 74785, 279, 2168, 4486, 151645, 198, 151644, 77091, 198]
(the long run of repeated 151643 image-placeholder tokens is elided above for readability)
2025-05-29 03:47:03,027 - lmdeploy - INFO - async_engine.py:684 - session=1, history_tokens=0, input_tokens=1844, max_new_tokens=None, seq_start=True, seq_end=True, step=0, prep=True
2025-05-29 03:47:03,028 - lmdeploy - INFO - request.py:296 - Receive ADD_SESSION Request: 1
2025-05-29 03:47:03,028 - lmdeploy - INFO - request.py:296 - Receive ADD_MESSAGE Request: 1
2025-05-29 03:47:03,039 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=True, enable_empty=False
2025-05-29 03:47:03,045 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1844, batch_size=1, is_decoding=False, has_vision=True
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
2025-05-29 03:47:03,094 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1844
2025-05-29 03:47:03,097 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1844
/opt/lmdeploy/lmdeploy/pytorch/engine/logits_process.py:333: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:74.)
stop_mask = torch.where(self.ignore_eos[:, None], stop_mask, False)
.('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
2025-05-29 03:47:33,787 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:47:33,789 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:47:33,798 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:47:33,798 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:15,912 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:15,914 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:15,922 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:15,922 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:17,745 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:17,746 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:17,756 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:17,756 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:19,577 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:19,579 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:19,587 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:19,588 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:21,833 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:21,835 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:21,844 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:21,846 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:23,493 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:23,495 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:23,503 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:23,503 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:24,896 - lmdeploy - INFO - engine.py:872 - Make forward inputs with prefill=False, enable_empty=False
2025-05-29 03:48:24,898 - lmdeploy - INFO - engine.py:237 - Sending forward inputs: num_tokens=1, batch_size=1, is_decoding=True, has_vision=False
2025-05-29 03:48:24,907 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[0]: batch_size=1 num_tokens=1
2025-05-29 03:48:24,907 - lmdeploy - INFO - ascend.py:101 - <ForwardTask> rank[1]: batch_size=1 num_tokens=1
2025-05-29 03:48:25,788 - lmdeploy - INFO - async_engine.py:799 - session 1 finished, reason "stop", input_tokens 1844, outupt_tokens 101
INFO: 127.0.0.1:45682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-05-29 03:48:25,791 - lmdeploy - INFO - request.py:296 - Receive END_SESSION Request: 1
The environment is the image built following the documentation; details are as follows:
docker run -e ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -t lmdeploy-aarch64-ascend:latest lmdeploy check_env
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True. (function operator())
sys.platform: linux
Python: 3.10.5 (main, May 28 2025, 01:44:16) [GCC 9.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.3.1
PyTorch compiling details: PyTorch built with:
- GCC 10.2
- C++ Version: 201703
- Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: NO AVX
- Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=2.3.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.18.1
LMDeploy: 0.8.0+51e5dbf
transformers: 4.52.3
gradio: Not Found
fastapi: 0.115.12
pydantic: 2.11.5
triton: Not Found
Host machine environment:
lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
Stepping: 0x1
Frequency boost: disabled
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
L1d: 6 MiB (96 instances)
L1i: 6 MiB (96 instances)
L2: 48 MiB (96 instances)
L3: 96 MiB (4 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Not affected
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
npu-smi info
The 310P runs in graph mode and needs warmup (the first few requests may include a compilation pass). Please try several requests back to back and look at the steady-state speed; see the sketch below. @JackWeiw let's also run a stress test on our side to check the rate.
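(A minimal sketch of such a repeated test, assuming the server from the log above is still listening on 127.0.0.1:23333; a text-only prompt keeps the runs comparable.)

import time

import requests  # illustrative HTTP client only

payload = {"model": "InternVL3-8B",
           "messages": [{"role": "user", "content": "Describe the image please"}]}
# The first run pays the graph compilation/warmup cost; later runs reflect
# steady-state latency, which is the number worth comparing.
for i in range(5):
    t0 = time.perf_counter()
    requests.post("http://127.0.0.1:23333/v1/chat/completions", json=payload)
    print(f"run {i + 1}: {time.perf_counter() - t0:.2f}s")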
Currently, initialization on the 310P calls the Transdata operator on the language model's weights to convert the tensors from ND format to NZ format (the underlying ATB operators on the 310P all require the A and B tensors of a Linear in NZ format, so converting the weights to NZ at init avoids repeated ND-to-NZ conversions during decoding). On top of that, graph mode needs warmup, so the first inference response is a bit slow, but subsequent responses get faster. Sharing the speeds I have measured so far: Qwen2.5-7B on 2 cards
Qwen3-32B on 4 cards
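(To make the ND-to-NZ point above concrete: torch_npu exposes this kind of format cast via npu_format_cast. A minimal sketch; treating the format id 29 as ACL_FORMAT_FRACTAL_NZ and this call pattern are assumptions for illustration, not LMDeploy's exact Transdata path.)

import torch
import torch_npu  # Ascend adapter for PyTorch

# Illustrative ND -> NZ cast of a Linear weight at load time, so that decode
# steps do not pay a repeated transdata. 29 is assumed to be ACL_FORMAT_FRACTAL_NZ.
weight = torch.randn(4096, 4096, dtype=torch.float16).npu()
weight_nz = torch_npu.npu_format_cast(weight, 29)
print(torch_npu.get_npu_format(weight_nz))  # should report the NZ format id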
We are currently working on using Ray for multi-card inference on the 310P, which should effectively resolve the hangs that can occur while serving inference on the 310P.
BTW, for 310P devices it is recommended to set block_size to 128!
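(A minimal sketch of what that looks like with the Python API; the field names mirror the PytorchEngineConfig shown in the server log above, so take this as an illustration rather than the exact serving setup.)

from lmdeploy import pipeline, PytorchEngineConfig

# Same engine options as the log above, but with the 310P-recommended
# block_size of 128 instead of the default 64.
backend_config = PytorchEngineConfig(
    device_type='ascend',
    tp=2,
    dtype='float16',
    block_size=128,
)
pipe = pipeline('InternVL3-8B', backend_config=backend_config)
print(pipe('Hello'))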