Wei Tao

10 comments of Wei Tao

> Hi team, I was wondering if we have any update on this issue? Hello, do you have any idea about the performance degradation? I have tested the performance of...

@gedoensmax Sir, one thing I am confused about is: if I install onnxruntime via `pip install onnxruntime-gpu==1.17`, will the onnxruntime package be the optimal one (I mean, will it match...

> The default 1.17 shipment is with CUDA 11. To install onnxruntime with CUDA 12 there is a separate package. https://onnxruntime.ai/docs/install/#install-onnx-runtime-gpu-cuda-11x OK, thank you very much. Can you please take...
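For anyone with the same question, here is a minimal sanity check, assuming `onnxruntime-gpu` is already installed; the pip commands in the comments follow the install page linked above (the CUDA 12 index URL is taken from those docs for the 1.17 era and may change between releases):

```python
# Quick check that the installed onnxruntime wheel can actually see the GPU.
# Install commands (per the install page linked above; may change per release):
#   pip install onnxruntime-gpu==1.17.0                                              # CUDA 11 build
#   pip install onnxruntime-gpu --extra-index-url https://aka.ms/onnxruntime-cuda-12 # CUDA 12 build
import onnxruntime as ort

print(ort.__version__)
# 'CUDAExecutionProvider' should be listed if the wheel matches the local CUDA install;
# if only 'CPUExecutionProvider' shows up, the CUDA versions likely mismatch.
print(ort.get_available_providers())
```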

> > ascend上多卡卡死的问题还是没有彻底解决。 [#3513](https://github.com/InternLM/lmdeploy/pull/3513) 修复了图模式的bug。但是多卡卡死对于eager模式和图模式都仍然存在。 我在cann8.1.beta1的环境下,测试了qwen2.5-3b模型,对于eager模式和图模式,都会大概率会卡死。单卡则eager模式和图模式都正常。 > > python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2 > > 初步看来可以用ray启动来解决这个问题,我们也还在进一步压测,大家可以试试 单机多卡 > > 1. 启动ray > export...
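A minimal sketch of the ray-based workaround described above, using lmdeploy's Python API instead of the CLI; the executor env var comes from the comments here, and the model path and tp value are placeholders mirroring the quoted command (start ray on the node first, e.g. `ray start --head`):

```python
import os

# Assumption: selecting the ray executor backend works around the multi-card
# hang, as suggested above; env var name taken from the comments in this thread.
os.environ["LMDEPLOY_EXECUTOR_BACKEND"] = "ray"

from lmdeploy import pipeline, PytorchEngineConfig

# Placeholder model path; mirrors the CLI command quoted above (Ascend, tp=2).
pipe = pipeline("qwen2.5-3b",
                backend_config=PytorchEngineConfig(device_type="ascend", tp=2))
print(pipe(["Hello, world"]))
```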

> [@JackWeiw](https://github.com/JackWeiw) Following your method, I tested in a 310P single-node multi-card environment; results as follows: > > Environment: Ascend 300V Pro, dual cards; CANN 8.1.RC1; dlinfer after [DeepLink-org/dlinfer#219](https://github.com/DeepLink-org/dlinfer/pull/219), with the patches from [DeepLink-org/dlinfer#225](https://github.com/DeepLink-org/dlinfer/pull/225) and #227 applied; latest lmdeploy > > export LMDEPLOY_EXECUTOR_BACKEND=ray export ASCEND_RANK_TABLE_FILE_PATH=ranktable.json python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device...

The current version does support InternVL3-8B.

We have not yet systematically tested the speed and memory usage of internvl3-8b deployed on 310P. Could you share your test results?

Currently, initialization on 310P has to run the Transdata operator over the language-model weights to convert the tensors from ND format to NZ format (the underlying ATB operators on 310P require both the A and B tensors of a Linear to be in NZ format, so converting the weights to NZ at initialization avoids repeated ND-to-NZ conversions during decoding). Graph mode also needs a warm-up, so the first inference response is a bit slow, but subsequent responses get faster.

Sharing the speeds I have tested so far: Qwen2.5-7B on 2 cards ![Image](https://github.com/user-attachments/assets/6692f098-0906-47e4-8871-9a15bf906539) and Qwen3-32B on 4 cards ![Image](https://github.com/user-attachments/assets/079db447-6e43-4a53-8ec7-324a23061e4e)

We are currently trying Ray for 310P multi-card inference, which should effectively resolve the hangs that can occur while serving inference on 310P. BTW, for 310P devices it is recommended to set block_size to 128!
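For reference, a minimal sketch of applying that block_size recommendation through lmdeploy's Python API; the model path and tp value are placeholders matching the 2-card Qwen2.5-7B test above:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# block_size=128 follows the 310P recommendation above; other values are placeholders.
backend_config = PytorchEngineConfig(
    device_type="ascend",
    tp=2,            # two 310P cards, as in the Qwen2.5-7B test
    block_size=128,  # paged KV-cache block size recommended for 310P
)
pipe = pipeline("Qwen2.5-7B-Instruct", backend_config=backend_config)
print(pipe(["Hello"]))
```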

I updated my script following the examples in the DISC torch inference docs, but another problem occurred: ![捕获](https://github.com/alibaba/BladeDISC/assets/126441921/3242cfa5-fbfb-4bb7-8a40-48df8a4d09a4) Your kind help is much appreciated!!! @Yancey1989 @eedalong

I passed the half-precision model to blade_disc; however, the optimized model saved by blade_disc is fp32. How come? ![image](https://github.com/alibaba/BladeDISC/assets/126441921/a80f7a3d-e002-4f36-a50a-91f753e27ff5)
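In case it helps others, a hedged sketch of explicitly requesting fp16 output via torch_blade's config. This assumes the `enable_fp16` flag is what controls the precision of the optimized graph (i.e. passing a half-precision model alone may not be enough); the resnet18 model and input shape are placeholders:

```python
import torch
import torch_blade
import torchvision.models as models  # placeholder model source

model = models.resnet18().cuda().eval()       # placeholder model
example = torch.randn(1, 3, 224, 224).cuda()  # placeholder input

# Assumption: enable_fp16 asks BladeDISC to emit an fp16-optimized graph,
# instead of relying on the dtype of the weights passed in.
cfg = torch_blade.config.Config()
cfg.enable_fp16 = True
with torch.no_grad(), cfg:
    opt_model = torch_blade.optimize(model, allow_tracing=True,
                                     model_inputs=(example,))
torch.jit.save(opt_model, "opt_fp16.pt")
```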