Wei Tao

10 comments of Wei Tao

> Hi team, I was wondering if we have any update on this issue? Hello, do you have any idea about the performance degradation? I have tested the performance of...

@gedoensmax Sir, one thing I am confused about is: if I install onnxruntime via `pip install onnxruntime-gpu==1.17`, will the onnxruntime package be the optimal one (I mean, will it match...

> The default 1.17 shipment is with CUDA 11. To install onnxruntime with CUDA 12 there is a separate package. https://onnxruntime.ai/docs/install/#install-onnx-runtime-gpu-cuda-11x OK, thank you very much. Can you please take...
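For anyone with the same question, here is a minimal sanity check, assuming `onnxruntime-gpu` is already installed; the pip commands in the comments follow the install page linked above (the CUDA 12 index URL is taken from those docs for the 1.17 era and may change between releases):

```python
# Quick check that the installed onnxruntime wheel can actually see the GPU.
# Install commands (per the install page linked above; may change per release):
#   pip install onnxruntime-gpu==1.17.0                                              # CUDA 11 build
#   pip install onnxruntime-gpu --extra-index-url https://aka.ms/onnxruntime-cuda-12 # CUDA 12 build
import onnxruntime as ort

print(ort.__version__)
# 'CUDAExecutionProvider' should be listed if the wheel matches the local CUDA install;
# if only 'CPUExecutionProvider' shows up, the CUDA versions likely mismatch.
print(ort.get_available_providers())
```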

> > ascend上多卡卡死的问题还是没有彻底解决。 [#3513](https://github.com/InternLM/lmdeploy/pull/3513) 修复了图模式的bug。但是多卡卡死对于eager模式和图模式都仍然存在。 我在cann8.1.beta1的环境下,测试了qwen2.5-3b模型,对于eager模式和图模式,都会大概率会卡死。单卡则eager模式和图模式都正常。 > > python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2 > > 初步看来可以用ray启动来解决这个问题,我们也还在进一步压测,大家可以试试 单机多卡 > > 1. 启动ray > export...
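A minimal sketch of the ray-based workaround described above, using lmdeploy's Python API instead of the CLI; the executor env var comes from the comments here, and the model path and tp value are placeholders mirroring the quoted command (start ray on the node first, e.g. `ray start --head`):

```python
import os

# Assumption: selecting the ray executor backend works around the multi-card
# hang, as suggested above; env var name taken from the comments in this thread.
os.environ["LMDEPLOY_EXECUTOR_BACKEND"] = "ray"

from lmdeploy import pipeline, PytorchEngineConfig

# Placeholder model path; mirrors the CLI command quoted above (Ascend, tp=2).
pipe = pipeline("qwen2.5-3b",
                backend_config=PytorchEngineConfig(device_type="ascend", tp=2))
print(pipe(["Hello, world"]))
```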

> [@JackWeiw](https://github.com/JackWeiw) Following your method, I tested in a 310P single-node multi-card environment; results as follows: > > Environment: Ascend 300V Pro, dual cards; CANN 8.1.RC1; dlinfer after [DeepLink-org/dlinfer#219](https://github.com/DeepLink-org/dlinfer/pull/219), with the patches from [DeepLink-org/dlinfer#225](https://github.com/DeepLink-org/dlinfer/pull/225) and #227 applied; latest lmdeploy > > export LMDEPLOY_EXECUTOR_BACKEND=ray export ASCEND_RANK_TABLE_FILE_PATH=ranktable.json python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device...

The current version does support InternVL3-8B.

We have not yet systematically tested the speed and memory usage of internvl3-8b deployed on 310P. Could you share your test results?

Currently, initialization on 310P has to run the Transdata operator over the language-model weights to convert the tensors from ND format to NZ format (the underlying ATB operators on 310P require both the A and B tensors of a Linear to be in NZ format, so converting the weights to NZ at initialization avoids repeated ND-to-NZ conversions during decoding). Graph mode also needs a warm-up, so the first inference response is a bit slow, but subsequent responses get faster.

Sharing the speeds I have tested so far: Qwen2.5-7B on 2 cards ![Image](https://github.com/user-attachments/assets/6692f098-0906-47e4-8871-9a15bf906539) and Qwen3-32B on 4 cards ![Image](https://github.com/user-attachments/assets/079db447-6e43-4a53-8ec7-324a23061e4e)

We are currently trying Ray for 310P multi-card inference, which should effectively resolve the hangs that can occur while serving inference on 310P. BTW, for 310P devices it is recommended to set block_size to 128!
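For reference, a minimal sketch of applying that block_size recommendation through lmdeploy's Python API; the model path and tp value are placeholders matching the 2-card Qwen2.5-7B test above:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# block_size=128 follows the 310P recommendation above; other values are placeholders.
backend_config = PytorchEngineConfig(
    device_type="ascend",
    tp=2,            # two 310P cards, as in the Qwen2.5-7B test
    block_size=128,  # paged KV-cache block size recommended for 310P
)
pipe = pipeline("Qwen2.5-7B-Instruct", backend_config=backend_config)
print(pipe(["Hello"]))
```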

I updated my script following the examples in the DISC torch inference docs, but another problem occurred: ![捕获](https://github.com/alibaba/BladeDISC/assets/126441921/3242cfa5-fbfb-4bb7-8a40-48df8a4d09a4) Your kind help is much appreciated!!! @Yancey1989 @eedalong

I passed the half-precision model to blade_disc; however, the optimized model saved by blade_disc is fp32. How come? ![image](https://github.com/alibaba/BladeDISC/assets/126441921/a80f7a3d-e002-4f36-a50a-91f753e27ff5)
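In case it helps others, a hedged sketch of explicitly requesting fp16 output via torch_blade's config. This assumes the `enable_fp16` flag is what controls the precision of the optimized graph (i.e. passing a half-precision model alone may not be enough); the resnet18 model and input shape are placeholders:

```python
import torch
import torch_blade
import torchvision.models as models  # placeholder model source

model = models.resnet18().cuda().eval()       # placeholder model
example = torch.randn(1, 3, 224, 224).cuda()  # placeholder input

# Assumption: enable_fp16 asks BladeDISC to emit an fp16-optimized graph,
# instead of relying on the dtype of the weights passed in.
cfg = torch_blade.config.Config()
cfg.enable_fp16 = True
with torch.no_grad(), cfg:
    opt_model = torch_blade.optimize(model, allow_tracing=True,
                                     model_inputs=(example,))
torch.jit.save(opt_model, "opt_fp16.pt")
```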