flynn
DRA cannot be used in production environments. Can we modify the GPU allocation strategy code to implement it ourselves? Can anyone give me some help?
> DRA cannot be used in production environments. Can we modify the GPU allocation strategy code to implement it ourselves? Can anyone give me some help? For example, using time-slicing, replicas...
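As a sketch of the time-slicing alternative mentioned above: assuming the NVIDIA k8s-device-plugin is the GPU plugin in use, its sharing configuration supports time-slicing roughly like this (the `replicas` value here is illustrative, not a recommendation):

```yaml
# Hypothetical time-slicing config for the NVIDIA k8s-device-plugin,
# as a non-DRA way to share GPUs; the replicas count is illustrative.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # each physical GPU is advertised as 4 allocatable GPUs
```

With such a config applied, pods request `nvidia.com/gpu` as usual and the plugin oversubscribes each physical device; note that time-slicing provides no memory isolation between the sharing pods.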
> Those are two files; what exactly is the problem? How can I further convert them to onnx or pmx format? Launching with ppl.llm.serving reports that the pmx or onnx file does not exist.
> > How can I further convert them to onnx or pmx format? Launching with ppl.llm.serving reports that the pmx or onnx file does not exist. > > Continue with Export.py to export the model, and you will get an onnx-format file. I tried that; continuing to export the model with Export.py produces a large number of warnings, Warning: The shape interface of opmx::XX (e.g. ParallelEmbedding, ColumnParallelLinear, Reshape) type is missing, and using the resulting onnx...
Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.
> Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect. After removing the '--enable-prefix-caching' parameter, this issue no longer occurs.
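For reference, a sketch of a launch command matching the workaround above (the model path and port are placeholders; the fix is simply omitting --enable-prefix-caching):

```shell
# Sketch of a vLLM OpenAI-compatible server launch, per the workaround above.
# Model path and port are placeholders; note the absence of --enable-prefix-caching.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --port 8000 \
  --enable-chunked-prefill=False
```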
https://github.com/opendatalab/MinerU/pull/3967 This PR optimizes memory usage; could someone review and merge it? @myhloli @skyler9901
> [#3967](https://github.com/opendatalab/MinerU/pull/3967) > > This PR optimizes memory usage; could someone review and merge it? > > [@myhloli](https://github.com/myhloli) [@skyler9901](https://github.com/skyler9901) Under sustained stress testing with an 80 MB (286-page) PDF, memory first climbed to 20 GB before the optimization and kept growing until an OOM; after the optimization it ran for two days with memory staying under 2 GB.
I'm having the same problem with Qwen2.5-32B-Instruct-GPTQ-Int4 and --quantization gptq.
> I'm having the same problem with Qwen2.5-32B-Instruct-GPTQ-Int4 and --quantization gptq. Try --quantization gptq_marlin; there is no error.
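A sketch of the suggested change, assuming vLLM's `vllm serve` CLI (the Hugging Face model id is the standard one for this checkpoint, used here as an example):

```shell
# Before (errors, per the report above):
#   vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 --quantization gptq
# Suggested workaround: select the Marlin GPTQ kernel instead.
vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 --quantization gptq_marlin
```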