flynn
DRA cannot be used in production environments. Can we modify the GPU allocation strategy code to implement it ourselves? Can anyone give me some help?
> DRA cannot be used in production environments. Can we modify the GPU allocation strategy code to implement it ourselves? Can anyone give me some help? For example, using time-slicing, replicas...
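As a sketch of the time-slicing alternative mentioned above: assuming the NVIDIA k8s-device-plugin is the GPU plugin in use, its sharing configuration supports time-slicing roughly like this (the `replicas` value here is illustrative, not a recommendation):

```yaml
# Hypothetical time-slicing config for the NVIDIA k8s-device-plugin,
# as a non-DRA way to share GPUs; the replicas count is illustrative.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # each physical GPU is advertised as 4 allocatable GPUs
```

With such a config applied, pods request `nvidia.com/gpu` as usual and the plugin oversubscribes each physical device; note that time-slicing provides no memory isolation between the sharing pods.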
> Those are two files; what exactly is the problem? How can I further convert them to onnx or pmx format? Launching with ppl.llm.serving reports that the pmx or onnx file does not exist.
> > How can I further convert them to onnx or pmx format? Launching with ppl.llm.serving reports that the pmx or onnx file does not exist. > > Continue with Export.py to export the model, and you will get an onnx-format file. I tried that; continuing to export the model with Export.py produces a large number of warnings, Warning: The shape interface of opmx::XX (e.g. ParallelEmbedding, ColumnParallelLinear, Reshape) type is missing, and using the resulting onnx...
Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect.
> Version 0.6.4.post1 has the same issue; setting --enable-chunked-prefill=False did not have any effect. After removing the '--enable-prefix-caching' parameter, this issue no longer occurs.
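For reference, a sketch of a launch command matching the workaround above (the model path and port are placeholders; the fix is simply omitting --enable-prefix-caching):

```shell
# Sketch of a vLLM OpenAI-compatible server launch, per the workaround above.
# Model path and port are placeholders; note the absence of --enable-prefix-caching.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --port 8000 \
  --enable-chunked-prefill=False
```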
https://github.com/opendatalab/MinerU/pull/3967 This PR optimizes memory usage; could someone review and merge it? @myhloli @skyler9901
> [#3967](https://github.com/opendatalab/MinerU/pull/3967) > > This PR optimizes memory usage; could someone review and merge it? > > [@myhloli](https://github.com/myhloli) [@skyler9901](https://github.com/skyler9901) Under sustained stress testing with an 80 MB (286-page) PDF, memory first climbed to 20 GB before the optimization and kept growing until an OOM; after the optimization it ran for two days with memory staying under 2 GB.
I'm having the same problem with Qwen2.5-32B-Instruct-GPTQ-Int4 and --quantization gptq.
> I'm having the same problem with Qwen2.5-32B-Instruct-GPTQ-Int4 and --quantization gptq. Try --quantization gptq_marlin; there is no error.
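A sketch of the suggested change, assuming vLLM's `vllm serve` CLI (the Hugging Face model id is the standard one for this checkpoint, used here as an example):

```shell
# Before (errors, per the report above):
#   vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 --quantization gptq
# Suggested workaround: select the Marlin GPTQ kernel instead.
vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 --quantization gptq_marlin
```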