q yao

Results 318 comments of q yao

> I'm curious to know if there's any plan to bring TurboMind support to smaller models like InternVL2-1B @lvhan028 @lzhangzz

> File "/opt/conda/lib/python3.8/site-packages/triton/compiler/backends/cuda.py", line 173, in make_llir ret = translate_triton_gpu_to_llvmir(src, capability, tma_infos, runtime.TARGET.NVVM)

The Triton kernel compilation failed on your device. What is your Triton version?
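To check the installed version, a one-liner (assuming Triton is installed in the same environment):

```python
import triton

print(triton.__version__)
```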

Thanks for the report. I will fix it soon.

lmdeploy 0.7.3+ is an old version; please upgrade and try again. Also, you have not explicitly set the PyTorch engine backend, so some models might be dispatched to TurboMind.

Pass `--backend pytorch` to force the PyTorch engine. TurboMind has better performance, though.
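For reference, a minimal sketch of selecting the PyTorch engine explicitly through the Python API (the model path below is just a placeholder); the CLI equivalent is the `--backend pytorch` flag mentioned above:

```python
# Minimal sketch, assuming lmdeploy is installed; the model path is a placeholder.
from lmdeploy import pipeline, PytorchEngineConfig

# Passing a PytorchEngineConfig forces the PyTorch engine instead of TurboMind.
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=PytorchEngineConfig())
print(pipe(["Hello"]))
```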

This block_size is the block size used in paged attention; it is a configuration parameter of the engine: https://github.com/InternLM/lmdeploy/blob/5f0647f1181312975f05d16eeb166d5a69afb6ef/lmdeploy/messages.py#L342. It is normally required to be a power of two. If it were not, many modules/kernels such as fill_kv_cache / paged_attention would be affected, with no performance benefit (more boundary checks would be needed inside the kernels, and the tensor-core usage in attention would become more complicated). So there is an implicit assumption about block size here; perhaps an assertion should be added to the engine startup checks?
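A minimal sketch of the suggested startup assertion (the function name and wiring are illustrative, not the actual lmdeploy code):

```python
def check_block_size(block_size: int) -> None:
    """Illustrative startup check: paged-attention kernels such as
    fill_kv_cache / paged_attention assume a power-of-two block size."""
    assert block_size > 0 and (block_size & (block_size - 1)) == 0, (
        f"block_size must be a positive power of two, got {block_size}"
    )

check_block_size(64)    # passes
# check_block_size(48)  # would raise AssertionError
```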

> Does this mean that only modules that are packaged and can be imported by `importlib` are supported? In many cases, users only write a model in a script...

`ray` stores its logs on the file system. You can clean up your disk or ignore the warning.
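If disk space is the concern, here is a hedged sketch for removing stale Ray session logs, assuming Ray's default temp directory `/tmp/ray` (adjust the path if you set a custom `--temp-dir`) and that no Ray cluster is currently running:

```python
import shutil
from pathlib import Path

ray_tmp = Path("/tmp/ray")                # Ray's default temp/log directory
latest = ray_tmp / "session_latest"
live = latest.resolve() if latest.exists() else None

for session in ray_tmp.glob("session_*"):
    # Keep the symlink and the live session it points to; remove older sessions.
    if session.is_symlink() or (live is not None and session.resolve() == live):
        continue
    shutil.rmtree(session, ignore_errors=True)
```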

> Starting the server with the command from the lmdeploy tutorial:
> lmdeploy serve api_server root/workspace/personal_data/LLM_models/Qwen3-8B --adapters mylora=/root/workspace/personal_data/lora_model
> fails with: lmdeploy serve api_server: error: the following arguments are required: model_path

I can start the server with a similar command. Did you miss a `/` in the path? On the client side, you can use the `model=mylora` field to select which adapter is activated.
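A hedged sketch of picking the adapter from the client, assuming the server was launched with `--adapters mylora=...`, exposes the OpenAI-compatible API, and listens on the default port 23333 (adjust `base_url` for your deployment):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="mylora",  # the adapter name registered via --adapters selects that LoRA
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```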

> error:

Try building the wheel on a device with network access.