q yao
Update the TensorRT version, or just remove the bicubic resize mode in `torch2trt_dynamic/converters/grid_sample.py`.
I cannot test SOLO on my device; it hits OOM.
https://github.com/InternLM/lmdeploy/blob/967df47f574056740cb45b52338563373730c144/lmdeploy/pytorch/kernels/cuda/flashattention.py#L498 Try manually tuning these arguments.
@cuikaiGitHub 0.8.0 is quite an old version; try switching to our latest release. If the latest release still does not work, manually tuning the values above might help. num_stages is an int...
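To show where those launch arguments live, here is a minimal, self-contained sketch using a toy Triton kernel rather than lmdeploy's actual flash attention; the block size, `num_warps`, and `num_stages` values are illustrative only:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    # each program instance copies one BLOCK-sized slice
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    vals = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, vals, mask=mask)


src = torch.arange(1024, device='cuda', dtype=torch.float32)
dst = torch.empty_like(src)
grid = (triton.cdiv(src.numel(), 256), )
# num_warps and num_stages are plain ints passed at launch time; if a
# kernel runs out of shared memory, lowering num_stages (e.g. 3 -> 2 -> 1)
# or shrinking the block size is the usual first fix
copy_kernel[grid](src, dst, src.numel(), BLOCK=256, num_warps=4, num_stages=2)
```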
Since kv int4 requires triton>=2.3.0, it would be cool if we added a check in the engine. https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/pytorch/check_env/__init__.py
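Something along these lines could work (a sketch only; the function name and error message are hypothetical, not the actual lmdeploy check):

```python
# hypothetical check, sketched for lmdeploy/pytorch/check_env/__init__.py
from packaging import version


def check_triton_kv_int4():
    """Fail fast if the installed triton is too old for kv int4 quant."""
    import triton
    if version.parse(triton.__version__) < version.parse('2.3.0'):
        raise RuntimeError('kv int4 quantization requires triton>=2.3.0, '
                           f'found triton=={triton.__version__}.')
```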
`quant_policy` might not be a good argument name, since users might misunderstand it as online quant (gemm).
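For reference, `quant_policy` only selects kv cache quantization; weight quantization (e.g. AWQ) is a separate offline step. A sketch assuming the values in the recent lmdeploy docs, with a placeholder model path:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy: 0 = off, 4 = kv int4, 8 = kv int8 (per recent lmdeploy docs)
pipe = pipeline('internlm/internlm2-chat-7b',  # placeholder model
                backend_config=TurbomindEngineConfig(quant_policy=8))
```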
We do not support all PyTorch ops yet.
[This](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/07-developer-guide/support_new_model.md) is the tutorial on how to support new models. We have placed all rewriters for MMRotate in https://github.com/open-mmlab/mmdeploy/tree/master/mmdeploy/codebase/mmrotate/models . Try adding a rewriter to support the model you want.
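For a concrete shape of what a rewriter looks like, here is a sketch following the pattern in that tutorial; the target (`torch.Tensor.repeat` on the TensorRT backend) is the doc's example, not something specific to your model:

```python
from mmdeploy.core import FUNCTION_REWRITER


@FUNCTION_REWRITER.register_rewriter(
    func_name='torch.Tensor.repeat', backend='tensorrt')
def repeat_static(ctx, input, *size):
    # mostly defer to the original implementation, but work around a
    # backend limitation for 1-d tensors by adding/removing a batch dim
    origin_func = ctx.origin_func
    if input.dim() == 1 and len(size) == 1:
        return origin_func(input.unsqueeze(0), *([1] + list(size))).squeeze(0)
    return origin_func(input, *size)
```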
You can try this:
```python
import asyncio
from typing import List

import openai


async def chat_single(client: openai.AsyncOpenAI, prompt: str,
                      model_name: str):
    """chat single async"""
    response = await client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


async def chat_batch(client: openai.AsyncOpenAI, prompts: List[str],
                     model_name: str):
    """fan all prompts out concurrently and gather the replies"""
    tasks = [chat_single(client, prompt, model_name) for prompt in prompts]
    return await asyncio.gather(*tasks)
```
Also, the distilled R1 models should all be supported by TurboMind, and AWQ performance is also better with TurboMind. If you are after performance, TurboMind is the recommended choice.
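For example, a sketch assuming the standard lmdeploy pipeline API; the model path is a placeholder, and `model_format='awq'` assumes an AWQ checkpoint:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('deepseek-ai/DeepSeek-R1-Distill-Qwen-7B-AWQ',  # placeholder
                backend_config=TurbomindEngineConfig(model_format='awq'))
print(pipe(['Hello, who are you?']))
```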