q yao

Results: 318 comments by q yao

Update the TensorRT version, or just remove the bicubic resize mode in `torch2trt_dynamic/converters/grid_sample.py`.

I cannot test SOLO on my device; it runs out of memory (OOM).

https://github.com/InternLM/lmdeploy/blob/967df47f574056740cb45b52338563373730c144/lmdeploy/pytorch/kernels/cuda/flashattention.py#L498 Try manually tuning these arguments.

@cuikaiGitHub 0.8.0 is quite an old version; try switching to our latest release. If the latest release still does not work, manually tuning the values above might help. num_stages is an int...
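For context, kernel launch arguments like these are usually swept as Triton configs. A minimal sketch using the standard Triton autotune API; the `BLOCK_M`/`BLOCK_N` names and values below are illustrative, not necessarily the exact knobs in `flashattention.py`:

```python
import triton

# Hypothetical sweep values; tune for your GPU. num_stages must be an int.
configs = [
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_stages=3, num_warps=4),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_stages=2, num_warps=8),
]

# triton.autotune benchmarks each config and caches the fastest one per key:
# @triton.autotune(configs=configs, key=["head_dim"])
# def flash_attention_kernel(...): ...
```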

Since kv int4 requires triton>=2.3.0, it would be good to add a check in the engine. https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/pytorch/check_env/__init__.py
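A minimal sketch of what such a check could look like next to the other validators in `check_env/__init__.py`; the function name and the `quant_policy == 4` convention for kv int4 are assumptions:

```python
from packaging import version


def check_triton_for_kv_int4(quant_policy: int):
    """Assumed helper: kv int4 quantization needs triton >= 2.3.0."""
    if quant_policy != 4:  # 4 = kv int4 (assumed convention)
        return
    import triton
    if version.parse(triton.__version__) < version.parse("2.3.0"):
        raise RuntimeError(
            f"kv int4 requires triton>=2.3.0, found {triton.__version__}.")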

`quant_policy` might not be a good argument name, since users might misunderstand it as online quantization (GEMM).
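For context, `quant_policy` selects kv-cache quantization rather than weight/GEMM quantization. A minimal sketch of where the argument appears; the model name is a placeholder:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# quant_policy=4 enables kv int4 cache quantization; it does not quantize GEMMs.
pipe = pipeline("internlm/internlm2-chat-7b",
                backend_config=PytorchEngineConfig(quant_policy=4))
```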

[This](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/07-developer-guide/support_new_model.md) is the tutorial on how to support new models. We have placed all rewriters for MMRotate in https://github.com/open-mmlab/mmdeploy/tree/master/mmdeploy/codebase/mmrotate/models . Try adding a rewriter to support the model you want.
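The rewriter pattern from that tutorial looks roughly like this; `SomeDetector` and the `func_name` path are placeholders for the model you want to support:

```python
from mmdeploy.core import FUNCTION_REWRITER


@FUNCTION_REWRITER.register_rewriter(
    func_name="mmrotate.models.detectors.SomeDetector.forward")
def some_detector__forward(ctx, self, img, **kwargs):
    """Replace dynamic control flow with an export-friendly version."""
    # ctx.origin_func is the original forward; modify it as needed for export.
    return ctx.origin_func(self, img, **kwargs)
```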

You can try this:

```python
import asyncio
from typing import List

import openai


async def chat_single(client: openai.AsyncOpenAI, prompt: str, model_name: str):
    """chat single async"""
    response = await client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )  # assumed tail: the quoted snippet was truncated here
    return response.choices[0].message.content
```
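A hedged usage sketch for the helper above, assuming an OpenAI-compatible lmdeploy server; the `base_url`, `api_key`, and model name are placeholders:

```python
async def main(prompts: List[str]):
    client = openai.AsyncOpenAI(base_url="http://localhost:23333/v1",
                                api_key="none")
    # Fire all requests concurrently and collect the answers in order.
    outputs = await asyncio.gather(
        *(chat_single(client, p, "my-model") for p in prompts))
    for out in outputs:
        print(out)


asyncio.run(main(["hello", "what can lmdeploy do?"]))
```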

Also, the distilled R1 models should all be supported by turbomind, and awq also performs better with turbomind. If you are after performance, turbomind is the recommended choice.
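A minimal sketch of selecting the turbomind backend explicitly; the model path is a placeholder and should point to an AWQ-quantized checkpoint:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# model_format="awq" tells turbomind to load AWQ-quantized weights.
pipe = pipeline("path/to/DeepSeek-R1-Distill-awq",
                backend_config=TurbomindEngineConfig(model_format="awq"))
print(pipe(["Hello!"]))
```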