Li Zhang

Results 73 comments of Li Zhang

> For example seq1: `xxxxyyyyzzzz`, seq2: `yyyyzzzz`, 4 tokens per block, for general cache, seq2 may use the last 2 cached blocks of seq1. In this case 1. The positional...

The 1.0 version had some inline PTX that was missing the `volatile` qualifier, which produced incorrect results in some cases. You can try the turbomind-2.1 branch.

Currently, lmdeploy has no problem running mistral-7b. The plan is to add its chat template after window attention is supported.

In fact, turbomind currently only supports llama-family models. 😂 Decoupling the engine from the model implementation is ongoing work (likely to finish in October).

This is most likely caused by the randomness introduced by sampling. With top-k sampling, you will get different results even with the same version. With internlm2-chat-7b and `rope_scaling_factor=2.0,...
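The run-to-run variation described above can be sketched with a toy top-k sampler (hypothetical code, not lmdeploy's actual implementation):

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Toy top-k sampler: keep the k highest logits, softmax them, then draw."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # softmax over the kept set
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.5, 1.4, 0.1]
# Different RNG states can legitimately pick different tokens out of the
# top-3, which is why repeated runs of the same model version can disagree.
a = top_k_sample(logits, 3, random.Random(0))
b = top_k_sample(logits, 3, random.Random(1))
```

With `k=1` the sampler degenerates to greedy argmax, which is the deterministic setting to use when comparing versions.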

#1636 will only make a difference when the prompt is short and max_new_tokens is large. I tried without it and the result is OK.

#1116 adds support for linear rope scaling, so we need to distinguish between the `dynamic` and `linear` rope scaling methods. The flag `use_dynamic_ntk` is changed to be set based on the model's config.json...
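The distinction between the two methods can be sketched roughly as follows (a simplified illustration, not lmdeploy's actual code; all function names here are made up):

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def linear_scaled_angle(pos, inv_freq_i, scaling_factor):
    """Linear scaling compresses the position index by the factor."""
    return (pos / scaling_factor) * inv_freq_i

def dynamic_ntk_base(base, dim, scaling_factor):
    """Dynamic NTK instead enlarges the rotary base (simplified formula)."""
    return base * scaling_factor ** (dim / (dim - 2))
```

Both stretch the usable context, but they change different parts of the rotary embedding, which is why the engine has to know which method the model's config.json asks for.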

In theory it is possible once the weight layout is adjusted.

> Also, there is a new paper, not sure whether you have looked into it. The code repository is here, a faster int4fp16 implementation.

That thing is not actually fast.

There is no bank conflict once the layout is adjusted.

> For example, adjacent threads within a warp simultaneously write 64-bit data to adjacent shared memory addresses. Although thread 0 and thread 16 write to the same bank, profiling actually shows no bank conflict.

32 32-bit banks cannot serve 32 64-bit accesses at once. A 64-bit access is split into two phases, threads 0-15 and 16-31...