Li Zhang

Results 73 comments of Li Zhang

> For example seq1: `xxxxyyyyzzzz`, seq2: `yyyyzzzz`, 4 tokens per block, for general cache, seq2 may use the last 2 cached blocks of seq1. In this case 1. The positional...

The 1.0 version had some inline PTX that was missing the `volatile` qualifier, which produced incorrect results in some cases. You can try the turbomind-2.1 branch.

Currently, lmdeploy has no problem running mistral-7b. The plan is to add its chat template after window attention is supported.

In fact, turbomind currently only supports llama-family models. 😂 Decoupling the engine from the model implementation is ongoing work (likely to finish in October).

This is most likely caused by the randomness introduced by sampling. With top-k sampling, you will get different results even with the same version. With internlm2-chat-7b and `rope_scaling_factor=2.0,...
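The run-to-run variation described above can be sketched with a toy top-k sampler (hypothetical code, not lmdeploy's actual implementation):

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Toy top-k sampler: keep the k highest logits, softmax them, then draw."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # softmax over the kept set
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.5, 1.4, 0.1]
# Different RNG states can legitimately pick different tokens out of the
# top-3, which is why repeated runs of the same model version can disagree.
a = top_k_sample(logits, 3, random.Random(0))
b = top_k_sample(logits, 3, random.Random(1))
```

With `k=1` the sampler degenerates to greedy argmax, which is the deterministic setting to use when comparing versions.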

#1636 will only make a difference when the prompt is short and max_new_tokens is large. I tried without it and the result is OK.

#1116 adds support for linear rope scaling, so we need to distinguish between the `dynamic` and `linear` rope scaling methods. The flag `use_dynamic_ntk` is changed to be set based on the model's config.json...
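The distinction between the two methods can be sketched roughly as follows (a simplified illustration, not lmdeploy's actual code; all function names here are made up):

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def linear_scaled_angle(pos, inv_freq_i, scaling_factor):
    """Linear scaling compresses the position index by the factor."""
    return (pos / scaling_factor) * inv_freq_i

def dynamic_ntk_base(base, dim, scaling_factor):
    """Dynamic NTK instead enlarges the rotary base (simplified formula)."""
    return base * scaling_factor ** (dim / (dim - 2))
```

Both stretch the usable context, but they change different parts of the rotary embedding, which is why the engine has to know which method the model's config.json asks for.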

In theory it is possible once the weight layout is adjusted.

> Also, there is a new paper, not sure whether you have looked into it. The code repository is here, a faster int4fp16 implementation.

That thing is not actually fast.

There is no bank conflict once the layout is adjusted.

> For example, adjacent threads within a warp simultaneously write 64-bit data to adjacent shared memory addresses. Although thread 0 and thread 16 write to the same bank, profiling actually shows no bank conflict.

32 32-bit banks cannot serve 32 64-bit accesses at once. A 64-bit access is split into two phases, threads 0-15 and 16-31...