q yao

Results 318 comments of q yao

Do you have a concrete comparison plan and supporting data? Right now the decoding bottleneck on the PyTorch engine side is mainly on the host; for prefill, because of the split-and-slice, the Linear layers actually cost far more than attention. It would be best to have data to evaluate this.

@zhyncs llama13b + 128 concurrency + 3000 prompts `prompt = SYSTEM_PROMPT + prompt` w/o caching ``` concurrency: 128 elapsed_time: 576.054s number of prompt tokens: 758360 number of completion tokens: 725516...

@zhyncs I made some mistakes when performing the benchmark above. After updating this line https://github.com/InternLM/lmdeploy/blob/1f72b8f33821051dafa35502f1efc2a60d2440c6/benchmark/profile_restful_api.py#L35 new result: w/o caching ``` concurrency: 128 elapsed_time: 548.647s number of prompt tokens: 1007244...

> GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) A higher version is required.

Fixing it properly will take some time; if you need it urgently, you can edit config.json first.

Any detail about the hash table implementation? Honestly, I do not like my radix tree implementation in this PR.
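For context, a radix-tree (trie) prefix cache of the kind discussed here maps token-id sequences to cached KV blocks, so matching walks the tree token by token. A minimal sketch, with hypothetical names and not lmdeploy's actual implementation:

```python
class RadixNode:
    """One node per token id; stores a cached KV block id, if any."""
    def __init__(self):
        self.children = {}    # token id -> RadixNode
        self.block_id = None  # cached KV block for the prefix ending here


class RadixPrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, block_id):
        """Record that the KV cache for `tokens` lives in `block_id`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.block_id = block_id

    def longest_prefix(self, tokens):
        """Return (matched_len, block_id) for the longest cached prefix."""
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            node = node.children.get(t)
            if node is None:
                break
            if node.block_id is not None:
                best = (i + 1, node.block_id)
        return best
```

Matching cost is linear in the prompt length, and eviction can drop whole subtrees (least-recently-used leaves first), which is where per-node bookkeeping gets fiddly.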

When do we need general cache?

@ispobock Do they support window attention? How do they evict blocks? Would it take a long time if we have a large amount of blocks? s-lora would increase number of...

Sure, let's ignore the sliding window for now. It seems that the hash map does not bring much benefit to prefix matching. Eviction by blocks takes more time than eviction...
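For comparison, a hash-map scheme typically matches prefixes a fixed-size block at a time: each block is keyed by a hash of (parent block key, block tokens), so lookup walks the chain block by block instead of token by token. A hedged sketch with hypothetical names, to illustrate why it matches only whole blocks:

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative)


def block_key(parent_key, tokens):
    # Key a block by its content *and* its ancestry, so identical
    # token blocks under different prefixes do not collide.
    return hash((parent_key, tuple(tokens)))


class HashBlockCache:
    def __init__(self):
        self.blocks = {}  # block key -> cached KV block id

    def insert(self, tokens, block_ids):
        """Register full blocks of `tokens`, one entry per block id."""
        key = None
        for i, bid in enumerate(block_ids):
            chunk = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            key = block_key(key, chunk)
            self.blocks[key] = bid

    def match(self, tokens):
        """Return block ids of the longest cached prefix (whole blocks only)."""
        key, matched = None, []
        for i in range(len(tokens) // BLOCK_SIZE):
            chunk = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            key = block_key(key, chunk)
            if key not in self.blocks:
                break
            matched.append(self.blocks[key])
        return matched
```

The per-block lookup is O(1), but partial-block matches are lost and eviction still has to track which chained keys become dead, so the bookkeeping does not obviously get simpler than the tree.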