q yao

Results 318 comments of q yao

Do you have a concrete comparison plan and supporting data? Right now the decoding bottleneck on the PyTorch engine side is mainly on the host; for prefill, because of the split-and-slice, the Linear layers actually cost far more than attention. It would be best to have data to evaluate this.

@zhyncs llama13b + 128 concurrency + 3000 prompts `prompt = SYSTEM_PROMPT + prompt` w/o caching ``` concurrency: 128 elapsed_time: 576.054s number of prompt tokens: 758360 number of completion tokens: 725516...

@zhyncs I made some mistakes when performing the benchmark above. After updating this line https://github.com/InternLM/lmdeploy/blob/1f72b8f33821051dafa35502f1efc2a60d2440c6/benchmark/profile_restful_api.py#L35 new result: w/o caching ``` concurrency: 128 elapsed_time: 548.647s number of prompt tokens: 1007244...

> GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44) A higher version is required.

Fixing it properly will take some time; if you need it urgently, you can edit config.json first.

Any detail about the hash table implementation? Honestly, I do not like my radix tree implementation in this PR.
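For context, a radix-tree (trie) prefix cache of the kind discussed here maps token-id sequences to cached KV blocks, so matching walks the tree token by token. A minimal sketch, with hypothetical names and not lmdeploy's actual implementation:

```python
class RadixNode:
    """One node per token id; stores a cached KV block id, if any."""
    def __init__(self):
        self.children = {}    # token id -> RadixNode
        self.block_id = None  # cached KV block for the prefix ending here


class RadixPrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, block_id):
        """Record that the KV cache for `tokens` lives in `block_id`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.block_id = block_id

    def longest_prefix(self, tokens):
        """Return (matched_len, block_id) for the longest cached prefix."""
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            node = node.children.get(t)
            if node is None:
                break
            if node.block_id is not None:
                best = (i + 1, node.block_id)
        return best
```

Matching cost is linear in the prompt length, and eviction can drop whole subtrees (least-recently-used leaves first), which is where per-node bookkeeping gets fiddly.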

When do we need general cache?

@ispobock Do they support window attention? How do they evict blocks? Would it take a long time if we have a large amount of blocks? s-lora would increase number of...

Sure, let's ignore the sliding window for now. It seems that the hash map does not bring much benefit to prefix matching. Eviction by blocks takes more time than eviction...
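For comparison, a hash-map scheme typically matches prefixes a fixed-size block at a time: each block is keyed by a hash of (parent block key, block tokens), so lookup walks the chain block by block instead of token by token. A hedged sketch with hypothetical names, to illustrate why it matches only whole blocks:

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative)


def block_key(parent_key, tokens):
    # Key a block by its content *and* its ancestry, so identical
    # token blocks under different prefixes do not collide.
    return hash((parent_key, tuple(tokens)))


class HashBlockCache:
    def __init__(self):
        self.blocks = {}  # block key -> cached KV block id

    def insert(self, tokens, block_ids):
        """Register full blocks of `tokens`, one entry per block id."""
        key = None
        for i, bid in enumerate(block_ids):
            chunk = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            key = block_key(key, chunk)
            self.blocks[key] = bid

    def match(self, tokens):
        """Return block ids of the longest cached prefix (whole blocks only)."""
        key, matched = None, []
        for i in range(len(tokens) // BLOCK_SIZE):
            chunk = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            key = block_key(key, chunk)
            if key not in self.blocks:
                break
            matched.append(self.blocks[key])
        return matched
```

The per-block lookup is O(1), but partial-block matches are lost and eviction still has to track which chained keys become dead, so the bookkeeping does not obviously get simpler than the tree.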