Li Zhang comments

Results: 73 comments of Li Zhang

add batch inference in demo

This greatly complicates the demos, which are already complicated enough.

Is gpt_gemm still useful on sm_80 and newer GPU architectures?

Yes, it's still useful for sm_80 GPUs. It benchmarks not only cuBLAS but also cuBLASLt (which has a lot more combinations than cuBLAS).

Undefined reference to `MPI::Comm::Comm()'

I had the same problem in a non-docker environment too. Adding `mpi_cxx` to the link dependency of `mpi_utils` solved it.

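[Bug] Does 0.4.0 not support qwen1.5 110b yet?

The vocabulary table is currently not split across TP ranks, and because Qwen's vocabulary is especially large, the impact is quite noticeable. I'll look into how to add that.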
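
To illustrate what splitting the vocabulary table by TP means, here is a minimal Python sketch. It is not TurboMind's implementation; the names `shard_range`, `split_table` and `lookup` are hypothetical, and the all-reduce across ranks is simulated with a plain sum. Each rank keeps only a contiguous slice of rows, so per-rank memory drops by roughly a factor of the TP size.

```python
def shard_range(vocab_size, tp_size, rank):
    """Return the [start, end) token-id range owned by `rank` (hypothetical helper)."""
    per_rank = (vocab_size + tp_size - 1) // tp_size  # ceil division; last shard may be short
    start = min(rank * per_rank, vocab_size)
    end = min(start + per_rank, vocab_size)
    return start, end


def split_table(table, tp_size):
    """Split a full embedding table (list of rows) into one row slice per TP rank."""
    vocab_size = len(table)
    return [table[slice(*shard_range(vocab_size, tp_size, r))] for r in range(tp_size)]


def lookup(token_id, shards, vocab_size, hidden_size):
    """Each rank contributes its row or zeros; the sum stands in for an all-reduce."""
    tp_size = len(shards)
    total = [0.0] * hidden_size
    for rank, shard in enumerate(shards):
        start, end = shard_range(vocab_size, tp_size, rank)
        if start <= token_id < end:
            partial = shard[token_id - start]  # this rank owns the token
        else:
            partial = [0.0] * hidden_size      # other ranks contribute nothing
        total = [a + b for a, b in zip(total, partial)]
    return total


# Toy table: 10 tokens, hidden size 2, split over 4 ranks.
table = [[float(i), float(i) * 2] for i in range(10)]
shards = split_table(table, 4)
row = lookup(7, shards, vocab_size=10, hidden_size=2)
```

With a very large vocabulary, the unsplit table is replicated on every GPU, which is why the effect is more visible for Qwen than for models with smaller vocabularies.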

[Bug] Does 0.4.0 not support qwen1.5 110b yet?

It will be supported, but not that soon; probably in about two weeks.

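Deploying the InternLM2-Chat-20B service on two V100 GPUs with the lmdeploy CLI; after running for a while, requests fail with: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231

It is hard to locate the problem from the information available so far. You could set the environment variable with `export TM_DEBUG_LEVEL=DEBUG` and try again.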

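Deploying the InternLM2-Chat-20B service on two V100 GPUs with the lmdeploy CLI; after running for a while, requests fail with: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231

You can try `export TM_DEBUG_LEVEL=DEBUG`. Then, if circumstances allow, start the server under gdb; that would help a lot in locating the problem.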

[Feature] TurboMind support W8A8 or FP8 KV Cache

FP8 KV cache will be a lot easier. You will need to add some template specializations for type conversion and some code for dispatching the kernels.

[Feature] TurboMind support W8A8 or FP8 KV Cache

We don't have plans to support FP8 KV cache, as the current INT8 implementation works just fine and it also works on pre-`sm_89` devices. (well the fact is that...

Turbomind prefix caching

> In the current implementation, the blocks in the block trie are computed and read-only. We only cache and match computed blocks. So shared blocks will not be re-written multiple times. I...
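
The behaviour described in the quote can be sketched as a small trie keyed by the token content of each full block. This is only an illustration under stated assumptions, not TurboMind's actual code: the `BlockTrie` and `BlockTrieNode` classes, the `block_size` parameter and the integer block ids are all hypothetical. Only complete (already computed) blocks are inserted, and an existing node is never overwritten, which models the read-only property.

```python
class BlockTrieNode:
    """One cached KV block; children are keyed by the next block's tokens."""

    def __init__(self, block_id=None):
        self.block_id = block_id
        self.children = {}


class BlockTrie:
    def __init__(self, block_size):
        self.block_size = block_size
        self.root = BlockTrieNode()

    def _full_blocks(self, tokens):
        # Only complete blocks are eligible; a trailing partial block is ignored.
        bs = self.block_size
        return [tuple(tokens[i * bs:(i + 1) * bs]) for i in range(len(tokens) // bs)]

    def match(self, tokens):
        """Return block ids for the longest cached prefix of `tokens`."""
        node, matched = self.root, []
        for key in self._full_blocks(tokens):
            child = node.children.get(key)
            if child is None:
                break
            matched.append(child.block_id)
            node = child
        return matched

    def cache(self, tokens, block_ids):
        """Insert computed blocks; nodes that already exist are left untouched."""
        node = self.root
        for key, block_id in zip(self._full_blocks(tokens), block_ids):
            child = node.children.get(key)
            if child is None:
                child = BlockTrieNode(block_id)
                node.children[key] = child
            node = child  # reuse the existing block instead of rewriting it


trie = BlockTrie(block_size=2)
trie.cache([1, 2, 3, 4, 5], [10, 11])   # token 5 is a partial block, not cached
trie.cache([1, 2, 3, 4], [20, 21])      # same prefix again: existing blocks are kept
shared = trie.match([1, 2, 3, 4, 9, 9])
```

A second sequence with the same prefix reuses the cached block ids rather than replacing them, so shared blocks are written once.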