Li Zhang
This greatly complicates the demos, which are already complicated enough.
Yes, it's still useful for `sm_80` GPUs. It benchmarks not only cuBLAS but also cuBLASLt (which has many more combinations than cuBLAS).
I had the same problem in a non-docker environment too. Adding `mpi_cxx` to the link dependency of `mpi_utils` solved it.
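Currently the vocabulary (embedding table) is not split across TP ranks, and since Qwen's vocabulary is especially large, the impact is more noticeable there. I'll look into how to add this.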
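For illustration, here is a minimal sketch of what splitting a vocabulary across tensor-parallel ranks could look like. It is not the TurboMind implementation; `VocabShard` and `vocab_shard` are hypothetical names, and the scheme (contiguous slices of ceil(vocab / tp) rows, with the last slices possibly shorter or empty) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of vocabulary rows owned by one TP rank.
struct VocabShard {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: give each rank a contiguous slice of at most
// ceil(vocab_size / tp_size) rows. Trailing ranks may get fewer rows
// (or none) when vocab_size is not a multiple of tp_size.
inline VocabShard vocab_shard(std::size_t vocab_size, std::size_t tp_size, std::size_t rank)
{
    assert(tp_size > 0 && rank < tp_size);
    const std::size_t per_rank = (vocab_size + tp_size - 1) / tp_size;  // ceiling division
    std::size_t begin = rank * per_rank;
    std::size_t end   = begin + per_rank;
    if (begin > vocab_size) begin = vocab_size;  // clamp ranks past the end
    if (end > vocab_size) end = vocab_size;
    return {begin, end};
}
```

With such a split each rank holds roughly 1/TP of the table instead of a full copy, which is why the saving matters most for models with a large vocabulary.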
It will be supported, but not that soon; probably not for another two weeks or so.
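With the information available so far it's hard to locate where the problem is. You can set the environment variable with `export TM_DEBUG_LEVEL=DEBUG` and try again.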
You can try `export TM_DEBUG_LEVEL=DEBUG`. Then, if circumstances allow, launch the server under gdb, which will help a lot in locating the problem.
FP8 KV cache will be a lot easier. You will need to add some template specializations for type conversion and some code for dispatching the kernels.
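The general pattern of "a template specialization for type conversion plus a dispatch case" can be sketched on the host side as follows. This is only an illustration under assumptions, not TurboMind code: `KvType`, `KvConvert`, `round_trip` and `dispatch_round_trip` are hypothetical names, and the INT8 conversion uses a plain symmetric scale. No FP8 conversion is implemented here; the comments only mark where one would plug in.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical runtime tag for the KV cache storage type.
enum class KvType { kFloat, kInt8 };

// Primary template is left undefined, so an unsupported storage type
// fails at compile time until a specialization is added for it.
template<class T>
struct KvConvert;

// Identity conversion: values are stored as-is.
template<>
struct KvConvert<float> {
    static float store(float x, float /*scale*/) { return x; }
    static float load(float q, float /*scale*/) { return q; }
};

// Symmetric INT8 quantization with a single scale (an assumption of this sketch).
template<>
struct KvConvert<std::int8_t> {
    static std::int8_t store(float x, float scale)
    {
        float q = std::round(x / scale);
        q = q < -128.f ? -128.f : (q > 127.f ? 127.f : q);  // clamp to int8 range
        return static_cast<std::int8_t>(q);
    }
    static float load(std::int8_t q, float scale) { return static_cast<float>(q) * scale; }
};

// Stand-in for a kernel templated on the storage type.
template<class T>
float round_trip(float x, float scale)
{
    return KvConvert<T>::load(KvConvert<T>::store(x, scale), scale);
}

// Runtime dispatch to the templated code path.
inline float dispatch_round_trip(KvType type, float x, float scale)
{
    switch (type) {
        case KvType::kFloat: return round_trip<float>(x, scale);
        case KvType::kInt8: return round_trip<std::int8_t>(x, scale);
        // A new storage type would add one KvConvert specialization above
        // and one case here.
    }
    assert(false && "unhandled KvType");
    return 0.f;
}
```

The point of the structure is that the kernel body stays generic, so supporting another storage type touches only the conversion trait and the dispatch switch.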
We don't have plans to support FP8 KV cache, as the current INT8 implementation works just fine and it also works on pre-`sm_89` devices. (well the fact is that...
> In the current implementation, the blocks in the block trie are computed and read-only. We only cache and match computed blocks. So shared blocks will not be re-written multiple times. I...
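As a rough illustration of the idea in the quote (cache only computed blocks, match them by prefix, never overwrite a shared block), here is a minimal sketch. It is not the actual block trie; `Block`, `TrieNode`, `BlockTrie`, `match` and `cache` are hypothetical names, and keying nodes by the full token list of a block is an assumption made for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

using Block = std::vector<int>;  // token ids of one full block

struct TrieNode {
    int block_id = -1;  // id of the cached, already computed block
    std::map<Block, std::unique_ptr<TrieNode>> children;
};

class BlockTrie {
public:
    // Returns the block ids of the longest cached prefix of `blocks`.
    std::vector<int> match(const std::vector<Block>& blocks) const
    {
        std::vector<int> ids;
        const TrieNode* node = &root_;
        for (const auto& b : blocks) {
            auto it = node->children.find(b);
            if (it == node->children.end()) {
                break;  // first block that is not cached ends the match
            }
            node = it->second.get();
            ids.push_back(node->block_id);
        }
        return ids;
    }

    // Registers computed blocks. Nodes that already exist are reused and
    // keep their original block id, so a shared block is never rewritten.
    void cache(const std::vector<Block>& blocks, const std::vector<int>& block_ids)
    {
        assert(blocks.size() == block_ids.size());
        TrieNode* node = &root_;
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            auto& child = node->children[blocks[i]];
            if (!child) {
                child = std::make_unique<TrieNode>();
                child->block_id = block_ids[i];
            }
            node = child.get();
        }
    }

private:
    TrieNode root_;
};
```

In this sketch, two sequences that share their first block end up pointing at the same node, and the second call to `cache` leaves that node's block id untouched.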