
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference

Results 9 PQCache issues

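Excellent work! In Section 4.1.4 of the paper you state that the experimental GPU configuration is a single RTX 4090 24GB. I tested longbench with llama3.1 and hit an OOM error on the very first task, narrativeqa. The run_llama.sh file you provide sets device to 0 and 1; I set device to 0 only and ran the experiment on a single 4090, with all other settings identical to the sh file. ``` Traceback (most recent call last): File "/root/PQCache/vq_pred.py", line 463, in get_pred(args, model, tokenizer, 0, world_size, data_all, max_length, max_gen, File "/root/PQCache/vq_pred.py", line 178, in get_pred output =...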

![Image](https://github.com/user-attachments/assets/1ea01749-c936-4ff0-a66d-c3eb27dda531) Hello, I keep getting this warning. Is this expected? I would like to know.

I want to measure the latency, but the repository doesn't seem to include nah_input.jsonl. How can I use test_latency.py without it? Thanks!

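Hello, have you tried longer texts, for example 128k? When I try a 128k context, the run stays stuck in the kmeans step of the prefill phase. Does this exceed the CPU's capacity? I would be very grateful if you could suggest a solution!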

In my own environment this parameter causes problems at run time. Perhaps instructions on how to set the number of CPUs correctly are needed? I haven't tested too many...

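Hello, I ran into some problems while reproducing the experiments in your paper and would like to ask for your advice. I evaluated the pq method on the InfiniteBench test set, but the results differ considerably from the values reported in the paper. For example, on the Retr.PassKey task the paper reports a score of 100, whereas my reproduction reaches only 37.63. My main experimental parameters are: compression ratio: 0.1; recent ratio: 0.2; all other parameters use the default configuration. In addition, the full attention method gives normal results under the same configuration, so I suspect that the InfiniteBench-related code or configuration differs. Could you share the code for the InfiniteBench experiments, or specific instructions for reproducing them? Thank you very much for your time and help!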

The paper mentions that the experiment results on Mistral-7B-inst-v0.2 are in Appendix A, but the appendix is not attached in the paper. Could you please update the results or show...

I set up the environment following env.yml (changing every cu121 in the torch entries to 118), then ran bash run_mistral.sh and got: Traceback (most recent call last): File "/home/xuhan/KV_Cache/PQCache/vq_pred.py", line 11, in from vq_method.llama_patch import VQLlamaForCausalLM File "/home/xuhan/KV_Cache/PQCache/vq_method/llama_patch.py", line 22, in from .retrieval_based.pq_search import * File "/home/xuhan/KV_Cache/PQCache/vq_method/retrieval_based/pq_search.py", line 10, in from...

In the paper, Section 4.1.4 (Hardware Environment and Hyperparameters) says: Unless otherwise specified, for each experiment, we use an NVIDIA GeForce RTX 4090 24GB card for GPU computation, two Intel(R) Xeon(R) Gold 6330...