[Feature] Which DeepSeek dynamic quantization versions does the latest 0.3.2 release currently support?
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 2. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-English/Chinese content without translation may be closed.
Motivation
I compiled and installed KT 0.3.2 and tried to run the Deepseek-R1-671B-0528-IQ2_M version. During model loading, however, host memory usage did not increase. After some debugging I found that the MoE layer parameters were all still meta tensors; only the CUDA-placed parameters were actually loaded, and GPU memory was consumed as expected. CUDA Graph construction completed normally, but generation then stalled before the first token was produced, and after a while the following error was raised:

```
Traceback (most recent call last):
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 313, in run_engine
    engine.loop()
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 265, in loop
    self.model_runner.run(self.batch, self.query_manager)
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 194, in run
    self.features = self.model.batch_embeddings(self.input[cuda_graph_idx], device=self.device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py", line 66, in batch_embeddings
    self.model.embed_tokens(tokens.to(torch.device('cpu')))
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/home/wangqs/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self
```
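For reference, this is roughly how I checked the "MoE parameters are still meta tensors" observation during debugging. It is a minimal sketch of my own, assuming `model` is the instantiated ktransformers model object; `report_meta_params` is my helper name, not a ktransformers API:

```python
import torch

def report_meta_params(model: torch.nn.Module) -> None:
    """List parameters that were never materialized (still on the 'meta' device)."""
    not_loaded = []
    for name, param in model.named_parameters():
        if param.device.type == "meta":
            not_loaded.append(name)  # in my run this was essentially all MoE expert weights
        # materialized parameters report a real device such as cpu or cuda:0
    print(f"{len(not_loaded)} parameters still on the meta device")
    for name in not_loaded:
        print("  NOT LOADED:", name)
```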
A more important question for me: I only have a consumer-grade platform with 256 GB of RAM. Which dynamic quantization versions does KT currently support, so I can pick one with a reasonable balance between performance and memory usage?
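My own rough way of sizing this up (an assumption of mine, not something provided by ktransformers): sum the GGUF shard sizes on disk for a candidate quantization and treat that as an approximate lower bound on the host RAM needed for the CPU-resident expert weights, on top of runtime overhead.

```python
from pathlib import Path

def gguf_total_gib(model_dir: str) -> float:
    """Total on-disk size of the GGUF shards in GiB, as a rough lower bound
    on the host RAM a quantization variant will need."""
    total = sum(f.stat().st_size for f in Path(model_dir).glob("*.gguf"))
    return total / 2**30

# Example (hypothetical path): compare candidate quants against a 256 GB budget
# print(gguf_total_gib("/models/DeepSeek-R1-0528-IQ2_M"))
```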
Related resources
No response