
[Bug] RuntimeError: CUDA error: invalid resource handle

Open mrgaolei opened this issue 5 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

CUDA 12.8; the following error occurs at runtime:

loading model.layers.60.post_attention_layernorm.weight to cuda
loading model.norm.weight to cuda
Getting inference context from sched_client.
sched_rpc started with PID: 5175
Got inference context, sending it to subscribers.
Rebuilding kvcache
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 300, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 215, in __init__
    inference_context = self.sched_client.rebuild_inferece_context(inference_context)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/balance_serve/sched_rpc.py", line 196, in rebuild_inferece_context
    inference_context.k_cache = [fn(*args) for fn,args in data['k_cache']]
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/balance_serve/sched_rpc.py", line 196, in <listcomp>
    inference_context.k_cache = [fn(*args) for fn,args in data['k_cache']]
                                 ^^^^^^^^^
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/storage.py", line 1452, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid resource handle
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
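
For context on where this fails: judging from the traceback, balance_serve shares the KV cache tensors between the scheduler and the engine subprocess via PyTorch's CUDA IPC mechanism, and rebuild_cuda_tensor() is what reconstructs them on the receiving side. Below is a minimal sketch of that same mechanism, independent of ktransformers (plain torch API, toy tensor):

import torch
import torch.multiprocessing as mp

def child(t: torch.Tensor):
    # Unpickling the CUDA tensor in this process goes through
    # torch.multiprocessing.reductions.rebuild_cuda_tensor(), the same
    # call that raises "invalid resource handle" in the traceback above.
    print(t.device, t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")  # sharing CUDA tensors requires spawn or forkserver
    t = torch.ones(4, device="cuda")
    p = mp.Process(target=child, args=(t,))
    p.start()
    p.join()

If this sketch also fails on the affected machine, the problem is likely CUDA IPC support in the driver/OS setup rather than ktransformers itself; CUDA IPC is known to be limited under WSL, though the same failure is reported below on bare Ubuntu. As the error message suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 should give a more precise stack trace.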

Reproduction

Launch command: python ktransformers/server/main.py --model_path /mnt/d/deepseek/r1/DeepSeek-R1 --gguf_path /mnt/d/deepseek/r1/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S --architectures DeepseekV3ForCausalLM --cpu_infer 35 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml --port 10002 --chunk_size 256 --max_new_tokens 1024 --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve

Environment

I tried both WSL and bare-metal Ubuntu with the same result. The exact same compile-and-install procedure worked before; the only difference is that the OS was completely reinstalled. conda environment:

accelerate==1.10.0
annotated-types==0.7.0
anyio==4.10.0
blessed==1.21.0
blobfile==3.0.0
build==1.3.0
certifi==2025.8.3
charset-normalizer==3.4.3
click==8.2.2
colorlog==6.9.0
cpufeature==0.2.1
distro==1.9.0
einops==0.8.1
fastapi==0.116.1
filelock==3.13.1
fire==0.7.0
flash_attn @ file:///mnt/d/flash_attn-2.8.2%2Bcu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl#sha256=bf8e2091a699751a2311586c0a26f49e6ee6ed305ab85d0d50873f747e27cbab
flashinfer-python @ file:///home/aigao/ktransformers/third_party/custom_flashinfer
fsspec==2024.6.1
greenlet==3.2.4
h11==0.16.0
hf-xet==1.1.7
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.34.4
idna==3.10
Jinja2==3.1.4
jiter==0.10.0
jsonpatch==1.33
jsonpointer==3.0.0
ktransformers @ file:///home/aigao/ktransformers
langchain==0.3.27
langchain-core==0.3.74
langchain-text-splitters==0.3.9
langsmith==0.4.13
lxml==6.0.0
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
ninja==1.11.1.4
numpy==2.1.2
nvidia-cublas-cu12==12.8.3.14
nvidia-cuda-cupti-cu12==12.8.57
nvidia-cuda-nvrtc-cu12==12.8.61
nvidia-cuda-runtime-cu12==12.8.57
nvidia-cudnn-cu12==9.7.1.26
nvidia-cufft-cu12==11.3.3.41
nvidia-cufile-cu12==1.13.0.11
nvidia-curand-cu12==10.3.9.55
nvidia-cusolver-cu12==11.7.2.55
nvidia-cusparse-cu12==12.5.7.53
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.8.61
nvidia-nvtx-cu12==12.8.55
openai==1.99.6
orjson==3.11.1
packaging==25.0
pillow==11.0.0
protobuf==6.31.1
psutil==7.0.0
pycryptodomex==3.23.0
pydantic==2.11.7
pydantic_core==2.33.2
pyproject_hooks==1.2.0
PyYAML==6.0.2
pyzmq==27.0.1
regex==2025.7.34
requests==2.32.4
requests-toolbelt==1.0.0
safetensors==0.6.2
sentencepiece==0.2.0
sniffio==1.3.1
SQLAlchemy==2.0.42
starlette==0.47.2
sympy==1.13.3
tenacity==9.1.2
termcolor==3.1.0
tiktoken==0.11.0
tokenizers==0.21.4
torch==2.7.1+cu128
torchaudio==2.7.1+cu128
torchvision==0.22.1+cu128
tqdm==4.67.1
transformers==4.53.3
triton==3.3.1
typing-inspection==0.4.1
typing_extensions==4.12.2
urllib3==2.5.0
uvicorn==0.35.0
wcwidth==0.2.13
zmq==0.0.0
zstandard==0.23.0

mrgaolei · Aug 12 '25 03:08

And this is rpc.log

[2025-08-12 17:21:55.161] [info] [scheduler.cpp:31] Number of available GPUs: 1, want 1
[2025-08-12 17:21:55.161] [info] [scheduler.cpp:66] Each GPU Total: 2196MiB, Model Params: 0MiB, KVCache: 2196MiB, Left: 0MiB
[2025-08-12 17:21:55.161] [info] [scheduler.cpp:87] total_kvcache_pages is auto derived as 128
[2025-08-12 17:21:55.161] [info] [scheduler.cpp:933] Using Strategy FCFS
[2025-08-12 17:21:55.161] [info] [scheduler.cpp:459]
Scheduler Settings:
  model_name: DeepSeek-R1
  quant_type: BF16
    model_path: /mnt/d/deepseek/r1/DeepSeek-R1
    params_count: 0
    layer_count: 61
    num_k_heads: 1
    k_head_dim: 576
    bytes_per_params: 2
    bytes_per_kv_cache_element: 2
  page_size: 256
  gpu_device_id: 0
  gpu_memory_size: 2.30G
  memory_utilization_percentage: 1
  max_batch_size: 4
  recommended_chunk_prefill_token_count: 127
  sched_metrics_port: 43061
  kvc2_config_path: /home/aigao/.ktransformers/kvc2
  kvc2_root_path: /home/aigao/kvc
  memory_pool_size_GB: 200
  evict_count: 40
  kvc2_metrics_port: 57209
  load_from_disk: false
  save_to_disk: true
  strategy_name: FCFS
  gpu_device_count: 1

load_model_configs from "/home/aigao/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
 - DeepSeek-R1
Load from "/mnt/d/deepseek/r1/DeepSeek-R1/config.json"
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1372] Creating KVC2 using these config
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1373]     GPU Only: false
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1374]     Load: false, Save: true
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1375]     Path: /home/aigao/kvc
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1376]     Config Path: /home/aigao/.ktransformers/kvc2
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1377]     Num Token/Page: 256, Memory Pool Size: 200.00G
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1379]     Evict Count: 40, Metrics Port: 57209
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1380]     Recompute Ratio: 0.20
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1384]     GPU Devices: 0
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1385]     Layer Count: 61, Total KVCache Pages: 128
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1387]     Num Token/Page: 256, Num K Heads: 1
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1388]     K Head Dim: 576, Tensor Type: 15
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1390]     MemcpyCudaStreams/Device: 4
load_model_configs from "/home/aigao/.ktransformers/kvc2/model_configs.json"
Loaded Model Configs
 - DeepSeek-R1
load_quant_configs no file at "/home/aigao/.ktransformers/kvc2/quant_configs.json"
[2025-08-12 17:21:55.167] [info] [prefix.cpp:1401] Creating kvc2 metrics exporter on 0.0.0.0:57209
[2025-08-12 17:21:55.167] [info] [prefix.cpp:278] DiskCacheManager root path: /home/aigao/kvc
[2025-08-12 17:21:55.168] [debug] [page_aligned_memory_pool.cpp:25] first_page[0] = 0, count_page[0] = 3051757
[2025-08-12 17:21:55.172] [debug] [page_aligned_memory_pool.cpp:25] first_page[1] = 12499996672, count_page[1] = 3051757
[2025-08-12 17:21:55.177] [debug] [page_aligned_memory_pool.cpp:25] first_page[2] = 24999993344, count_page[2] = 3051757
[2025-08-12 17:21:55.182] [debug] [page_aligned_memory_pool.cpp:25] first_page[3] = 37499990016, count_page[3] = 3051757
[2025-08-12 17:21:55.187] [debug] [page_aligned_memory_pool.cpp:25] first_page[4] = 49999986688, count_page[4] = 3051757
[2025-08-12 17:21:55.191] [debug] [page_aligned_memory_pool.cpp:25] first_page[5] = 62499983360, count_page[5] = 3051757
[2025-08-12 17:21:55.195] [debug] [page_aligned_memory_pool.cpp:25] first_page[6] = 74999980032, count_page[6] = 3051757
[2025-08-12 17:21:55.200] [debug] [page_aligned_memory_pool.cpp:25] first_page[7] = 87499976704, count_page[7] = 3051757
[2025-08-12 17:21:55.206] [debug] [page_aligned_memory_pool.cpp:25] first_page[8] = 99999973376, count_page[8] = 3051757
[2025-08-12 17:21:55.211] [debug] [page_aligned_memory_pool.cpp:25] first_page[9] = 112499970048, count_page[9] = 3051757
[2025-08-12 17:21:55.216] [debug] [page_aligned_memory_pool.cpp:25] first_page[10] = 124999966720, count_page[10] = 3051757
[2025-08-12 17:21:55.221] [debug] [page_aligned_memory_pool.cpp:25] first_page[11] = 137499963392, count_page[11] = 3051757
[2025-08-12 17:21:55.226] [debug] [page_aligned_memory_pool.cpp:25] first_page[12] = 149999960064, count_page[12] = 3051757
[2025-08-12 17:21:55.231] [debug] [page_aligned_memory_pool.cpp:25] first_page[13] = 162499956736, count_page[13] = 3051757
[2025-08-12 17:21:55.236] [debug] [page_aligned_memory_pool.cpp:25] first_page[14] = 174999953408, count_page[14] = 3051757
[2025-08-12 17:21:55.240] [debug] [page_aligned_memory_pool.cpp:25] first_page[15] = 187499950080, count_page[15] = 3051770
[2025-08-12 17:21:55.245] [info] [page_aligned_memory_pool.cpp:30] PageAlignedMemoryPool with size 190734 Mbytes, 48828125 pages
[2025-08-12 17:21:55.246] [info] [gpu_cache.cpp:15] Number of available GPUs: 1, want 1
[2025-08-12 17:21:55.246] [warning] [gpu_cache.cpp:28] Creating GPU Cache
[2025-08-12 17:21:55.246] [info] [gpu_cache.cpp:45] Creating KV Page Cache, Shape (61,128,256,1,576), Size 2196 MiB
2025/08/12 17:21:55.669832|INFO |th=000077D8E8000F50|epoll.cpp:319|new_epoll_engine:Init epoll event engine: master
2025/08/12 17:21:55.670009|DEBUG|th=000077D8E8000F50|reset_handle.cpp:27|ResetHandle:push [this=000077D8E80010B0]
2025/08/12 17:21:55.670421|DEBUG|th=000077D8E8000F50|reset_handle.cpp:27|ResetHandle:push [this=000077D8E8000E60]
2025/08/12 17:21:55.670530|DEBUG|th=000077D8E8000F50|aio-wrapper.cpp:398|libaio_wrapper_init:libaio initialized
2025/08/12 17:21:55.670576|DEBUG|th=000077D8E8000F50|photon.cpp:128|__photon_init:reset_all_handle registed [getpid()=10256]
2025/08/12 17:21:55.670701|INFO |th=000077D8E8000F50|async_store.cpp:200|io_dealer:Initializing IO Dealer
2025/08/12 17:21:55.683385|INFO |th=000077D8F27F9C40|async_store.cpp:171|io_perf:IO Depth 128
[2025-08-12 17:21:56.887] [info] [gpu_cache.cpp:54] K Page Cache of GPU 0 is created
[2025-08-12 17:21:56.887] [warning] [gpu_cache.cpp:72] Disalbe V Cache
[2025-08-12 17:21:56.888] [info] [prefix.cpp:1709] Starting CPU Background flush
[2025-08-12 17:21:56.888] [info] [prefix.cpp:1747] Starting GPU Background flush
[2025-08-12 17:21:56.888] [info] [scheduler.cpp:513] Creating scheduler metrics exporter on 0.0.0.0:43061
[2025-08-12 17:21:56.891] [warning] [scheduler.cpp:668] Starting Scheduler Event Loop
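
As a cross-check, the KV cache sizing in this log is internally consistent; the numbers can be reproduced with plain arithmetic (all values copied from the Scheduler Settings and gpu_cache.cpp lines above):

layers, pages, page_size = 61, 128, 256            # layer_count, total_kvcache_pages, page_size
num_k_heads, k_head_dim, bytes_per_elem = 1, 576, 2

kv_bytes = layers * pages * page_size * num_k_heads * k_head_dim * bytes_per_elem
print(kv_bytes / 2**20)    # ~2196 MiB, matching "Size 2196 MiB"
print(32768 // page_size)  # 128 pages, matching --cache_lens 32768

So the GPU-side cache shape (61,128,256,1,576) is exactly what the launch flags ask for; note also the ~190734 MB PageAlignedMemoryPool, which is host memory allocated by kvc2.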

mrgaolei · Aug 12 '25 09:08

A further update: after checking out git tag v0.3.2, launching with a non-balance_serve backend works fine, but launching with balance_serve produces this error.

If I git pull to HEAD, neither backend works, and it looks like HEAD already defaults to the balance_serve engine. Does balance_serve require 512 GB of RAM? The only change to this machine is that I downgraded it from 512 GB of RAM to 256 GB, but I am running a Q2 quant, which should fit in theory, since the default kt engine starts without problems.
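
A quick way to sanity-check the memory hypothesis against the log: rpc.log shows memory_pool_size_GB: 200 and a PageAlignedMemoryPool of about 190734 MB of host memory, which is a large fixed cost on a 256 GB machine before model weights are counted. A sketch using psutil (already present in the environment above); whether this pool is actually the culprit is only a guess:

import psutil

pool_gib = 200  # from rpc.log: memory_pool_size_GB
avail_gib = psutil.virtual_memory().available / 2**30
print(f"available RAM: {avail_gib:.0f} GiB, kvc2 memory pool wants: {pool_gib} GiB")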

mrgaolei · Aug 13 '25 03:08

Same problem here. Is there a solution?

SaiHoCao · Aug 17 '25 10:08