[Bug] deepseek v3/r1 with full context with balance_serve backend
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
I am not sure how to run the balance_serve backend with a full 128k context. If I try something like this:
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((128*1024)) \
--chunk_size 256 \
--max_batch_size 1 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
I get the following error during CUDA graph capture.
Rebuilding kvcache
kv_cache loaded successfully.
capturing cuda graph 1 1
2025-07-03 13:06:22,408 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-07-03 13:06:46,487 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64_profiler_False
2025-07-03 13:06:46,536 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-07-03 13:07:02,601 - INFO - flashinfer.jit: Finished loading JIT ops: norm
cuda_graph: 1/6, warmup finished.
capturing cuda graph 2 2
cuda_graph: 2/6, warmup finished.
capturing cuda graph 3 3
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 279, in run_engine
engine.model_runner.warmup()
~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 133, in warmup
self.model_attn_plan(self.input[i], i)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 92, in model_attn_plan
self.model.flash_infer_attn_plan(batch, self.bsz_tensor_buf, self.num_tokens_tensor_buf,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
num_heads=self.model.config.num_attention_heads, head_dim_ckv=self.model.config.kv_lora_rank,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
head_dim_kpe=self.model.config.qk_rope_head_dim, page_size=self.model.cache.page_size, causal=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sm_scale=self.model.model.layers[0].self_attn.softmax_scale, q_data_type=torch.bfloat16, kv_data_type=torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py", line 146, in flash_infer_attn_plan
self.wrapper.plan(minibatch.q_indptr, minibatch.kv_indptr, minibatch.kv_indices,
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
minibatch.kv_len, num_heads, head_dim_ckv, head_dim_kpe, page_size, causal, sm_scale, q_data_type, kv_data_type, bsz_tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/flashinfer/mla.py", line 246, in plan
self._qo_indptr_buf[:len(qo_indptr)].copy_(qo_indptr, non_blocking=True)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 0
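For what it's worth, the shape mismatch itself seems easy to explain from the traceback: flashinfer pre-allocates a fixed-size `_qo_indptr_buf` and `plan()` copies each batch's `qo_indptr` into a slice of it, so if the buffer was sized for a smaller batch than the one being captured, the `copy_()` fails. A minimal sketch of that failure mode (my own illustration, not ktransformers or flashinfer code):

```python
import torch

# In the real trace these tensors live on the GPU; CPU tensors reproduce the
# exact same size-mismatch RuntimeError.
qo_indptr_buf = torch.zeros(3, dtype=torch.int32)           # pre-allocated for a smaller batch
qo_indptr = torch.tensor([0, 1, 2, 3], dtype=torch.int32)   # incoming plan for a larger batch

qo_indptr_buf[: len(qo_indptr)].copy_(qo_indptr)
# RuntimeError: The size of tensor a (3) must match the size of tensor b (4)
# at non-singleton dimension 0
```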
I do not hit a similar issue with the ktransformers backend, but that backend doesn't support disk prefix caching. So what should I do?
[UPDATE]: an ephemeral, vibe-coded workaround seems to have been found: https://github.com/kvcache-ai/ktransformers/issues/1417#issuecomment-3040256985
Reproduction
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((128*1024)) \
--chunk_size 256 \
--max_batch_size 1 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
Environment
uname -a
Linux xxx 6.12.32-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.32-1 (2025-06-07) x86_64 GNU/Linux
nvidia-smi
Thu Jul 3 12:32:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.02 Driver Version: 575.51.02 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:41:00.0 Off | N/A |
| 30% 35C P8 31W / 350W | 160MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
uv pip list
Using Python 3.13.3 environment at: /opt/ktransformers/.ktransformers
Package Version
------------------------ ------------------------
accelerate 1.8.1
annotated-types 0.7.0
anyio 4.9.0
blessed 1.21.0
build 1.2.2.post1
certifi 2025.6.15
charset-normalizer 3.4.2
click 8.2.1
colorlog 6.9.0
cpufeature 0.2.1
distro 1.9.0
fastapi 0.115.14
filelock 3.18.0
fire 0.7.0
flashinfer-python 0.2.3
fsspec 2025.5.1
greenlet 3.2.3
h11 0.16.0
hf-xet 1.1.5
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.33.2
idna 3.10
jinja2 3.1.6
jiter 0.10.0
jsonpatch 1.33
jsonpointer 3.0.0
ktransformers 0.3.2+cu129torch29avx2
langchain 0.3.26
langchain-core 0.3.67
langchain-text-splitters 0.3.8
langsmith 0.4.4
markupsafe 2.1.5
mpmath 1.3.0
networkx 3.5
ninja 1.11.1.4
numpy 2.3.1
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.2.5
nvidia-nvtx-cu12 12.8.90
openai 1.92.2
orjson 3.10.18
packaging 24.2
pillow 11.2.1
protobuf 6.31.1
psutil 7.0.0
pydantic 2.11.7
pydantic-core 2.33.2
pyproject-hooks 1.2.0
pytorch-triton 3.3.1+gitc8757738
pyyaml 6.0.2
pyzmq 27.0.0
regex 2024.11.6
requests 2.32.4
requests-toolbelt 1.0.0
safetensors 0.5.3
sentencepiece 0.2.0
setuptools 78.1.0
sniffio 1.3.1
sqlalchemy 2.0.41
starlette 0.46.2
sympy 1.14.0
tenacity 9.1.2
termcolor 3.1.0
tokenizers 0.21.2
torch 2.9.0.dev20250702+cu128
torchaudio 2.8.0.dev20250702+cu128
torchvision 0.24.0.dev20250702+cu128
tqdm 4.67.1
transformers 4.51.3
triton 3.3.1
typing-extensions 4.14.0
typing-inspection 0.4.1
urllib3 2.5.0
uvicorn 0.35.0
wcwidth 0.2.13
wheel 0.45.1
zmq 0.0.0
zstandard 0.23.0
Using --no-use_cuda_graph seems to avoid the issue, but inference speed (decode, not prefill) becomes about 2x slower.
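As far as I understand, the 2x decode slowdown without CUDA graphs is expected: the graph records the whole decode step once and then replays it for every generated token, removing per-kernel launch overhead. A generic PyTorch sketch of that capture/replay pattern (my own illustration, not ktransformers internals):

```python
import torch

# Static buffers are required because replay() re-reads and re-writes the exact
# memory addresses recorded during capture.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.no_grad(), torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_out = model(static_in)

# Per decode step: copy the new input into the static buffer and replay the graph;
# no Python-side kernel launches happen on the hot path.
static_in.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()
print(static_out.shape)
```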
So what is going on? Why does the single-session ktransformers backend work fine with CUDA graphs, while the balance_serve backend doesn't seem functional even with one session? And if I try to support four sessions with full context on 24 GB of VRAM, I hit an OOM. :)
Namely, if I try this:
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((128*1024)) \
--chunk_size 256 \
--max_batch_size 4 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
I get this during inference:
2025-07-03 17:46:18,122 DEBUG /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py[418]: get input ids of shape torch.Size([1, 126174])
add query id: 2, batch.query_lengths: 126174, batch_query_tokens: torch.Size([131072]), batch.block_indexes: tensor([1023, 0, 1, ..., 1020, 1021, 1022], dtype=torch.int32)
prefill_batch_i: 254,
Model execution time (GPU): 4192.625 ms, 0.239 tokens/s
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
engine.loop()
~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
generated_tokens, probs = self.sampling( self.model_runner.output)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
generated_tokens, probs=self.sampler(logit, sample_options)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 97, in forward
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f30eb57ef00 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111a7 (0x7f30f211f1a7 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c3cb (0x7f30f212a3cb in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1771c (0x7f30f212571c in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1be27 (0x7f30f2129e27 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d646 (0x7f30f212b646 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x31b9d (0x7f30f213fb9d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x36ccda (0x7f30e496ccda in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x628158 (0x7f30e4c28158 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x438b28 (0x7f30e4a38b28 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #10: c10::TensorImpl::~TensorImpl() + 0x1c5 (0x7f30eb55c145 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #11: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f30eb55c1c9 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #12: <unknown function> + 0x6cfb48 (0x7f30e4ccfb48 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x6cff1d (0x7f30e4ccff1d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #14: /opt/ktransformers/.ktransformers/bin/python3() [0x52daa2]
frame #15: /opt/ktransformers/.ktransformers/bin/python3() [0x54247d]
frame #16: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #17: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #18: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #19: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #20: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #21: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #22: /opt/ktransformers/.ktransformers/bin/python3() [0x59c1af]
frame #23: /opt/ktransformers/.ktransformers/bin/python3() [0x59991b]
frame #24: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #25: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #26: /opt/ktransformers/.ktransformers/bin/python3() [0x5b3118]
frame #27: /opt/ktransformers/.ktransformers/bin/python3() [0x5cad16]
frame #28: _PyEval_EvalFrameDefault + 0x4bde (0x561a1e in /opt/ktransformers/.ktransformers/bin/python3)
frame #29: PyEval_EvalCode + 0xcc (0x64d9ac in /opt/ktransformers/.ktransformers/bin/python3)
frame #30: /opt/ktransformers/.ktransformers/bin/python3() [0x66da21]
frame #31: /opt/ktransformers/.ktransformers/bin/python3() [0x669a8c]
frame #32: /opt/ktransformers/.ktransformers/bin/python3() [0x65c015]
frame #33: /opt/ktransformers/.ktransformers/bin/python3() [0x65be2c]
frame #34: Py_RunMain + 0x2a9 (0x680df9 in /opt/ktransformers/.ktransformers/bin/python3)
frame #35: Py_BytesMain + 0x2b (0x63d36b in /opt/ktransformers/.ktransformers/bin/python3)
frame #36: <unknown function> + 0x29ca8 (0x7f30f4b4bca8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: __libc_start_main + 0x85 (0x7f30f4b4bd65 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: _start + 0x21 (0x63c701 in /opt/ktransformers/.ktransformers/bin/python3)
[Thu Jul 3 17:46:22 2025] NVRM: Xid (PCI:0000:41:00): 31, pid=610922, name=python3, Ch 0000001f, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC1 GPCCLIENT_T1_0 faulted @ 0x4_0da53000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Is there any multi-GPU config with balance_serve support?
I don't understand why disk-backed prefix caching wasn't implemented in the ktransformers backend.
The balance_serve backend is unfortunately not usable. It limits the model's output to 32k tokens in order to do what? To potentially support up to 4 simultaneous clients? What for? If I wanted 4 simultaneous sessions, I would build 4 separate machines and route the queries myself. I DO NOT need that kind of support in an LLM inference framework. Please make something that works stably with the full 128k context. Otherwise it is impossible to run DeepSeek with the full 128k context anywhere: balance_serve seems to require four sessions and limits the max token output to 32k, while the ktransformers backend works but doesn't support disk prefix caching and appears to be getting deprecated. This is clearly ridiculous.
Ha. I can't even do a 126k prefill with max_batch_size=4 and max_tokens=132k. Ha. It OOMs at about 90k of prefill.
Not sure what is going on. Let me kill the xfce4 process that took 50 MB of VRAM and try again. Maybe there is an issue with the gpu_utilization configuration.
[EDIT]: I tried with xfce4 killed, so that only ktransformers uses the VRAM, and the result is the same.
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((32*1024)) \
--chunk_size 256 \
--max_batch_size 4 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
and it crashes after about 98k of context prefill. It steadily consumes about 22.6 GB of VRAM and then just crashes. So not only does decoding a 128k context NOT work with balance_serve, the prefill fails too. Did anyone actually test it?
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
engine.loop()
~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
generated_tokens, probs = self.sampling( self.model_runner.output)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
generated_tokens, probs=self.sampler(logit, sample_options)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 97, in forward
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f699b97ef00 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111a7 (0x7f699bd751a7 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c3cb (0x7f699bd803cb in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1771c (0x7f699bd7b71c in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1be27 (0x7f699bd7fe27 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d646 (0x7f699bd81646 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x31b9d (0x7f699bd95b9d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x36ccda (0x7f698e16ccda in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x628158 (0x7f698e428158 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x438b28 (0x7f698e238b28 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #10: c10::TensorImpl::~TensorImpl() + 0x1c5 (0x7f699b95c145 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #11: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f699b95c1c9 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #12: <unknown function> + 0x6cfb48 (0x7f698e4cfb48 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x6cff1d (0x7f698e4cff1d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #14: /opt/ktransformers/.ktransformers/bin/python3() [0x52daa2]
frame #15: /opt/ktransformers/.ktransformers/bin/python3() [0x54247d]
frame #16: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #17: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #18: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #19: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #20: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #21: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #22: /opt/ktransformers/.ktransformers/bin/python3() [0x59c1af]
frame #23: /opt/ktransformers/.ktransformers/bin/python3() [0x59991b]
frame #24: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #25: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #26: /opt/ktransformers/.ktransformers/bin/python3() [0x5b3118]
frame #27: /opt/ktransformers/.ktransformers/bin/python3() [0x5cad16]
frame #28: _PyEval_EvalFrameDefault + 0x4bde (0x561a1e in /opt/ktransformers/.ktransformers/bin/python3)
frame #29: PyEval_EvalCode + 0xcc (0x64d9ac in /opt/ktransformers/.ktransformers/bin/python3)
frame #30: /opt/ktransformers/.ktransformers/bin/python3() [0x66da21]
frame #31: /opt/ktransformers/.ktransformers/bin/python3() [0x669a8c]
frame #32: /opt/ktransformers/.ktransformers/bin/python3() [0x65c015]
frame #33: /opt/ktransformers/.ktransformers/bin/python3() [0x65be2c]
frame #34: Py_RunMain + 0x2a9 (0x680df9 in /opt/ktransformers/.ktransformers/bin/python3)
frame #35: Py_BytesMain + 0x2b (0x63d36b in /opt/ktransformers/.ktransformers/bin/python3)
frame #36: <unknown function> + 0x29ca8 (0x7f699e4f4ca8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: __libc_start_main + 0x85 (0x7f699e4f4d65 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: _start + 0x21 (0x63c701 in /opt/ktransformers/.ktransformers/bin/python3)
[Thu Jul 3 20:27:30 2025] python3[621906]: segfault at 7f1dcebab ip 00007f250d5be9a7 sp 00007ffda8b68f40 error 6 in libpage_aligned_memory_pool.so[1f9a7,7f250d5bd000+35000] likely on CPU 81 (core 17, socket 0)
[Thu Jul 3 20:27:30 2025] Code: 00 48 39 d3 73 30 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 8b 81 c0 03 00 00 <c6> 04 18 00 48 ff c3 48 39 d3 75 ed 49 81 e6 00 f0 ff ff f0 4d 29
[EDIT2]:
Uh oh! It seems to be working with the following config:
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((128*1024)) \
--chunk_size 256 \
--max_batch_size 4 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
Indeed the prefill became 2x slower, but now it does not crash around the 98k-token prefill mark. Let's see what it outputs after the full 126k context...
[EDIT3]: oh lol, the client I am using (charmbracelet/mods) just crashed, but ktransformers is still doing the prefill. So I will wait until processing finishes and then send the same query via mods again. Actually, to make sure the disk-backed prefill cache works properly, I will first shut down ktransformers, then let it load the prefill cache from disk, and only after that send the same question. If everything works, it should respond within 20s. Let's see...
[EDIT4]:
omg, it went through the entire context up to 132k tokens, stopped, and apparently did not save the prefill cache.
prefill length: 126174, prefill time: 5941.570858716965, prefill tps 21.23579824262944, decode length: 4895, decode time: 741.1875729560852, decode tps 6.60426615151847
[2025-07-03 22:45:00.137] [debug] [prefix.cpp:1632] Append Tokens to 131056
[2025-07-03 22:45:00.326] [info] [prefix.cpp:1790] GPU Flushed Back 1 cols
[2025-07-03 22:45:00.326] [debug] [gpu_cache.cpp:279] Free Page: 0/8192
[2025-07-03 22:45:00.395] [debug] [prefix.cpp:1719] 61 dirty CPU pages flushed.
2025/07/03 22:45:00.153291|INFO |th=00007F63CB08CC40|async_store.cpp:177|io_perf:IO queue remaining: 0 , processed 0.5006 M. IO count: 0.0610 Kops, 2.2487 M/s
[2025-07-03 22:45:02.115] [info] [scheduler.cpp:760] Finish Query 2
[2025-07-03 22:45:02.642] [info] [scheduler.cpp:661] Batch 0x7f632d8381e0 is not consumed
Shutting down scheduler RPC service...
[W703 22:47:22.872010790 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
==> /root/.ktransformers/logs/rpc.log <==
[2025-07-03 22:53:20.095] [info] [scheduler.cpp:661] Batch 0x7f60f80058c0 is not consumed
[2025-07-03 22:53:20.118] [info] [scheduler.cpp:661] Batch 0x7f60f8005f50 is not consumed
[2025-07-03 22:53:20.462] [info] [scheduler.cpp:661] Batch 0x7f60f80056e0 is not consumed
[2025-07-03 22:53:20.471] [info] [scheduler.cpp:661] Batch 0x7f60f80055f0 is not consumed
[2025-07-03 22:53:20.651] [info] [scheduler.cpp:597] New Query 1 is added
[2025-07-03 22:53:20.651] [info] [scheduler.cpp:720] Preparing Query 1
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:1286] Lookup TokenLength 126174
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 60a6062a5c4da9d3
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash bac7c6447e25116e
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash d61ca2a840b2a41c
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 460976e194ef97b5
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 9f4bf781e0eda9e8
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash b65bfb4764ada1ef
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 9d5c2fcfc75ac239
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 2d19fae6b86768cd
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash aa7932f86890cce6
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 14141805e2925d5c
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 6baa3912e464521e
[2025-07-03 22:53:20.651] [debug] [prefix.cpp:590] Binary Prefix Search: Not Found prefix with hash 5dfe22f04fdd23e3
[2025-07-03 22:53:20.875] [info] [prefix.cpp:1508] New Handles: 499712/499712
[2025-07-03 22:53:38.528] [debug] [prefix.cpp:1294] Found 0, Prompt Length 126174, Estimated Length 131072
[2025-07-03 22:53:38.528] [info] [prefix.cpp:1303] No Match, No need to load
[2025-07-03 22:53:38.554] [info] [prefix.cpp:1536] No match, No need to load to gpu
[2025-07-03 22:53:38.554] [info] [scheduler.cpp:735] Get handle from kvc2 Success.
[2025-07-03 22:53:38.554] [info] [scheduler.cpp:743] Ready Query 1
[2025-07-03 22:53:38.554] [info] [scheduler.cpp:915] Active query 1
[2025-07-03 22:53:38.554] [info] [scheduler.cpp:746] Prefilling Query 1
[2025-07-03 22:53:46.361] [debug] [prefix.cpp:1632] Append Tokens to 254
2025/07/03 22:53:46.990185|INFO |th=0000000000000000|async_store.cpp:66|ArrayStore:Opening /mnt/data/kvc/DeepSeek-R1-0528-GGUF/BF16/key/layer-0.kvc
After I restarted ktransformers, it did not pick up the previous 126k prefill and started the prefill all over again.
[EDIT5]:
the same config, but this time I have not restarted ktransformers yet.
ktransformers \
--gguf_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
--model_path /opt/unsloth/DeepSeek-R1-0528-GGUF/ \
--model_name unsloth/DeepSeek-R1-0528-Q2-GGUF \
--cpu_infer $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--optimize_config_path /opt/unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-V3-Chat-serve.yaml \
--backend_type balance_serve \
--cache_8bit True \
--cache_lens $((128*1024)) \
--length $((128*1024)) \
--max_new_tokens $((128*1024)) \
--chunk_size 256 \
--max_batch_size 4 \
--temperature 0.6 \
--top_p 0.95 \
--host 0.0.0.0 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG \
--use_cuda_graph
[2025-07-04 06:36:04.819] [info] [scheduler.cpp:597] New Query 2 is added
[2025-07-04 06:36:04.819] [info] [scheduler.cpp:720] Preparing Query 2
[2025-07-04 06:36:04.819] [debug] [prefix.cpp:1286] Lookup TokenLength 126174
[2025-07-04 06:36:04.819] [debug] [prefix.cpp:587] Binary Prefix Search: Found prefix with hash 60a6062a5c4da9d3
[2025-07-04 06:36:06.195] [debug] [prefix.cpp:149] Block 7884 -> Disk Location 7884
[2025-07-04 06:36:06.218] [debug] [prefix.cpp:966] Segment IO Submitted, total task count 0
[2025-07-04 06:36:06.218] [info] [prefix.cpp:1301] Loaded to mem
[2025-07-04 06:36:06.227] [info] [gpu_cache.cpp:187] GPU: Evicted 307 GPU pages
[2025-07-04 06:36:06.238] [info] [scheduler.cpp:735] Get handle from kvc2 Success.
[2025-07-04 06:36:06.242] [info] [scheduler.cpp:743] Ready Query 2
[2025-07-04 06:36:06.243] [info] [scheduler.cpp:915] Active query 2
[2025-07-04 06:36:06.243] [info] [scheduler.cpp:746] Prefilling Query 2
[2025-07-04 06:36:06.243] [info] [scheduler.cpp:756] Decoding Query 2
[2025-07-04 06:36:08.645] [debug] [prefix.cpp:1632] Append Tokens to 126176
[2025-07-04 06:36:08.686] [info] [prefix.cpp:1790] GPU Flushed Back 1 cols
[2025-07-04 06:36:08.686] [debug] [gpu_cache.cpp:279] Free Page: 0/8192
[2025-07-04 06:36:08.745] [debug] [prefix.cpp:1719] 61 dirty CPU pages flushed.
2025-07-04 06:44:29,467 INFO /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py[102]: Performance(T/s): prefill 35275.513847621936, decode 6.6112784278642875. Time(s): tokenize 0.8327920436859131, prefill 3.576815366744995, decode 501.1133680343628
Wait a second. How exactly does the prefix cache work? Does it take a snapshot right after the prefill phase, or only after the whole conversation is done? That is, will I be able to send a prefill chunk of text to cache it and then just append additional questions? Because as of now it doesn't seem to support that: one can only continue a conversation by appending questions to it, not "replace" them while reusing the prefix cache stored on disk.
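For reference, the client-side pattern I have in mind is: resend the same long prefix verbatim and only append the new question at the end, so the kvc2 prefix hash lookup can match it. A rough sketch of such a client, assuming the server's OpenAI-compatible /v1/chat/completions endpoint on port 8080 (the file name and exact message layout are just placeholders of mine):

```python
from openai import OpenAI

# Client-side sketch only: whether the disk/CPU prefix cache is reused depends on
# the server seeing an identical token prefix, so the long document is resent
# verbatim and only the final user turn changes.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

LONG_DOCUMENT = open("big_context.txt").read()  # the ~126k-token context

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="unsloth/DeepSeek-R1-0528-Q2-GGUF",
        messages=[
            {"role": "user", "content": LONG_DOCUMENT},  # identical prefix every time
            {"role": "user", "content": question},       # only this part changes
        ],
        temperature=0.6,
        top_p=0.95,
    )
    return resp.choices[0].message.content

print(ask("Summarize the document."))   # first call pays the full prefill
print(ask("List the open questions."))  # later calls should hit the prefix cache
```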
Why is there no API to dump (or restore) the KV cache on demand? Right now the cache just grows in size and I am not sure how to work with it. It's basically one binary blob per layer. Why weren't separate files used, for example one per prefill batch?
UPDATE!
Regarding the failure at:
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
DeepSeek V3 0324 suggested the following patch to resolve the issue:
./ktransformers/server/balance_serve/inference/sampling/sampler.py
7a8
> import time
53a55,60
> def to(self, device: torch.device):
> #"""Move all tensors to the specified device"""
> for attr in ['temperatures', 'top_ps', 'top_ks', 'min_ps']:
> if hasattr(self, attr) and getattr(self, attr) is not None:
> setattr(self, attr, getattr(self, attr).to(device))
>
56a64,72
> self.last_sync_time = time.time()
>
> def sync_if_needed(self):
> """Synchronize CUDA operations if enough time has passed"""
> current_time = time.time()
> if current_time - self.last_sync_time > 0.1: # Sync every 100ms
> if torch.cuda.is_available():
> torch.cuda.synchronize()
> self.last_sync_time = current_time
63,65c79,94
< if sampling_config == None:
< sampling_config = SamplingOptions()
<
---
> # Ensure proper device synchronization
> self.sync_if_needed()
>
> # Create sampling config if not provided
> if sampling_config is None:
> sampling_config = SamplingOptions(bsz=logits.size(0),
> device=logits.device)
> else:
> # Ensure sampling config is on the correct device
> sampling_config.to(logits.device)
>
> # Verify device consistency
> if logits.device != sampling_config.temperatures.device:
> raise ValueError(f"Device mismatch: logits on {logits.device}, "
> f"temperatures on {sampling_config.temperatures.device}")
>
68,89c97,100
< if sampling_config.is_all_greedy:
< # Use torch.argmax if all requests use greedy sampling
< probs = logits
< batch_next_token_ids = torch.argmax(logits, -1)
< else:
< # Post process logits
< logits.div_(sampling_config.temperatures)
< max_top_k_round, batch_size = 32, logits.shape[0]
< if sampling_config.need_min_p_sampling:
< probs = torch.softmax(logits, dim=-1)
< logits = None
< del logits
< probs = top_k_renorm_probs(probs, sampling_config.top_ks)
< probs = top_p_renorm_probs(probs, sampling_config.top_ps)
< batch_next_token_ids = min_p_sampling_from_probs(
< probs, sampling_config.min_ps
< )
< temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
< else:
< # TODO: use different kernel when don't need top_k or top_p
< # @TODO get probs
---
>
> try:
> if sampling_config.is_all_greedy:
> # Use torch.argmax if all requests use greedy sampling
91,98c102,152
< batch_next_token_ids = top_k_top_p_sampling_from_logits(
< logits,
< sampling_config.top_ks,
< sampling_config.top_ps,
< filter_apply_order="joint",
< )
< temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> batch_next_token_ids = torch.argmax(logits, -1)
> else:
> # Post process logits
> logits.div_(sampling_config.temperatures)
> max_top_k_round, batch_size = 32, logits.shape[0]
> if sampling_config.need_min_p_sampling:
> probs = torch.softmax(logits, dim=-1)
> logits = None
> del logits
> probs = top_k_renorm_probs(probs, sampling_config.top_ks)
> probs = top_p_renorm_probs(probs, sampling_config.top_ps)
> batch_next_token_ids = min_p_sampling_from_probs(
> probs, sampling_config.min_ps
> )
> temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
> else:
> # TODO: use different kernel when don't need top_k or top_p
> # @TODO get probs
> probs = logits
> batch_next_token_ids = top_k_top_p_sampling_from_logits(
> logits,
> sampling_config.top_ks,
> sampling_config.top_ps,
> filter_apply_order="joint",
> )
> temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
>
> # Add synchronization before the problematic operation
> if torch.cuda.is_available():
> torch.cuda.synchronize()
>
> # Safe temperature index calculation
> if sampling_config.temperatures.numel() > 0:
> # Use in-place operation to avoid creating intermediate tensors
> temperature_0_idx = (sampling_config.temperatures == 0).nonzero(as_tuple=True)[0]
>
> # Add synchronization after operation
> if torch.cuda.is_available():
> torch.cuda.synchronize()
>
> # Only process if indices are found
> if temperature_0_idx.numel() > 0:
> # Ensure indices are on correct device
> temperature_0_idx = temperature_0_idx.to(logits.device)
>
> # Safe indexing
> if temperature_0_idx.max() < origin_logits.shape[0]:
> greedy_tokens = torch.argmax(origin_logits[temperature_0_idx], -1)
> batch_next_token_ids[temperature_0_idx] = greedy_tokens.to(torch.int32)
100c154,164
< return batch_next_token_ids.to(torch.int32), probs
\ No newline at end of file
---
> return batch_next_token_ids.to(torch.int32), probs
>
> except RuntimeError as e:
> # Add detailed error information
> if "CUDA" in str(e):
> current_device = torch.cuda.current_device() if torch.cuda.is_available() else -1
> mem_info = torch.cuda.memory_summary() if torch.cuda.is_available() else "No CUDA memory info"
> logger.error(f"CUDA error occurred: {str(e)}")
> logger.error(f"Current device: {current_device}")
> logger.error(f"Memory info:\n{mem_info}")
> raise
Let me check whether that solves the issue [...]
related:
https://github.com/kvcache-ai/ktransformers/issues/1269 https://github.com/kvcache-ai/ktransformers/issues/1324 https://github.com/kvcache-ai/ktransformers/issues/1376 https://github.com/kvcache-ai/ktransformers/issues/1062
Nope, still having the same issue.
CUDA error occurred: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Current device: 0
Memory info:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 12121 MiB | 12400 MiB | 912 GiB | 900 GiB |
| from large pool | 12007 MiB | 12400 MiB | 881 GiB | 869 GiB |
| from small pool | 114 MiB | 118 MiB | 30 GiB | 30 GiB |
|---------------------------------------------------------------------------|
| Active memory | 12121 MiB | 12400 MiB | 912 GiB | 900 GiB |
| from large pool | 12007 MiB | 12400 MiB | 881 GiB | 869 GiB |
| from small pool | 114 MiB | 118 MiB | 30 GiB | 30 GiB |
|---------------------------------------------------------------------------|
| Requested memory | 12121 MiB | 12400 MiB | 911 GiB | 900 GiB |
| from large pool | 12007 MiB | 12400 MiB | 881 GiB | 869 GiB |
| from small pool | 114 MiB | 118 MiB | 30 GiB | 30 GiB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 12526 MiB | 14182 MiB | 178026 MiB | 165500 MiB |
| from large pool | 12400 MiB | 14180 MiB | 177800 MiB | 165400 MiB |
| from small pool | 126 MiB | 126 MiB | 226 MiB | 100 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 2022 | 2026 | 814 K | 812 K |
| from large pool | 812 | 819 | 60 K | 59 K |
| from small pool | 1210 | 1214 | 754 K | 753 K |
|---------------------------------------------------------------------------|
| Active allocs | 2022 | 2026 | 814 K | 812 K |
| from large pool | 812 | 819 | 60 K | 59 K |
| from small pool | 1210 | 1214 | 754 K | 753 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Oversize allocations | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Oversize GPU segments | 0 | 0 | 0 | 0 |
|===========================================================================|
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
engine.loop()
~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
generated_tokens, probs = self.sampling( self.model_runner.output)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
generated_tokens, probs=self.sampler(logit, sample_options)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 128, in forward
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f59c5d7ef00 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111a7 (0x7f59c619a1a7 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c3cb (0x7f59c61a53cb in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1771c (0x7f59c61a071c in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1be27 (0x7f59c61a4e27 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d646 (0x7f59c61a6646 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x31b9d (0x7f59c61bab9d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x36ccda (0x7f59b856ccda in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x628158 (0x7f59b8828158 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x438b28 (0x7f59b8638b28 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #10: c10::TensorImpl::~TensorImpl() + 0x1c5 (0x7f59c5d5c145 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #11: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f59c5d5c1c9 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #12: <unknown function> + 0x6cfb48 (0x7f59b88cfb48 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x6cff1d (0x7f59b88cff1d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #14: /opt/ktransformers/.ktransformers/bin/python3() [0x52daa2]
frame #15: /opt/ktransformers/.ktransformers/bin/python3() [0x54247d]
frame #16: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #17: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #18: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #19: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #20: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #21: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #22: /opt/ktransformers/.ktransformers/bin/python3() [0x59c1af]
frame #23: /opt/ktransformers/.ktransformers/bin/python3() [0x59991b]
frame #24: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #25: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #26: /opt/ktransformers/.ktransformers/bin/python3() [0x5b3118]
frame #27: /opt/ktransformers/.ktransformers/bin/python3() [0x5cad16]
frame #28: _PyEval_EvalFrameDefault + 0x4bde (0x561a1e in /opt/ktransformers/.ktransformers/bin/python3)
frame #29: PyEval_EvalCode + 0xcc (0x64d9ac in /opt/ktransformers/.ktransformers/bin/python3)
frame #30: /opt/ktransformers/.ktransformers/bin/python3() [0x66da21]
frame #31: /opt/ktransformers/.ktransformers/bin/python3() [0x669a8c]
frame #32: /opt/ktransformers/.ktransformers/bin/python3() [0x65c015]
frame #33: /opt/ktransformers/.ktransformers/bin/python3() [0x65be2c]
frame #34: Py_RunMain + 0x2a9 (0x680df9 in /opt/ktransformers/.ktransformers/bin/python3)
frame #35: Py_BytesMain + 0x2b (0x63d36b in /opt/ktransformers/.ktransformers/bin/python3)
frame #36: <unknown function> + 0x29ca8 (0x7f59c8906ca8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: __libc_start_main + 0x85 (0x7f59c8906d65 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: _start + 0x21 (0x63c701 in /opt/ktransformers/.ktransformers/bin/python3)
So the issue appears to be related to... the sampler's inability to pick up the index of the tensor entry that has zero temperature?
Let's try this one:
86c86,87
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
98c99,100
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
[EDIT]:
also, related to division by zero and NaN values (does this make sense?):
74c74,76
< logits.div_(sampling_config.temperatures)
---
> #logits.div_(sampling_config.temperatures)
> safe_temperatures = sampling_config.temperatures.masked_fill(sampling_config.temperatures == 0, 1.0)
> logits.div_(safe_temperatures)
[EDIT2]:
so the aforementioned safety check would prevent the write into the batch_next_token_ids array. So... why wasn't it implemented in the first place?
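Putting the two guards together, the patched region of forward() would look roughly like this (a paraphrase of the diffs above using only standard PyTorch ops, so treat it as a sketch rather than the exact upstream sampler):

```python
import torch

def guarded_sampling_step(logits: torch.Tensor,
                          temperatures: torch.Tensor,
                          batch_next_token_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of the two guards discussed above (not the upstream sampler itself).

    `temperatures` is assumed to be a 1-D tensor of shape [batch]; the top-k/top-p
    sampling that normally sits between the two guards is omitted here.
    """
    origin_logits = logits.clone()

    # Guard 1: zero temperatures would turn the division into inf/NaN,
    # so substitute 1.0 for them before dividing.
    safe_temperatures = temperatures.masked_fill(temperatures == 0, 1.0)
    logits = logits / safe_temperatures.unsqueeze(-1)

    # ... top-k / top-p sampling over `logits` would happen here ...

    # Guard 2: only overwrite entries with greedy tokens when at least one
    # request actually uses temperature == 0.
    temperature_0_idx = torch.where(temperatures == 0)[0]
    if temperature_0_idx.numel() > 0:
        greedy = torch.argmax(origin_logits[temperature_0_idx], dim=-1)
        batch_next_token_ids[temperature_0_idx] = greedy.to(torch.int32)
    return batch_next_token_ids
```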
[EDIT3]:
lol still the same error
prefill_batch_i: 254,
Model execution time (GPU): 26489.635 ms, 0.038 tokens/s
2352
prefill_batch_i: 254,
Model execution time (GPU): 26605.902 ms, 0.038 tokens/s
73
prefill_batch_i: 254,
Model execution time (GPU): 26662.541 ms, 0.038 tokens/s
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
engine.loop()
~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
generated_tokens, probs = self.sampling( self.model_runner.output)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
generated_tokens, probs=self.sampler(logit, sample_options)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 100, in forward
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f4c02ed9f00 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111a7 (0x7f4c02f6c1a7 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c3cb (0x7f4c02f773cb in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1771c (0x7f4c02f7271c in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1be27 (0x7f4c02f76e27 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d646 (0x7f4c02f78646 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x31b9d (0x7f4c02f8cb9d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x36ccda (0x7f4bf576ccda in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x628158 (0x7f4bf5a28158 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x438b28 (0x7f4bf5838b28 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #10: c10::TensorImpl::~TensorImpl() + 0x1c5 (0x7f4c02eb7145 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #11: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f4c02eb71c9 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #12: <unknown function> + 0x6cfb48 (0x7f4bf5acfb48 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x6cff1d (0x7f4bf5acff1d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #14: /opt/ktransformers/.ktransformers/bin/python3() [0x52daa2]
frame #15: /opt/ktransformers/.ktransformers/bin/python3() [0x54247d]
frame #16: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #17: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #18: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #19: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #20: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #21: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #22: /opt/ktransformers/.ktransformers/bin/python3() [0x59c1af]
frame #23: /opt/ktransformers/.ktransformers/bin/python3() [0x59991b]
frame #24: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #25: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #26: /opt/ktransformers/.ktransformers/bin/python3() [0x5b3118]
frame #27: /opt/ktransformers/.ktransformers/bin/python3() [0x5cad16]
frame #28: _PyEval_EvalFrameDefault + 0x4bde (0x561a1e in /opt/ktransformers/.ktransformers/bin/python3)
frame #29: PyEval_EvalCode + 0xcc (0x64d9ac in /opt/ktransformers/.ktransformers/bin/python3)
frame #30: /opt/ktransformers/.ktransformers/bin/python3() [0x66da21]
frame #31: /opt/ktransformers/.ktransformers/bin/python3() [0x669a8c]
frame #32: /opt/ktransformers/.ktransformers/bin/python3() [0x65c015]
frame #33: /opt/ktransformers/.ktransformers/bin/python3() [0x65be2c]
frame #34: Py_RunMain + 0x2a9 (0x680df9 in /opt/ktransformers/.ktransformers/bin/python3)
frame #35: Py_BytesMain + 0x2b (0x63d36b in /opt/ktransformers/.ktransformers/bin/python3)
frame #36: <unknown function> + 0x29ca8 (0x7f4c05a0fca8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: __libc_start_main + 0x85 (0x7f4c05a0fd65 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: _start + 0x21 (0x63c701 in /opt/ktransformers/.ktransformers/bin/python3)
so it's apparently related to a sync issue between the CPU and GPU, huh?
[EDIT4]:
okay, let's try this one:
66c66,74
< logits = logits.contiguous()
---
> #logits = logits.contiguous()
> # Ensure all tensors are on the same device
> device = logits.device
> logits = logits.contiguous().to(device)
> sampling_config.temperatures = sampling_config.temperatures.to(device)
>
> # Add synchronization
> torch.cuda.synchronize()
>
74c82,84
< logits.div_(sampling_config.temperatures)
---
> #logits.div_(sampling_config.temperatures)
> safe_temperatures = sampling_config.temperatures.masked_fill(sampling_config.temperatures == 0, 1.0)
> logits.div_(safe_temperatures)
84a95
> torch.cuda.synchronize()
86c97,98
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
96a109
> torch.cuda.synchronize()
98c111,112
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
Apparently the solution would look like:
./ktransformers/server/balance_serve/inference/sampling/sampler.py
66c66,71
< logits = logits.contiguous()
---
> #logits = logits.contiguous()
> # Ensure all tensors are on the same device
> device = logits.device
> logits = logits.contiguous().to(device)
> sampling_config.temperatures = sampling_config.temperatures.to(device)
>
74c79,81
< logits.div_(sampling_config.temperatures)
---
> #logits.div_(sampling_config.temperatures)
> safe_temperatures = sampling_config.temperatures.masked_fill(sampling_config.temperatures == 0, 1.0)
> logits.div_(safe_temperatures)
84a92
> torch.cuda.synchronize()
86c94,95
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
96a106
> torch.cuda.synchronize()
98c108,109
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
100c111
< return batch_next_token_ids.to(torch.int32), probs
\ No newline at end of file
---
> return batch_next_token_ids.to(torch.int32), probs
Explanation:
### Why the Synchronization is Needed Where You've Kept It
1. CUDA Graph Safety: The synchronization right before torch.where(sampling_config.temperatures == 0)[0] is crucial because:
• The temperatures tensor might have been modified by previous operations that ran asynchronously on the GPU
• torch.where() is a synchronization point that needs consistent data
2. Race Condition Prevention: Without this sync, there could be a race condition where:
• The GPU is still processing the temperature modifications
• The CPU tries to read the temperatures for the where() operation before they're ready
3. Correct Placement: Your sync calls are now optimally placed - right before the operations that actually need the synchronized state
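For readability, here is roughly what the patched non-greedy sampling path boils down to once the hunks above are applied. This is a sketch reconstructed from the diffs, not the verbatim upstream code; the standalone function and its arguments just stand in for the fields of sampling_config inside the real Sampler.forward, and the arguments passed to top_k_top_p_sampling_from_logits follow the call visible in the traceback:

# Sketch (reconstructed from the hunks above) of the patched non-greedy sampling path.
# temperatures is expected to broadcast against logits (e.g. shape [batch, 1]).
import torch
from flashinfer.sampling import top_k_top_p_sampling_from_logits

def sample_non_greedy(logits, temperatures, top_ks, top_ps):
    device = logits.device
    logits = logits.contiguous().to(device)
    temperatures = temperatures.to(device)
    origin_logits = logits.clone()

    # Greedy requests use temperature == 0; substitute 1.0 so div_ stays finite.
    # Those rows get overwritten with argmax below anyway.
    safe_temperatures = temperatures.masked_fill(temperatures == 0, 1.0)
    logits.div_(safe_temperatures)

    batch_next_token_ids = top_k_top_p_sampling_from_logits(
        logits, top_ks, top_ps, filter_apply_order="joint")

    # Make sure the sampling kernel has finished before indexing its output.
    torch.cuda.synchronize()
    temperature_0_idx = torch.where(temperatures == 0)[0]
    if temperature_0_idx.numel() > 0:
        batch_next_token_ids[temperature_0_idx] = torch.argmax(
            origin_logits[temperature_0_idx], -1).to(torch.int32)

    return batch_next_token_ids.to(torch.int32)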
[EDIT]: not sure if torch.cuda.synchronize() is necessary, since .to() is supposed to do the sync.
lol no errors yet
So what does that all mean? All this time there was a race condition in the balance_serve backend sampler of ktransformers which led to the invalid memory read? Is that correct?
[UPDATE]: a different error!
decode_batch_i: 1,
Model execution time (GPU): 220.592 ms, 4.533 tokens/s
722
decode_batch_i: 1,
Model execution time (GPU): 220.786 ms, 4.529 tokens/sCUDA Error: an illegal memory access was encountered (700) /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/flashinfer/data/include/flashinfer/sampling.cuh: line 927 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
engine.loop()
~~~~~~~~~~~^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 260, in loop
generated_tokens, probs = self.sampling( self.model_runner.output)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 224, in sampling
generated_tokens, probs=self.sampler(logit, sample_options)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py", line 100, in forward
batch_next_token_ids = top_k_top_p_sampling_from_logits(
logits,
...<2 lines>...
filter_apply_order="joint",
)
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/flashinfer/sampling.py", line 818, in top_k_top_p_sampling_from_logits
return get_sampling_module().top_k_top_p_sampling_from_probs(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
probs,
^^^^^^
...<4 lines>...
generator,
^^^^^^^^^^
)
^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/flashinfer/sampling.py", line 214, in top_k_top_p_sampling_from_probs
module.top_k_top_p_sampling_from_probs.default(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
probs,
^^^^^^
...<7 lines>...
generator,
^^^^^^^^^^
)
^
File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/_ops.py", line 829, in __call__
return self._op(*args, **kwargs)
~~~~~~~~^^^^^^^^^^^^^^^^^
RuntimeError: TopKTopPSamplingFromProbs failed with error code an illegal memory access was encountered
terminate called after throwing an instance of 'c10::AcceleratorError'
what(): CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fdcda0d9f00 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111a7 (0x7fdcda16c1a7 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c3cb (0x7fdcda1773cb in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1771c (0x7fdcda17271c in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1be27 (0x7fdcda176e27 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d646 (0x7fdcda178646 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x31b9d (0x7fdcda18cb9d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0x36ccda (0x7fdccc96ccda in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x628158 (0x7fdcccc28158 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x438b28 (0x7fdccca38b28 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #10: c10::TensorImpl::~TensorImpl() + 0x1c5 (0x7fdcda0b7145 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #11: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fdcda0b71c9 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libc10.so)
frame #12: <unknown function> + 0x6cfb48 (0x7fdcccccfb48 in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x6cff1d (0x7fdcccccff1d in /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/lib/libtorch_python.so)
frame #14: /opt/ktransformers/.ktransformers/bin/python3() [0x52daa2]
frame #15: /opt/ktransformers/.ktransformers/bin/python3() [0x54247d]
frame #16: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #17: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #18: /opt/ktransformers/.ktransformers/bin/python3() [0x54219a]
frame #19: /opt/ktransformers/.ktransformers/bin/python3() [0x5cb058]
frame #20: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #21: /opt/ktransformers/.ktransformers/bin/python3() [0x5cab4a]
frame #22: /opt/ktransformers/.ktransformers/bin/python3() [0x59c1af]
frame #23: /opt/ktransformers/.ktransformers/bin/python3() [0x59991b]
frame #24: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #25: /opt/ktransformers/.ktransformers/bin/python3() [0x59996f]
frame #26: /opt/ktransformers/.ktransformers/bin/python3() [0x5b3118]
frame #27: _PyEval_EvalFrameDefault + 0x4bde (0x561a1e in /opt/ktransformers/.ktransformers/bin/python3)
frame #28: PyEval_EvalCode + 0xcc (0x64d9ac in /opt/ktransformers/.ktransformers/bin/python3)
frame #29: /opt/ktransformers/.ktransformers/bin/python3() [0x66da21]
frame #30: /opt/ktransformers/.ktransformers/bin/python3() [0x669a8c]
frame #31: /opt/ktransformers/.ktransformers/bin/python3() [0x65c015]
frame #32: /opt/ktransformers/.ktransformers/bin/python3() [0x65be2c]
frame #33: Py_RunMain + 0x2a9 (0x680df9 in /opt/ktransformers/.ktransformers/bin/python3)
frame #34: Py_BytesMain + 0x2b (0x63d36b in /opt/ktransformers/.ktransformers/bin/python3)
frame #35: <unknown function> + 0x29ca8 (0x7fdcdcc17ca8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x85 (0x7fdcdcc17d65 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: _start + 0x21 (0x63c701 in /opt/ktransformers/.ktransformers/bin/python3)
Another error lol
Jul 05 13:35:11 xxx run-ktransformers.sh[897358]: 2025-07-05 13:35:11,870 DEBUG /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/inter faces/balance_serve.py[418]: get input ids of shape torch.Size([1, 86410])
Jul 05 13:35:12 xxx run-ktransformers.sh[897358]: INFO: 192.168.1.144:34744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Jul 05 13:35:12 xxx run-ktransformers.sh[897358]: 2025-07-05 13:35:12,314 DEBUG /opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/inter
faces/balance_serve.py[418]: get input ids of shape torch.Size([1, 86410])
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: Process SpawnProcess-1:
B
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: Traceback (most recent call last):
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/usr/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self.run()
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/usr/lib/python3.13/multiprocessing/process.py", line 108, in run
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self._target(*self._args, **self._kwargs)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py
", line 282, in run_engine
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: engine.loop()
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/balance_serve.py
", line 234, in loop
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self.model_runner.run(self.batch, self.query_manager)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/model_runne
r.py", line 202, in run
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self.model_attn_plan(self.input[cuda_graph_idx], cuda_graph_idx)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/model_runne
r.py", line 92, in model_attn_plan
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self.model.flash_infer_attn_plan(batch, self.bsz_tensor_buf, self.num_tokens_tensor_buf,
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: num_heads=self.model.config.num_attention_heads, head_dim_ckv=self.model.config.kv_lora_ra
nk,
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: head_dim_kpe=self.model.config.qk_rope_head_dim, page_size=self.model.cache.page_size, cau
sal=True,
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: sm_scale=self.model.model.layers[0].self_attn.softmax_scale, q_data_type=torch.bfloat16, k
v_data_type=torch.bfloat16)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py", li
ne 146, in flash_infer_attn_plan
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self.wrapper.plan(minibatch.q_indptr, minibatch.kv_indptr, minibatch.kv_indices,
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: minibatch.kv_len, num_heads, head_dim_ckv, head_dim_kpe, page_size, causal, sm_scale, q_data_type, kv_dat
a_type, bsz_tensors)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/flashinfer/mla.py", line 248, in plan
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: self._kv_indices_buf[: len(kv_indices)].copy_(kv_indices, non_blocking=True)
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 05 13:35:38 xxx run-ktransformers.sh[897447]: RuntimeError: The size of tensor a (8192) must match the size of tensor b (10917) at non-singleton dimension 0
Jul 05 13:35:47 xxx run-ktransformers.sh[897447]: Model execution time (GPU): 202.588 ms, 4.936 tokens/s
Jul 05 13:35:47 xxx run-ktransformers.sh[897447]: 274
Jul 05 13:35:47 xxx run-ktransformers.sh[897447]: 603
Jul 05 13:35:47 xxx run-ktransformers.sh[897447]: decode_batch_i: 2,
[UPDATE]: I will try to vibecode myself out of these bugs with the DeepSeek V3/R1 UD_Q2/Q4 quants. But lol, if DeepSeek's interpretation of the bugs is right, then something is truly wrong with the ktransformers codebase. It's a great framework, don't get me wrong, but the stability issues are driving me nuts. I suggest we fix it. Maybe we could implement some unit tests to see how the different backends behave in different environments (prefill length, max tokens, max batches, etc.)? Otherwise a lot of people keep catching a lot of annoying bugs, which is a major inconvenience!
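To illustrate the idea, a parametrized smoke test could sweep a few of those settings against a running server and just check that the OpenAI-compatible endpoint still answers. This is only a sketch under assumptions: BASE_URL, the model name and the parameter grid are made up here, and something (not shown) would have to restart the server with matching --cache_lens / --max_batch_size per configuration:

# Hypothetical smoke test for the suggestion above; all constants are assumptions.
import pytest
import requests

BASE_URL = "http://127.0.0.1:8080"
MODEL = "unsloth/DeepSeek-R1-0528-Q2-GGUF"

@pytest.mark.parametrize("prompt_words", [128, 8192, 65536, 120000])
@pytest.mark.parametrize("max_tokens", [16, 1024])
def test_chat_completion_smoke(prompt_words, max_tokens):
    # Roughly one token per repeated word; good enough to exercise long prefill.
    prompt = "ping " * prompt_words
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": False,
        },
        timeout=3600,
    )
    assert resp.status_code == 200, resp.text
    content = resp.json()["choices"][0]["message"]["content"]
    assert content and content.strip(), "empty or missing completion"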
[UPDATE2]: Holy cow! It apparently worked! No crashes yet! I will post the exact fixes later on, after thorough hardcore testing. Let me know if I should provide the exact patches (which are quite trivial, BTW) or make a pull request or something.
I will provide the draft of the fix below. DeepSeek R1/V3 with long context doesn't output any gibberish or any runtime error [anymore] thanks to this fix. It seems to work fine. In case anyone would like to help me debug further, they are welcome to check this out:
diff .ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py.bak .ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py
45c45,46
< self.workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8).to(0)
---
> # Increase buffer sizes to be safe
> self.workspace_buffer = torch.empty(256 * 1024 * 1024, dtype=torch.int8).to(0)
48c49,50
< self.paged_kv_indices_buf = torch.empty((max_pages,), dtype=torch.int32, device=device)
---
> # Make sure this buffer is large enough
> self.paged_kv_indices_buf = torch.empty((max_pages * 2,), dtype=torch.int32, device=device)
51d52
<
55,56c56,59
< qo_indptr=self.qo_indptr_buf,kv_indptr=self.paged_kv_indptr_buf,
< kv_indices=self.paged_kv_indices_buf,kv_len_arr=self.paged_kv_len_buf,
---
> qo_indptr=self.qo_indptr_buf,
> kv_indptr=self.paged_kv_indptr_buf,
> kv_indices=self.paged_kv_indices_buf,
> kv_len_arr=self.paged_kv_len_buf,
58c61
< backend = "fa2",
---
> backend="fa2",
148c151
<
\ No newline at end of file
---
>
diff .ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py.bak .ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py
55c55
< def __init__(self):
---
> def __init__(self, device=torch.device('cuda')):
56a57
> self.device = device
66c67,71
< logits = logits.contiguous()
---
> # Ensure all tensors are on the same device
> device = logits.device
> logits = logits.contiguous().to(device)
> sampling_config.temperatures = sampling_config.temperatures.to(device)
>
74c79,80
< logits.div_(sampling_config.temperatures)
---
> safe_temperatures = sampling_config.temperatures.masked_fill(sampling_config.temperatures == 0, 1.0)
> logits.div_(safe_temperatures)
84a91
> torch.cuda.synchronize()
86c93,94
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
96a105
> torch.cuda.synchronize()
98c107,108
< batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
---
> if temperature_0_idx.numel() > 0:
> batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
100c110
< return batch_next_token_ids.to(torch.int32), probs
\ No newline at end of file
---
> return batch_next_token_ids.to(torch.int32), probs
~~DISCLAIMER ATTN! The code above IS NOT ready for any deployment! The code has been written by the LLM so I am not (yet) responsible for any of it!~~
The code is pretty much ready for deployment.
@createthis
the urgent help of the nitro coffee drinkers is required!!!
I’ll take a look after dinner!
No crashes with the latest fix (above). Also, I'm not sure why loading the cache from the storage device is turned off by default. Here is how to enable it:
jq '.load_from_disk = true' /mnt/data/kvc/config.json | sponge /mnt/data/kvc/config.json
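If sponge (from moreutils) isn't installed, the same toggle can be flipped with a few lines of Python; the path and the load_from_disk key come from the command above, the rest is just a generic JSON edit:

# Equivalent of the jq one-liner above, without sponge.
import json
from pathlib import Path

cfg_path = Path("/mnt/data/kvc/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["load_from_disk"] = True   # enable loading the kv-cache from disk
cfg_path.write_text(json.dumps(cfg, indent=2))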
The prefix caching is working great (even after a daemon restart). The only thing left to do is cache management, etc.
@magikRUKKOLA Sorry I didn't get to it last night. I tried to apply these diffs this morning and my patch tool is saying:
patch: **** Only garbage was found in the patch input.
I tried having ChatGPT convert the diffs to unified format, but there's a section of the second diff that has me wondering if we're patching from the same commit of the code. There's no context in your diff at that spot, so I have no idea where my chunk should go.
Can you do one of these, please:
1.) fork ktransformers to your github, then create a branch and push it, then let me know the branch so I can check it out
or
2.) Reply with the commit your code is currently at (git log, copy-paste the first commit hash), and then reply with unified diffs: git diff, git diff -uwb, or diff -u <old> <new>.
Thanks!
@createthis
these are with diff -u
--- .ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py.bak 2025-07-03 18:59:17.698822263 +0000
+++ .ktransformers/lib/python3.13/site-packages/ktransformers/server/balance_serve/inference/sampling/sampler.py 2025-07-06 15:06:34.191635341 +0000
@@ -52,8 +52,9 @@
self.is_all_greedy = False
class Sampler(nn.Module):
- def __init__(self):
+ def __init__(self, device=torch.device('cuda')):
super().__init__()
+ self.device = device
def forward(
self,
@@ -63,7 +64,11 @@
if sampling_config == None:
sampling_config = SamplingOptions()
- logits = logits.contiguous()
+ # Ensure all tensors are on the same device
+ device = logits.device
+ logits = logits.contiguous().to(device)
+ sampling_config.temperatures = sampling_config.temperatures.to(device)
+
origin_logits = logits.clone()
if sampling_config.is_all_greedy:
# Use torch.argmax if all requests use greedy sampling
@@ -71,7 +76,8 @@
batch_next_token_ids = torch.argmax(logits, -1)
else:
# Post process logits
- logits.div_(sampling_config.temperatures)
+ safe_temperatures = sampling_config.temperatures.masked_fill(sampling_config.temperatures == 0, 1.0)
+ logits.div_(safe_temperatures)
max_top_k_round, batch_size = 32, logits.shape[0]
if sampling_config.need_min_p_sampling:
probs = torch.softmax(logits, dim=-1)
@@ -82,8 +88,10 @@
batch_next_token_ids = min_p_sampling_from_probs(
probs, sampling_config.min_ps
)
+ torch.cuda.synchronize()
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
- batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
+ if temperature_0_idx.numel() > 0:
+ batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
else:
# TODO: use different kernel when don't need top_k or top_p
# @TODO get probs
@@ -94,7 +102,9 @@
sampling_config.top_ps,
filter_apply_order="joint",
)
+ torch.cuda.synchronize()
temperature_0_idx = torch.where(sampling_config.temperatures == 0)[0]
- batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
+ if temperature_0_idx.numel() > 0:
+ batch_next_token_ids[temperature_0_idx] = torch.argmax(origin_logits[temperature_0_idx], -1).to(torch.int32)
- return batch_next_token_ids.to(torch.int32), probs
\ No newline at end of file
+ return batch_next_token_ids.to(torch.int32), probs
--- .ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py.bak 2025-07-05 17:42:32.784284649 +0000
+++ .ktransformers/lib/python3.13/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py 2025-07-05 17:12:08.724268395 +0000
@@ -42,20 +42,23 @@
def init_wrapper(self, use_cuda_graph, device, max_batch_size, max_pages):
self.use_cuda_graph = use_cuda_graph
- self.workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8).to(0)
+ # Increase buffer sizes to be safe
+ self.workspace_buffer = torch.empty(256 * 1024 * 1024, dtype=torch.int8).to(0)
self.qo_indptr_buf = torch.empty((max_batch_size+2,), dtype=torch.int32, device=device)
self.paged_kv_indptr_buf = torch.empty((max_batch_size+2,), dtype=torch.int32, device=device)
- self.paged_kv_indices_buf = torch.empty((max_pages,), dtype=torch.int32, device=device)
+ # Make sure this buffer is large enough
+ self.paged_kv_indices_buf = torch.empty((max_pages * 2,), dtype=torch.int32, device=device)
self.paged_kv_len_buf = torch.empty((max_batch_size+1,), dtype=torch.int32, device=device)
self.bsz_tensor_buf = torch.empty((1, ), dtype=torch.int32, device=device)
-
self.wrapper = flashinfer.mla.BatchMLAPagedAttentionWrapper(
self.workspace_buffer, use_cuda_graph=use_cuda_graph,
- qo_indptr=self.qo_indptr_buf,kv_indptr=self.paged_kv_indptr_buf,
- kv_indices=self.paged_kv_indices_buf,kv_len_arr=self.paged_kv_len_buf,
+ qo_indptr=self.qo_indptr_buf,
+ kv_indptr=self.paged_kv_indptr_buf,
+ kv_indices=self.paged_kv_indices_buf,
+ kv_len_arr=self.paged_kv_len_buf,
bsz_tensor=self.bsz_tensor_buf,
- backend = "fa2",
+ backend="fa2",
)
def batch_embeddings(self, batch: ForwardBatchInput, device="cuda:0"):
@@ -145,4 +148,4 @@
minibatch = batch.minibatch
self.wrapper.plan(minibatch.q_indptr, minibatch.kv_indptr, minibatch.kv_indices,
minibatch.kv_len, num_heads, head_dim_ckv, head_dim_kpe, page_size, causal, sm_scale, q_data_type, kv_data_type, bsz_tensors)
-
\ No newline at end of file
+
Ah. There are mixed tab and space characters in those files, and therefore in the diffs, which makes them hard to apply - probably because GitHub or markdown is stripping the tab characters out. I had to apply them manually. Python. 🙄
I've put this into a branch on my fork and issued a PR for it: https://github.com/kvcache-ai/ktransformers/pull/1422
It's still compiling on my end. I'll test as soon as it's done.
Tested with:
python ktransformers/server/main.py \
--port 11434 \
--model_path /data/DeepSeek-V3 \
--model_name "DeepSeek-V3-0324:671b-q4_k_xl" \
--gguf_path /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL \
--optimize_config_path /home/jesse/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--temperature 0.3 \
--cpu_infer 30 \
--cache_lens 131072 \
--chunk_size 256 \
--max_new_tokens 1024 \
--backend_type ktransformers
I realize this is mostly a balance_serve PR, but all of my testing to date has focused on the ktransformers backend, so that's all I can test with confidence.
flow1.sh still crashes the ktransformers backend in one shot with this patch applied:
jesse@Jesses-MacBook-Pro ktransformers_8dc1ab9_bug_deepseek_v3_0324 % ./flow1.sh
Internal Server Error%
The backtrace looks like this:
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/ktransformers/server/api/openai/endpoints/chat.py", line 436, in chat_completion
async for res in interface.inference(input_message, id, create.temperature, create.top_p, create.max_tokens, create.max_completion_tokens):
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 241, in inference
async for v in super().inference(local_messages, thread_id, temperature, top_p, max_tokens, max_completion_tokens):
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 466, in inference
for t in self.prefill(input_ids, self.check_is_new(thread_id), temperature, top_p, max_tokens, max_completion_tokens):
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 230, in prefill
next_token = self.logits_to_token(logits[0, -1, :])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jesse/anaconda3/envs/ktransformers_head_cu128/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 295, in logits_to_token
last = torch.multinomial(probs, num_samples=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
So it didn't really solve the issues I've identified, but it may solve other issues that don't affect me, my hardware, or my environment. HTH.
> I realize this is mostly a balance_serve PR, but all of my testing to date has focused on the ktransformers backend, so that's all I can test with confidence.
I see your point. Recently I was doing the same, but with the newest release ktransformers introduced on-storage prefix caching, which is available only with the balance_serve backend. So all my testing has focused on the balance_serve backend.
> flow1.sh still crashes the ktransformers backend in one shot with this patch applied:
Okay great! It's great to have the bug case isolated as a curl command. Just for fun I will try to reproduce your bug case with the ktransformers legacy backend today.
@createthis
yeah, the ktransformers backend is indeed crashing.
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/responses.py", line 263, in __call__
| async with anyio.create_task_group() as task_group:
| ~~~~~~~~~~~~~~~~~~~~~~~^^
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
| raise BaseExceptionGroup(
| "unhandled errors in a TaskGroup", self._exceptions
| ) from None
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| self.scope, self.receive, self.send
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| )
| ^
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/applications.py", line 112, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/middleware/errors.py", line 187, in __call__
| raise exc
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/middleware/errors.py", line 165, in __call__
| await self.app(scope, receive, _send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/middleware/cors.py", line 85, in __call__
| await self.app(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/routing.py", line 714, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/routing.py", line 734, in app
| await route.handle(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/routing.py", line 288, in handle
| await self.app(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/routing.py", line 76, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/routing.py", line 74, in app
| await response(scope, receive, send)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/responses.py", line 262, in __call__
| with collapse_excgroups():
| ~~~~~~~~~~~~~~~~~~^^
| File "/usr/lib/python3.13/contextlib.py", line 162, in __exit__
| self.gen.throw(value)
| ~~~~~~~~~~~~~~^^^^^^^
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
| raise exc
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/responses.py", line 266, in wrap
| await func()
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/starlette/responses.py", line 246, in stream_response
| async for chunk in self.body_iterator:
| ...<2 lines>...
| await send({"type": "http.response.body", "body": chunk, "more_body": True})
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
| async for event in async_events:
| ...<2 lines>...
| yield event
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
| async for event in async_events:
| ...<3 lines>...
| yield event.to_stream_reply()
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
| async for event in async_events:
| yield event
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 107, in filter_chat_chunk
| async for event in async_events:
| if isinstance(event, ChatCompletionChunk):
| yield event
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/api/openai/endpoints/chat.py", line 266, in inner
| async for res in interface.inference(input_message, id, create.temperature, create.top_p, create.max_tokens, create.max_completion_tokens):
| ...<135 lines>...
| yield chunk
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 241, in inference
| async for v in super().inference(local_messages, thread_id, temperature, top_p, max_tokens, max_completion_tokens):
| yield v
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 466, in inference
| for t in self.prefill(input_ids, self.check_is_new(thread_id), temperature, top_p, max_tokens, max_completion_tokens):
| ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
| response = gen.send(None)
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 230, in prefill
| next_token = self.logits_to_token(logits[0, -1, :])
| File "/opt/ktransformers/.ktransformers/lib/python3.13/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 295, in logits_to_token
| last = torch.multinomial(probs, num_samples=1)
| torch.AcceleratorError: CUDA error: device-side assert triggered
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
|
+------------------------------------
I am pretty sure the problem is similar to what is going on with balance_serve and R1/V3, and I can fix it. But should we do that? The thing is, the ktransformers legacy backend is getting deprecated.
@createthis
yeah, it is either crashing (above) or outputting garbage with the ktransformers legacy backend.
in fact, the gibberish it outputs even looks disturbing:
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Ouvrardాలుాలుాలుాలుాలుాలుాలుాలుాలుrapeutాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుా
ుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుrapeutాలుాలుాలుాలుాలుాలుాలుాలుాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Ouvrardాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుrapeutాలు
"},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Paglinాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Ouvrardాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Ouvrardాలుాలుాలుాలుాలుాలుాలుాలుాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Bourgoinాలుాలుrapeutాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
data: {"id":"d41995247d054bdf8f0d2549d8048634","choices":[{"index":0,"delta":{"content":"Ouvrardాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలుాలు "},"finish_reason":null}],"created":1751876661,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_9779974fc781","usage":null}
@createthis
that's the output from the balance_serve backend with our input (with on-storage prefix cache enabled):
(no crashes, no gibberish)
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"I'll "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"start "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"by "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"reading "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"the "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"first "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"required "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"file "},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"`/workspace/WHI.Web.ClientUI/wwwroot/js/pages/care/watch.js`:\n\n"},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"<function=read_file>\n"},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"<parameter=path>/workspace/WHI.Web.ClientUI/wwwroot/js/pages/care/watch.js</parameter>\n"},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{"content":"</function>"},"finish_reason":null}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":null}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":{"prompt_tokens":7936,"completion_tokens":65,"total_tokens":8001,"prompt_tokens_details":null,"completion_tokens_details":null}}
data: {"id":"fc415af8b6d64cef82e967eaf687ee2f","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"created":1751879117,"model":"unsloth/DeepSeek-V3-0324-GGUF","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":"fp_f68a31478966","usage":{"prompt_tokens":7936,"completion_tokens":65,"total_tokens":8001,"prompt_tokens_details":null,"completion_tokens_details":null}}
data: [DONE]
> I am pretty sure the problem is similar to what is going on with balance_serve and R1/V3, and I can fix it. But should we do that? The thing is, the ktransformers legacy backend is getting deprecated.
My opinion as an old guy who has worked on a lot of software projects over the years is that this project needs unit tests that prevent this sort of thing from happening and CI/CD to enforce it. If it had that, the ktransformers backend could be maintained in a fork or something if they don’t want to support it. However, without the unit tests and CI/CD, this sort of thing will always happen and the ktransformers project will be unstable. I’ve switched to llama.cpp for this reason.
> My opinion as an old guy who has worked on a lot of software projects over the years is that this project needs unit tests that prevent this sort of thing from happening and CI/CD to enforce it. If it had that, the ktransformers backend could be maintained in a fork or something if they don't want to support it. However, without the unit tests and CI/CD, this sort of thing will always happen and the ktransformers project will be unstable. I've switched to llama.cpp for this reason.
Let's see if the authors will merge the PR you made. If not, I am forking it lol and doing everything properly.
> Let's see if the authors will merge the PR you made. If not, I am forking it lol and doing everything properly.
I wish I knew enough about LLM architecture to help. I honestly can't tell if the PR we made solves anything or if it just moves the problem around a bit like a mop with dirty water.
> Let's see if the authors will merge the PR you made. If not, I am forking it lol and doing everything properly.
> I wish I knew enough about LLM architecture to help. I honestly can't tell if the PR we made solves anything or if it just moves the problem around a bit like a mop with dirty water.
Well, at least I cannot crash ktransformers with the balance_serve backend anymore, nor does it output gibberish as before. The problem appears to be related to how the sampler handles GPU->CPU synchronization during inference (and the fix also increases the buffers to deal with the 128k context). I mean, any kind of unit test would have caught this before the release. I have no idea if anyone in ktransformers tested it before the release. Apparently not.
llama.cpp is too slow. Have you tried ik_llama.cpp with _R4 quants? It runs stably, but it doesn't have tool calls or on-storage prefix caching. So ktransformers is actually more advanced than any other framework, but it needs some polishing.
llama.cpp is faster for me in tok/s than ktransformers, due to the way it handles NUMA. I can run it with NPS4 and tune the number of CPU threads to be optimal, instead of NPS0 under ktransformers. I did prefer the prefill cache in the ktransformers backend, but that's not supported now either, so I really no longer have a dog in this race.
llama.cpp does use more VRAM than ktransformers, but I have plenty of VRAM since I upgraded from a 3090 to a blackwell 6000 pro.
I have not tried ik_llama.cpp. I don't like that it uses yet another file format. I have limited disk space, so I'm sticking to GGUF for now.