fairydreaming

Results: 85 comments by fairydreaming

> > Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this)....
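
For context, "a separate model architecture" in llama.cpp terms would roughly mean a new `llm_arch` entry next to the existing DeepSeek2 one. The sketch below only illustrates that idea; the enum value `LLM_ARCH_DEEPSEEK2_MLA`, the name string, and the file layout are assumptions, not code from any merged PR.

```cpp
#include <map>

// Hypothetical sketch: a dedicated architecture value for the MLA-optimized
// graph, registered alongside the existing DeepSeek2 entry so that GGUF files
// converted with the MLA tensor layout would select the optimized implementation.
enum llm_arch {
    // ...
    LLM_ARCH_DEEPSEEK2,
    LLM_ARCH_DEEPSEEK2_MLA,   // hypothetical MLA-optimized variant
    // ...
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    // ...
    { LLM_ARCH_DEEPSEEK2,     "deepseek2"     },
    { LLM_ARCH_DEEPSEEK2_MLA, "deepseek2-mla" },  // hypothetical name string
    // ...
};
```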

> @fairydreaming Is there any reason this should cause issues with RPC. Encountered:
>
> ```
> ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to...
> ```

> Sorry for what is likely a silly question, but does this have an impact on KV cache size when using full offload with CUDA? Because that would be very...

> I was under the impression that MLA was specifically used _because_ it uses way lower amounts of RAM; ~500+ GB of RAM for 160k context size is not really...
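
The RAM question comes down to what gets cached per token. As a rough back-of-the-envelope check, the sketch below compares a naive per-head K/V cache against the MLA latent cache. The hyperparameters are assumed DeepSeek-V3-like values (61 layers, 128 heads, kv_lora_rank 512, RoPE head dim 64), not numbers taken from this thread.

```cpp
#include <cstdio>

int main() {
    // Assumed DeepSeek-V3-like hyperparameters (illustration only).
    const long n_layer        = 61;
    const long n_head         = 128;
    const long qk_nope_dim    = 128;
    const long qk_rope_dim    = 64;
    const long v_head_dim     = 128;
    const long kv_lora_rank   = 512;
    const long bytes_per_elem = 2;    // F16 cache

    // Naive cache: full per-head K (nope + rope parts) and V for every head.
    const long naive_per_tok = n_layer * n_head * (qk_nope_dim + qk_rope_dim + v_head_dim) * bytes_per_elem;
    // MLA cache: one compressed KV latent plus the shared RoPE key per layer.
    const long mla_per_tok   = n_layer * (kv_lora_rank + qk_rope_dim) * bytes_per_elem;

    std::printf("naive: ~%ld KiB/token, MLA: ~%ld KiB/token, ratio ~%ldx\n",
                naive_per_tok / 1024, mla_per_tok / 1024, naive_per_tok / mla_per_tok);
    return 0;
}
```

With these assumed values the naive layout works out to a few MiB per token while the latent cache is tens of KiB per token, which is why the MLA path changes the RAM picture so dramatically at long context.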

> > ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
> > evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
> > [...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed
>
> ...

> The problem is that the CUDA backend does not support broadcasting on the 4th dimension. The error says:
>
> > src0->ne[3] = 1, src1->ne[3] = 2
>
> ...
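
To make the shape relationship concrete: in ggml, `ggml_mul_mat(ctx, a, b)` allows `a` to be broadcast over the 3rd and 4th dimensions of `b`, so `src0->ne[3] = 1` against `src1->ne[3] = 2` is a legal graph that the CPU backend handles, while the CUDA backend at the time rejected it and fell back. The snippet below is only a minimal shape illustration (the tensor sizes are made up, not the actual kqv-31 node) and merely builds the graph node.

```cpp
#include "ggml.h"

int main() {
    struct ggml_init_params params;
    params.mem_size   = 16*1024*1024;
    params.mem_buffer = NULL;
    params.no_alloc   = false;
    struct ggml_context * ctx = ggml_init(params);

    // src0 has ne[3] = 1 while src1 has ne[3] = 2: a matrix multiplication
    // that broadcasts src0 over the 4th dimension of src1.
    struct ggml_tensor * src0 = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 64, 128, 8, 1);
    struct ggml_tensor * src1 = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 64,  16, 8, 2);
    struct ggml_tensor * kqv  = ggml_mul_mat(ctx, src0, src1);

    (void) kqv;   // building the node is enough to show the shape pattern
    ggml_free(ctx);
    return 0;
}
```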

> @fairydreaming
>
> Thank you for your excellent work. I'm getting almost the same performance (around 7 tokens/s) on a 9654 machine. I tried increasing `--threads` from 32 to...

> > The hacky part is the warmup detection - I explicitly examine the ubatch tokens to detect the warmup. I couldn't find a better way to do it, let...

@jukofyork It's not a matter of resolving the conflicts. Since #12181 is now merged, the code on which I based this little hack is no longer there. It would have...
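
For anyone curious what "examine the ubatch tokens to detect the warmup" means in practice, here is a rough illustration of the heuristic. It is not the code from the PR, and the struct below is a simplified stand-in for llama.cpp's internal ubatch type; the idea is that llama.cpp's warmup decode only feeds the BOS/EOS special tokens, so a micro-batch consisting solely of those tokens can be treated as warmup.

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Simplified stand-in for the internal ubatch type, not the real llama_ubatch.
struct ubatch_view {
    uint32_t            n_tokens;
    const llama_token * token;
};

// Heuristic: a tiny batch containing only BOS/EOS tokens is assumed to be the
// warmup pass, so expert-related optimizations can be skipped or adjusted.
static bool looks_like_warmup(const ubatch_view & ub, llama_token bos, llama_token eos) {
    if (ub.n_tokens == 0 || ub.n_tokens > 2) {
        return false;
    }
    for (uint32_t i = 0; i < ub.n_tokens; ++i) {
        if (ub.token[i] != bos && ub.token[i] != eos) {
            return false;
        }
    }
    return true;
}

int main() {
    std::vector<llama_token> warmup = { 1, 2 };   // hypothetical BOS = 1, EOS = 2
    ubatch_view ub = { (uint32_t) warmup.size(), warmup.data() };
    return looks_like_warmup(ub, 1, 2) ? 0 : 1;
}
```

The fragility mentioned above is exactly why such a check breaks once the surrounding batching code is refactored, as happened after #12181.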