Diego Devesa
Somebody with access to dual 7900 XTX would need to diagnose the issue. AFAIK nobody who is working on the CUDA/HIP backend at the moment has access to this hardware.
Can you test if it works with this change? (do not use `-sm row`).
```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 04c6f5d0..06af740e 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -797,7 +797,7 @@...
```
The parallel example has a few hard-coded stop strings, including the newline. You are also limiting the sequences to 100 tokens with `-n 100`. https://github.com/ggerganov/llama.cpp/blob/633782b8d949f24b619e6c68ee37b5cc79167173/examples/parallel/parallel.cpp#L357-L361
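For illustration only, this is the general shape of such a check, not the actual code at the link above; the stop strings and the helper name here are made up:
```cpp
#include <string>
#include <vector>

// Hypothetical sketch: stop a sequence once any of a fixed list of
// stop strings shows up in the generated text. The real list lives in
// examples/parallel/parallel.cpp (see the permalink above).
static bool has_stop_string(const std::string & generated) {
    const std::vector<std::string> stop_strings = {
        "\n",     // the newline mentioned above
        "User:",  // illustrative only
    };
    for (const auto & s : stop_strings) {
        if (generated.find(s) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```
Note that `-n 100` caps the sequence length regardless of whether a stop string is ever generated.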
@whoreson this is getting a bit tiresome. Are you going to ask people to harass me over this again? Let's be clear: I have no interest in supporting ancient versions...
Here is a quick test of the performance impact of quantization on LoRA:
```
perf_total_per_op_us[             ADD] =  12.776 ms
perf_total_per_op_us[         MUL_MAT] =  47.818 ms
perf_total_per_op_us[           SCALE] =   9.319 ms
perf_total_per_op_us[...
```
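For context, per-op totals like these come from ggml's optional perf counters; a rough sketch of how such a dump is produced, assuming a graph `gf` that has already been built and evaluated, and a ggml build with `GGML_PERF` defined (the graph build/compute API has changed between ggml versions, so treat this as a sketch):
```cpp
// gf is a ggml_cgraph that has already been evaluated.
// With GGML_PERF defined at build time, the per-node timings are
// accumulated during compute, and this prints the
// perf_total_per_op_us[...] lines shown above along with other graph info.
ggml_graph_print(&gf);
```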
That said, I absolutely agree that `ggml.c` is too big and any simplification would be good. If you are not opposed to splitting ggml into multiple files, we could look into...
> But in any case, 0.5s for applying a LoRA adapter does not sound bad at all.

This is just for a single tensor; applying it to the entire model can...
You can enclose the incompatible code in a `#if CUDART_VERSION >= 11100` block or similar, and return false for older versions. A PR to do this would be welcome.
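A minimal sketch of the kind of guard meant here; the function name and body are placeholders, only the `CUDART_VERSION` check itself is the actual suggestion:
```cpp
#include <cuda_runtime.h>

// Placeholder for a code path that only works on newer CUDA toolkits.
static bool ggml_cuda_supports_new_feature() {
#if CUDART_VERSION >= 11100
    // Code that requires CUDA 11.1 or newer goes here.
    return true;
#else
    // Older toolkits: report the feature as unsupported instead of
    // failing to compile.
    return false;
#endif
}
```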
That would be up to @ggerganov , but I see no issue with it as long as it is optional.
As it is, this is going to break every user of ggml-cuda other than llama.cpp, and even within llama.cpp this approach will fail to detect changes to the graph in...