Diego Devesa
Somebody with access to dual 7900 XTX would need to diagnose the issue. AFAIK nobody who is working on the CUDA/HIP backend at the moment has access to this hardware.
Can you test if it works with this change? (do not use `-sm row`).
```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 04c6f5d0..06af740e 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -797,7 +797,7 @@...
```
The parallel example has a few hard-coded stop strings, including the newline. You are also limiting the sequences to 100 tokens with `-n 100`. https://github.com/ggerganov/llama.cpp/blob/633782b8d949f24b619e6c68ee37b5cc79167173/examples/parallel/parallel.cpp#L357-L361
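For illustration only, this is the general shape of such a check, not the actual code at the link above; the stop strings and the helper name here are made up:
```cpp
#include <string>
#include <vector>

// Hypothetical sketch: stop a sequence once any of a fixed list of
// stop strings shows up in the generated text. The real list lives in
// examples/parallel/parallel.cpp (see the permalink above).
static bool has_stop_string(const std::string & generated) {
    const std::vector<std::string> stop_strings = {
        "\n",     // the newline mentioned above
        "User:",  // illustrative only
    };
    for (const auto & s : stop_strings) {
        if (generated.find(s) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```
Note that `-n 100` caps the sequence length regardless of whether a stop string is ever generated.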
@whoreson this is getting a bit tiresome. Are you going to ask people to harass me over this again? Let's be clear: I have no interest in supporting ancient versions...
Here is a quick test of the performance impact of quantization on LoRA:
```
perf_total_per_op_us[             ADD] =  12.776 ms
perf_total_per_op_us[         MUL_MAT] =  47.818 ms
perf_total_per_op_us[           SCALE] =   9.319 ms
perf_total_per_op_us[...
```
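For context, per-op totals like these come from ggml's optional perf counters; a rough sketch of how such a dump is produced, assuming a graph `gf` that has already been built and evaluated, and a ggml build with `GGML_PERF` defined (the graph build/compute API has changed between ggml versions, so treat this as a sketch):
```cpp
// gf is a ggml_cgraph that has already been evaluated.
// With GGML_PERF defined at build time, the per-node timings are
// accumulated during compute, and this prints the
// perf_total_per_op_us[...] lines shown above along with other graph info.
ggml_graph_print(&gf);
```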
That said, I absolutely agree that `ggml.c` is too big and any simplification would be good. If you are not opposed to splitting ggml into multiple files, we could look into...
> But in any case, 0.5s for applying a LoRA adapter does not sound bad at all.

This is just for a single tensor; applying it to the entire model can...
You can enclose the incompatible code in a `#if CUDART_VERSION >= 11100` block or similar, and return false for older versions. A PR to do this would be welcome.
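A minimal sketch of the kind of guard meant here; the function name and body are placeholders, only the `CUDART_VERSION` check itself is the actual suggestion:
```cpp
#include <cuda_runtime.h>

// Placeholder for a code path that only works on newer CUDA toolkits.
static bool ggml_cuda_supports_new_feature() {
#if CUDART_VERSION >= 11100
    // Code that requires CUDA 11.1 or newer goes here.
    return true;
#else
    // Older toolkits: report the feature as unsupported instead of
    // failing to compile.
    return false;
#endif
}
```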
That would be up to @ggerganov , but I see no issue with it as long as it is optional.
As it is, this is going to break every user of ggml-cuda other than llama.cpp, and even within llama.cpp this approach will fail to detect changes to the graph in...