Georgi Gerganov

Results: 420 comments of Georgi Gerganov

Currently, there is no way to disable the GPU completely when the project is built with OpenCL support. I'll think about fixing this. In the meantime, does the information from...

You can easily update `ggml.c` to avoid all GPU calls (CUDA, OpenCL, etc.) if a global flag is set. For example here: https://github.com/ggerganov/whisper.cpp/blob/1f50a7d29f85f221368e81201780e0c8dd631076/ggml.c#L9816-L9825 You can add a `void ggml_gpu_set(bool enable);`...
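A minimal sketch of that approach, assuming a hypothetical flag and accessor names (the actual integration point is the dispatch code in `ggml.c` linked above, which is not reproduced here):

```c
#include <stdbool.h>

// Hypothetical global toggle -- a sketch of the approach described above,
// not the actual ggml API. The GPU dispatch code would consult this flag
// before taking any CUDA/OpenCL path.
static bool g_gpu_enabled = true;

void ggml_gpu_set(bool enable) {
    g_gpu_enabled = enable;
}

bool ggml_gpu_get(void) {
    return g_gpu_enabled;
}

// In the compute dispatch (e.g. where ggml decides between the GPU and CPU
// mat mul), the GPU branch would then be guarded roughly like this:
//
//     if (g_gpu_enabled && /* GPU can handle this op */) {
//         // ... GPU path ...
//     } else {
//         // ... existing CPU path ...
//     }
```

Calling `ggml_gpu_set(false)` at startup would then force every operation onto the CPU path without rebuilding the project.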

I'll probably make a new one soon, yes

@ikawrakow Just made a full cuBLAS run on 13B using `Q4_3`, without RMSE optimization and with `output` in F16 precision, and got: `5.3075`

```
main: seed = 1682170268
llama.cpp: loading model...
```

My result for 13B, using `Q4_3` with RMSE optimization + F16 output, is: `5.2962`. I think this result makes more sense, since it is in line with my expectation that I...

> @ggerganov Are these results with or without the changes you made to `Q4_3` after I opened this PR (and reported the results)?

It includes all changes from today related...

I think we cannot expect cuBLAS and OpenBLAS to produce exactly the same results, because cuBLAS dequantizes `x` to F16, casts `y` to F16, and performs the mat mul in F16, while...
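To illustrate why the two paths cannot match bit-for-bit, here is a small self-contained sketch (not ggml code) that rounds a float to F16 precision by keeping only 10 mantissa bits; values such as 0.1 are not exactly representable in F16, so a dot product computed from the rounded operands drifts from the F32 result:

```c
#include <stdint.h>
#include <string.h>

// Round a float to roughly the nearest F16-representable value by keeping
// only 10 mantissa bits (round-half-up; ignores exponent range, subnormals
// and infinities -- a sketch for illustration, not a full f32->f16 cast).
float f16_round(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    u += 0x1000;        // round at bit 13 (the first discarded mantissa bit)
    u &= 0xFFFFE000u;   // zero the low 13 mantissa bits, keeping 10
    memcpy(&x, &u, sizeof(x));
    return x;
}

// Dot product with the operands first rounded to F16 precision, mimicking
// a backend that casts its inputs to F16 before multiplying.
float dot_as_f16(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += f16_round(a[i]) * f16_round(b[i]);
    }
    return sum;
}
```

Feeding both paths identical inputs (e.g. vectors of values like `0.1f`) yields slightly different sums, analogous to the small perplexity differences seen between the two backends.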

@ExtReMLapin This copy is used only in the `speculative` example. Even if it helps there, it won't have any effect on the general use case. Still, a PR is welcome...

> if I were to make improvements to the grammar engine, would those speed improvements show up in our current bank of benchmarks?

We don't have benchmarks for this yet....

This breaks the "real-time" stream usage. For example, see the videos here: https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.nvim