Diego Devesa


With the OpenCL backend, the CPU threads spin-wait while the matrix multiplication runs on the GPU. This is of course very bad for performance...

The OpenCL backend hooks into the CPU backend and takes control of the execution of some ops. Other backends implement the ggml-backend interface, and the CPU backend isn't running at...

The CPU and GPU backends do not work in parallel; they each process a different part of the model, in sequence. Generally what happens is that the input layer...

I don't think there is anything about OpenCL that would prevent creating a better backend implementation, it just hasn't been updated much since it was added to ggml.

I tried to test this with this same model, but I am not able to reproduce this on my system with 3090+3080. Maybe running with `compute-sanitizer` will show more details.

When I made the PR for `--ignore-eos`, the code that ignores eos in interactive mode hadn't been added yet. However, I think that my solution is better because it avoids sampling...

> We could then remove this flag (or maybe it has other uses so we could keep it)

I still find it useful outside of interactive mode to force the...

There are functions in the llama.h API to read the metadata. It should work with any non-array metadata. https://github.com/ggerganov/llama.cpp/blob/8084d554406b767d36b3250b3b787462d5dd626f/llama.h#L357-L367

It would really help to diagnose this if you are able to reproduce it with one of the examples in this repository. If that's not possible, I would suggest looking...

You may be thinking of a [library](https://www.npmjs.com/package/@huggingface/gguf) that Hugging Face released that can read GGUF *metadata* without downloading the whole file. You wouldn't gain much from streaming the model for inference,...