Diego Devesa


With the OpenCL backend, the CPU threads spin-wait while the matrix multiplication runs on the GPU. This is of course very bad for performance...

The OpenCL backend hooks into the CPU backend and takes control of the execution of some ops. Other backends implement the ggml-backend interface, and the CPU backend isn't running at...

The CPU and GPU backends do not work in parallel; they each process a different part of the model, in sequence. Generally what happens is that the input layer...

I don't think there is anything about OpenCL that would prevent creating a better backend implementation, it just hasn't been updated much since it was added to ggml.

I tried to test this with this same model, but I am not able to reproduce this on my system with 3090+3080. Maybe running with `compute-sanitizer` will show more details.

When I made the PR for `--ignore-eos`, the code that ignores eos in interactive mode hadn't been added yet. However, I think that my solution is better because it avoids sampling...

> We could then remove this flag (or maybe it has other uses so we could keep it)

I still find it useful outside of interactive mode to force the...

There are functions in the llama.h API to read the metadata. It should work with any non-array metadata. https://github.com/ggerganov/llama.cpp/blob/8084d554406b767d36b3250b3b787462d5dd626f/llama.h#L357-L367

It would really help to diagnose this if you are able to reproduce it with one of the examples in this repository. If that's not possible, I would suggest looking...

You may be thinking of a [library](https://www.npmjs.com/package/@huggingface/gguf) that Hugging Face released that can read GGUF *metadata* without downloading the whole file. You wouldn't gain much from streaming the model for inference,...