llama.cpp
LLM inference in C/C++
### Git commit
https://github.com/ggml-org/llama.cpp Branch: master

### Operating systems
Windows

### GGML backends
CUDA

### Problem description & steps to reproduce
Ref:
1. https://gorilla.cs.berkeley.edu/blogs/5_how_to_gorilla.html#integrate-third-party
2. https://github.com/ggml-org/llama.cpp Step V, B(5b)

Command: Run...
### Prerequisites

- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [X] I searched using keywords...
Misc. bug: RPC attempt fails with a specific error, but I cannot find any info on troubleshooting it
### Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
version: 4735 (73e2ed3ce)
built...
- Listens for a "setText" command from the parent, carrying "text" and "context" fields: "text" is written into inputMsg, while "context" is used as hidden context on subsequent requests to the llama.cpp server (sketched below)...
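A minimal TypeScript sketch of the flow described above, under stated assumptions: only the "setText" command, its "text"/"context" fields, and the `inputMsg` name come from the description; the message transport, element lookup, and the llama.cpp server endpoint/payload shape are illustrative assumptions, not the component's actual implementation.

```ts
// Hypothetical wiring for the "setText" handler described above.
let hiddenContext = "";

// "inputMsg" is named in the source; assuming it is a text input element.
const inputMsg = document.getElementById("inputMsg") as HTMLInputElement;

window.addEventListener("message", (event: MessageEvent) => {
  const msg = event.data;
  if (msg && msg.command === "setText") {
    inputMsg.value = msg.text;   // visible text shown to the user
    hiddenContext = msg.context; // kept out of the UI, reused below
  }
});

// Subsequent requests prepend the hidden context to the user's prompt.
// Endpoint and payload shape are assumptions based on llama.cpp's HTTP server.
async function sendPrompt(userText: string): Promise<string> {
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: `${hiddenContext}\n${userText}` }),
  });
  const data = await res.json();
  return data.content as string;
}
```

Keeping the context out of the visible input while splicing it into every prompt matches the "hidden context" behavior the description names; the fetch call is only one plausible way to reach the server.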
This commit adjusts the indentation of the functions `parse_sequence` and `parse_rule` in src/llama-grammar.cpp. The motivation is consistency and improved readability.
### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords...
### Name and Version

`llama-server --version`

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes...
```
### Name and Version
llama.cpp-b3999

### Operating systems
Windows

### GGML backends
CUDA

### Hardware
2x RTX 3090, i7-7820X

### Models
cgato/Nemo-12b-Humanize-KTO-v0.1
bartowski/Nemo-12b-Humanize-KTO-v0.1-GGUF

### Problem description & steps to reproduce
...
Currently, small models like Qwen2.5 0.5B do not work properly with the OpenCL backend. This PR fixes that issue. It also changes the subgroup size to 64 for all Adreno GPUs.