Georgi Gerganov
In some cases, this call can take a significant amount of time. We should add an option to provide a callback that will be used to check if...
https://github.com/ggerganov/ggml/blob/b98cd8689f74ed69432323ef5a15369d96086ae1/include/ggml/ggml.h#L415-L420
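For illustration, a minimal sketch of how such a callback hook could look; the `abort_callback_t` type, the polling placement, and the function names are all hypothetical, not the actual `ggml` API:

```c++
#include <atomic>
#include <cstdio>

// Hypothetical callback type: return true to request an early abort.
typedef bool (*abort_callback_t)(void * user_data);

// A long-running routine that periodically polls the callback between chunks of work.
static bool long_running_call(abort_callback_t cb, void * user_data) {
    for (int step = 0; step < 1000000; ++step) {
        // ... do a chunk of work ...
        if (cb && cb(user_data)) {
            fprintf(stderr, "aborted at step %d\n", step);
            return false; // aborted by the caller
        }
    }
    return true; // completed normally
}

// Example callback: abort when an external flag is set.
static bool should_abort(void * user_data) {
    return static_cast<std::atomic<bool> *>(user_data)->load();
}

int main() {
    std::atomic<bool> stop{false};
    long_running_call(should_abort, &stop);
    return 0;
}
```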
The [server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) example has been growing in functionality, but unfortunately I feel it is not very stable at the moment, and there are some important features that are still missing....
ref #3365

Setting up what's needed for Flash Attention support in `ggml` and `llama.cpp`. The proposed operator performs:

```c++
// new
res = ggml_flash_attn(ctx, q, k, v, kq_mask, kq_scale);
//...
```
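For reference, a minimal self-contained sketch of the computation the fused operator performs, i.e. `softmax(scale * Q·Kᵀ + mask) · V` for a single head with row-major buffers; this is only an illustration of the math, not the proposed implementation:

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive reference: out = softmax(scale * Q K^T + mask) V
// n_q  - number of query positions, n_kv - number of key/value positions
// d    - head dimension; all buffers are row-major, single head
void attention_ref(const std::vector<float> & Q, const std::vector<float> & K,
                   const std::vector<float> & V, const std::vector<float> & mask,
                   std::vector<float> & out, int n_q, int n_kv, int d, float scale) {
    out.assign((size_t) n_q * d, 0.0f);
    std::vector<float> s(n_kv);
    for (int i = 0; i < n_q; ++i) {
        // scaled dot products plus additive mask
        float smax = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += Q[i*d + c] * K[j*d + c];
            s[j] = scale * dot + mask[i*n_kv + j];
            smax = std::max(smax, s[j]);
        }
        // softmax over the kv dimension
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) { s[j] = std::exp(s[j] - smax); sum += s[j]; }
        // weighted sum of V rows
        for (int j = 0; j < n_kv; ++j) {
            const float p = s[j] / sum;
            for (int c = 0; c < d; ++c) out[i*d + c] += p * V[j*d + c];
        }
    }
}

int main() {
    // 1 query, 2 kv positions, head dim 2, zero mask
    std::vector<float> Q{1, 0}, K{1, 0, 0, 1}, V{1, 2, 3, 4}, mask(2, 0.0f), out;
    attention_ref(Q, K, V, mask, out, 1, 2, 2, 1.0f);
    for (float v : out) printf("%f\n", v);
    return 0;
}
```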
The `n_ctx` name is causing some confusion since its actual meaning is the size of the KV cache, while `n_ctx_train` is the training context of the model. This change fixes...
I did the following test to tokenize `wiki.test.raw` using our tokenizer and the Python tokenizer. The expectation is that the outputs will match:

```bash
# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/...
```
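For the comparison step itself, a minimal sketch assuming each tokenizer has already dumped its output as one token id per line; the file names `tokens-cpp.txt` and `tokens-py.txt` are placeholders, not part of the actual test scripts:

```c++
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <vector>

// Read a token-id dump (one integer per line).
static std::vector<long> read_ids(const char * path) {
    std::ifstream f(path);
    std::vector<long> ids;
    long id;
    while (f >> id) ids.push_back(id);
    return ids;
}

int main() {
    const auto a = read_ids("tokens-cpp.txt"); // llama.cpp tokenizer output (placeholder name)
    const auto b = read_ids("tokens-py.txt");  // Python tokenizer output   (placeholder name)
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            printf("first mismatch at position %zu: %ld vs %ld\n", i, a[i], b[i]);
            return 1;
        }
    }
    if (a.size() != b.size()) {
        printf("length mismatch: %zu vs %zu tokens\n", a.size(), b.size());
        return 1;
    }
    printf("outputs match (%zu tokens)\n", n);
    return 0;
}
```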
The [convert-llama2c-to-ggml](https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml) example is mostly functional, but could use some maintenance effort. It also needs an update to support the `n_head_kv` parameter, which is required for multi-query models (e.g. [stories260K](https://huggingface.co/karpathy/tinyllamas/blob/main/stories260K/readme.md)). Here is a quick'n'dirty...
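For context, a minimal sketch of what `n_head_kv` encodes: with multi-query / grouped-query attention several query heads share one key/value head, so the converter has to map each query head to its KV head roughly as follows (the head counts here are illustrative, not taken from any specific model):

```c++
#include <cstdio>

int main() {
    const int n_head    = 8; // query heads
    const int n_head_kv = 2; // key/value heads (n_head_kv == n_head would be classic MHA)
    const int n_gqa     = n_head / n_head_kv; // query heads per kv head

    for (int h = 0; h < n_head; ++h) {
        const int h_kv = h / n_gqa; // kv head used by query head h
        printf("query head %d -> kv head %d\n", h, h_kv);
    }
    return 0;
}
```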
Automated changes by the [update-flake-lock](https://github.com/DeterminateSystems/update-flake-lock) GitHub Action.

```
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
  → 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
```

### Running GitHub Actions on this PR

GitHub...
ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602

We should be able to run inference on multiple graphs, backends, and devices in parallel. Currently, there are CUDA singletons that break this requirement, and possibly there could...
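To illustrate the kind of refactoring implied, a hedged sketch contrasting a global singleton holding device resources with per-backend-instance state; the struct names are made up and do not reflect the actual ggml-cuda internals:

```c++
#include <memory>
#include <vector>

// Problematic shape: one global object holding streams/handles/scratch buffers.
// Every graph evaluation on every device funnels through it, so independent
// contexts cannot safely run in parallel.
struct cuda_globals {
    void * stream  = nullptr;
    void * scratch = nullptr;
};
static cuda_globals g_cuda; // singleton shared by all evaluations

// Preferred shape: each backend instance owns its resources, so one instance
// per device (or per graph) can be evaluated concurrently without contention.
struct backend_instance {
    int    device  = 0;
    void * stream  = nullptr;
    void * scratch = nullptr;
};

int main() {
    std::vector<std::unique_ptr<backend_instance>> backends;
    for (int dev = 0; dev < 2; ++dev) {
        auto b = std::make_unique<backend_instance>();
        b->device = dev;
        backends.push_back(std::move(b));
    }
    (void) g_cuda; // the goal is for nothing to depend on this global
    return 0;
}
```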
Recently, initial Mamba support (CPU-only) has been introduced in #5328 by @compilade. In order to support running these models efficiently on the GPU, we seem to be lacking kernel implementations...
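For reference, a minimal single-channel sketch of the selective-scan recurrence such a GPU kernel would have to implement; the actual ops added in #5328 operate on batched, multi-dimensional tensors with input-dependent parameters, so this is only an illustration of the math:

```c++
#include <cstdio>
#include <vector>

// Minimal single-channel selective-scan reference:
//   h_t = a_t * h_{t-1} + b_t * x_t
//   y_t = c_t * h_t
// The per-step coefficients a_t, b_t, c_t are input-dependent in Mamba,
// which is what makes the scan "selective" and non-trivial to parallelize.
std::vector<float> ssm_scan_ref(const std::vector<float> & x,
                                const std::vector<float> & a,
                                const std::vector<float> & b,
                                const std::vector<float> & c) {
    std::vector<float> y(x.size());
    float h = 0.0f; // recurrent state carried across time steps
    for (size_t t = 0; t < x.size(); ++t) {
        h    = a[t] * h + b[t] * x[t];
        y[t] = c[t] * h;
    }
    return y;
}

int main() {
    const std::vector<float> x{1, 2, 3}, a{0.5f, 0.5f, 0.5f}, b{1, 1, 1}, c{1, 1, 1};
    for (float v : ssm_scan_ref(x, a, b, c)) printf("%f\n", v);
    return 0;
}
```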