Georgi Gerganov

Results: 113 issues by Georgi Gerganov

In some cases, this call can take a significant amount of time. We should add an option to provide a callback that will be used to check if...

enhancement
good first issue

https://github.com/ggerganov/ggml/blob/b98cd8689f74ed69432323ef5a15369d96086ae1/include/ggml/ggml.h#L415-L420

good first issue
refactoring

The [server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) example has been growing in functionality, but unfortunately I feel it is not very stable at the moment, and there are some important features that are still missing....

help wanted
refactoring
server/webui

ref #3365 Setting up what's needed for Flash Attention support in `ggml` and `llama.cpp`. The proposed operator performs:

```c++
// new
res = ggml_flash_attn(ctx, q, k, v, kq_mask, kq_scale);
// ...
```

performance
need feedback

The `n_ctx` name is causing some confusion since its actual meaning is the size of the KV cache, while `n_ctx_train` is the training context of the model. This change fixes...

breaking change
refactoring

I did the following test to tokenize `wiki.test.raw` using our tokenizer and the Python tokenizer. The expectation is that the outputs will match:

```bash
# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/...
```

The [convert-llama2c-to-ggml](https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml) example is mostly functional, but could use some maintenance effort. It also needs an update to support the `n_head_kv` parameter, required for multi-query models (e.g. [stories260K](https://huggingface.co/karpathy/tinyllamas/blob/main/stories260K/readme.md)). Here is a quick'n'dirty...

good first issue
testing
refactoring

Automated changes by the [update-flake-lock](https://github.com/DeterminateSystems/update-flake-lock) GitHub Action.

```
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
  → 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
```

### Running GitHub Actions on this PR

GitHub...

nix

ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602 We should be able to run inference on multiple graphs, backends, and devices in parallel. Currently, there are CUDA singletons that break this requirement, and possibly there could...

refactoring
stale

Recently, initial Mamba support (CPU-only) has been introduced in #5328 by @compilade. In order to support running these models efficiently on the GPU, we seem to be lacking kernel implementations...

enhancement
help wanted
Nvidia GPU