Georgi Gerganov
In some cases, this call can take a significant amount of time. We should add an option to provide a callback that will be used to check if...
https://github.com/ggerganov/ggml/blob/b98cd8689f74ed69432323ef5a15369d96086ae1/include/ggml/ggml.h#L415-L420
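For illustration, a minimal sketch of how such a callback hook could look; the `abort_callback_t` type, the polling placement, and the function names are all hypothetical, not the actual `ggml` API:

```c++
#include <atomic>
#include <cstdio>

// Hypothetical callback type: return true to request an early abort.
typedef bool (*abort_callback_t)(void * user_data);

// A long-running routine that periodically polls the callback between chunks of work.
static bool long_running_call(abort_callback_t cb, void * user_data) {
    for (int step = 0; step < 1000000; ++step) {
        // ... do a chunk of work ...
        if (cb && cb(user_data)) {
            fprintf(stderr, "aborted at step %d\n", step);
            return false; // aborted by the caller
        }
    }
    return true; // completed normally
}

// Example callback: abort when an external flag is set.
static bool should_abort(void * user_data) {
    return static_cast<std::atomic<bool> *>(user_data)->load();
}

int main() {
    std::atomic<bool> stop{false};
    long_running_call(should_abort, &stop);
    return 0;
}
```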
The [server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) example has been growing in functionality, but unfortunately I feel it is not very stable at the moment, and there are some important features that are still missing....
ref #3365

Setting up what's needed for Flash Attention support in `ggml` and `llama.cpp`. The proposed operator performs:

```c++
// new
res = ggml_flash_attn(ctx, q, k, v, kq_mask, kq_scale);
//...
```
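For reference, a minimal self-contained sketch of the computation the fused operator performs, i.e. `softmax(scale * Q·Kᵀ + mask) · V` for a single head with row-major buffers; this is only an illustration of the math, not the proposed implementation:

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive reference: out = softmax(scale * Q K^T + mask) V
// n_q  - number of query positions, n_kv - number of key/value positions
// d    - head dimension; all buffers are row-major, single head
void attention_ref(const std::vector<float> & Q, const std::vector<float> & K,
                   const std::vector<float> & V, const std::vector<float> & mask,
                   std::vector<float> & out, int n_q, int n_kv, int d, float scale) {
    out.assign((size_t) n_q * d, 0.0f);
    std::vector<float> s(n_kv);
    for (int i = 0; i < n_q; ++i) {
        // scaled dot products plus additive mask
        float smax = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += Q[i*d + c] * K[j*d + c];
            s[j] = scale * dot + mask[i*n_kv + j];
            smax = std::max(smax, s[j]);
        }
        // softmax over the kv dimension
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) { s[j] = std::exp(s[j] - smax); sum += s[j]; }
        // weighted sum of V rows
        for (int j = 0; j < n_kv; ++j) {
            const float p = s[j] / sum;
            for (int c = 0; c < d; ++c) out[i*d + c] += p * V[j*d + c];
        }
    }
}

int main() {
    // 1 query, 2 kv positions, head dim 2, zero mask
    std::vector<float> Q{1, 0}, K{1, 0, 0, 1}, V{1, 2, 3, 4}, mask(2, 0.0f), out;
    attention_ref(Q, K, V, mask, out, 1, 2, 2, 1.0f);
    for (float v : out) printf("%f\n", v);
    return 0;
}
```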
The `n_ctx` name is causing some confusion since its actual meaning is the size of the KV cache, while `n_ctx_train` is the training context of the model. This change fixes...
I did the following test to tokenize `wiki.test.raw` using our tokenizer and the Python tokenizer. The expectation is that the outputs will match:

```bash
# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/...
```
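For the comparison step itself, a minimal sketch assuming each tokenizer has already dumped its output as one token id per line; the file names `tokens-cpp.txt` and `tokens-py.txt` are placeholders, not part of the actual test scripts:

```c++
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <vector>

// Read a token-id dump (one integer per line).
static std::vector<long> read_ids(const char * path) {
    std::ifstream f(path);
    std::vector<long> ids;
    long id;
    while (f >> id) ids.push_back(id);
    return ids;
}

int main() {
    const auto a = read_ids("tokens-cpp.txt"); // llama.cpp tokenizer output (placeholder name)
    const auto b = read_ids("tokens-py.txt");  // Python tokenizer output   (placeholder name)
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            printf("first mismatch at position %zu: %ld vs %ld\n", i, a[i], b[i]);
            return 1;
        }
    }
    if (a.size() != b.size()) {
        printf("length mismatch: %zu vs %zu tokens\n", a.size(), b.size());
        return 1;
    }
    printf("outputs match (%zu tokens)\n", n);
    return 0;
}
```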
The [convert-llama2c-to-ggml](https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml) example is mostly functional, but could use some maintenance effort. It also needs an update to support the `n_head_kv` parameter, which is required for multi-query models (e.g. [stories260K](https://huggingface.co/karpathy/tinyllamas/blob/main/stories260K/readme.md)). Here is a quick'n'dirty...
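For context, a minimal sketch of what `n_head_kv` encodes: with multi-query / grouped-query attention several query heads share one key/value head, so the converter has to map each query head to its KV head roughly as follows (the head counts here are illustrative, not taken from any specific model):

```c++
#include <cstdio>

int main() {
    const int n_head    = 8; // query heads
    const int n_head_kv = 2; // key/value heads (n_head_kv == n_head would be classic MHA)
    const int n_gqa     = n_head / n_head_kv; // query heads per kv head

    for (int h = 0; h < n_head; ++h) {
        const int h_kv = h / n_gqa; // kv head used by query head h
        printf("query head %d -> kv head %d\n", h, h_kv);
    }
    return 0;
}
```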
Automated changes by the [update-flake-lock](https://github.com/DeterminateSystems/update-flake-lock) GitHub Action.

```
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
  → 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
```

### Running GitHub Actions on this PR

GitHub...
ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602

We should be able to run inference on multiple graphs, backends, and devices in parallel. Currently, there are CUDA singletons that break this requirement, and possibly there could...
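To illustrate the kind of refactoring implied, a hedged sketch contrasting a global singleton holding device resources with per-backend-instance state; the struct names are made up and do not reflect the actual ggml-cuda internals:

```c++
#include <memory>
#include <vector>

// Problematic shape: one global object holding streams/handles/scratch buffers.
// Every graph evaluation on every device funnels through it, so independent
// contexts cannot safely run in parallel.
struct cuda_globals {
    void * stream  = nullptr;
    void * scratch = nullptr;
};
static cuda_globals g_cuda; // singleton shared by all evaluations

// Preferred shape: each backend instance owns its resources, so one instance
// per device (or per graph) can be evaluated concurrently without contention.
struct backend_instance {
    int    device  = 0;
    void * stream  = nullptr;
    void * scratch = nullptr;
};

int main() {
    std::vector<std::unique_ptr<backend_instance>> backends;
    for (int dev = 0; dev < 2; ++dev) {
        auto b = std::make_unique<backend_instance>();
        b->device = dev;
        backends.push_back(std::move(b));
    }
    (void) g_cuda; // the goal is for nothing to depend on this global
    return 0;
}
```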
Recently, initial Mamba support (CPU-only) has been introduced in #5328 by @compilade. In order to support running these models efficiently on the GPU, we seem to be lacking kernel implementations...
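For reference, a minimal single-channel sketch of the selective-scan recurrence such a GPU kernel would have to implement; the actual ops added in #5328 operate on batched, multi-dimensional tensors with input-dependent parameters, so this is only an illustration of the math:

```c++
#include <cstdio>
#include <vector>

// Minimal single-channel selective-scan reference:
//   h_t = a_t * h_{t-1} + b_t * x_t
//   y_t = c_t * h_t
// The per-step coefficients a_t, b_t, c_t are input-dependent in Mamba,
// which is what makes the scan "selective" and non-trivial to parallelize.
std::vector<float> ssm_scan_ref(const std::vector<float> & x,
                                const std::vector<float> & a,
                                const std::vector<float> & b,
                                const std::vector<float> & c) {
    std::vector<float> y(x.size());
    float h = 0.0f; // recurrent state carried across time steps
    for (size_t t = 0; t < x.size(); ++t) {
        h    = a[t] * h + b[t] * x[t];
        y[t] = c[t] * h;
    }
    return y;
}

int main() {
    const std::vector<float> x{1, 2, 3}, a{0.5f, 0.5f, 0.5f}, b{1, 1, 1}, c{1, 1, 1};
    for (float v : ssm_scan_ref(x, a, b, c)) printf("%f\n", v);
    return 0;
}
```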