Lukas Kreussel
The `llama.cpp:light` Docker image exits with exit code 132 when loading the model on both of my AMD-based systems, hinting at a missing CPU instruction. If I try to run the...
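For context, exit code 132 is 128 + 4, i.e. the process was killed by SIGILL (illegal instruction), which usually means the binary was compiled with SIMD extensions the host CPU lacks. A minimal sketch for checking which x86 extensions a machine actually supports, using the standard library's runtime feature detection (the feature list below is illustrative, picked to mirror flags llama.cpp builds commonly enable):

```rust
// Minimal sketch: print which x86 SIMD extensions the host CPU supports.
// Adjust the feature list to match whatever the binary was compiled with.
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        for (name, supported) in [
            ("sse3", is_x86_feature_detected!("sse3")),
            ("avx", is_x86_feature_detected!("avx")),
            ("avx2", is_x86_feature_detected!("avx2")),
            ("fma", is_x86_feature_detected!("fma")),
            ("f16c", is_x86_feature_detected!("f16c")),
        ] {
            println!("{name}: {supported}");
        }
    }
}
```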
When I try to install and use this package via a requirements file in the default Python 3.10 container, I get the following error when I try to import the...
Setting up cuda-toolkit on a standard Windows GitHub Actions runner can take well over 15 minutes. I tried to speed this up by using the network installer and specifying only the sub-packages...
### Feature request Since I spotted [bert_quant.rs](https://github.com/huggingface/text-embeddings-inference/blob/main/backends/candle/src/models/bert_quant.rs) in the candle backend, I was curious whether it is currently possible to point the embedding server to a "*.gguf" file and load...
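For reference, candle can already parse GGUF containers; a minimal sketch, assuming `candle-core`'s `quantized::gguf_file` module as used in the quantized model examples, that opens a `.gguf` file and lists its metadata keys and tensors:

```rust
use candle_core::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    // Open a GGUF file and parse its header (metadata plus tensor directory).
    let mut file = std::fs::File::open("model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    // Inspect what the container actually holds.
    for key in content.metadata.keys() {
        println!("metadata key: {key}");
    }
    for (name, info) in content.tensor_infos.iter() {
        println!("tensor: {name} shape={:?} dtype={:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```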
Fixes #247 Since we now depend on `pyo3` in `core`, we need to include `libpython` in our runtime container. Maybe we could put this `pyo3` dependency behind a feature flag...
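A sketch of what the feature-gated approach could look like on the Rust side, assuming a Cargo feature named `python` with `pyo3` declared as an optional dependency in `Cargo.toml` (`pyo3 = { ..., optional = true }` and `python = ["dep:pyo3"]`); the function name is a hypothetical stand-in, and the `py.run` call assumes the pre-0.21 pyo3 API:

```rust
// Hypothetical sketch: everything touching pyo3 sits behind a `python`
// Cargo feature, so default builds don't link against libpython.

#[cfg(feature = "python")]
pub fn run_python_hook(code: &str) -> Result<(), String> {
    use pyo3::prelude::*;
    // Acquire the GIL and execute the snippet (pre-0.21 pyo3 API).
    Python::with_gil(|py| py.run(code, None, None)).map_err(|e| e.to_string())
}

#[cfg(not(feature = "python"))]
pub fn run_python_hook(_code: &str) -> Result<(), String> {
    // Stub keeps callers compiling when the feature is disabled.
    Err("built without the `python` feature".into())
}
```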
**Describe the bug** If two requests are sent to the server at roughly the same time, it will start to respond to both requests and then crash with the following...
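Until the underlying race is fixed, one possible mitigation, sketched below under the assumption that the crash comes from two threads driving a single non-thread-safe model concurrently, is to serialize access with a mutex (the `Model` type and `generate` method are hypothetical stand-ins):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a non-thread-safe model/session.
struct Model;

impl Model {
    fn generate(&mut self, prompt: &str) -> String {
        format!("response to: {prompt}")
    }
}

fn main() {
    // Wrap the model so only one request can drive it at a time.
    let model = Arc::new(Mutex::new(Model));

    let handles: Vec<_> = ["first request", "second request"]
        .into_iter()
        .map(|prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                // The lock serializes the two "simultaneous" requests.
                let mut guard = model.lock().unwrap();
                println!("{}", guard.generate(prompt));
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```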
Right now the host-side data of a tensor isn't freed after it is offloaded to a GPU. We should fix that to enable users to run bigger models that are split...
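A minimal sketch of the intended behavior, with hypothetical types since the real tensor internals aren't shown here: keep the host copy in an `Option` and `take()` it once the upload succeeds, so the RAM is returned immediately:

```rust
// Hypothetical sketch of freeing host memory once a tensor lives on the GPU.

struct DeviceBuffer; // stand-in for a GPU-side allocation

struct Tensor {
    host: Option<Vec<f32>>, // CPU copy; None once offloaded
    device: Option<DeviceBuffer>,
}

impl Tensor {
    fn offload_to_gpu(&mut self) {
        if let Some(data) = self.host.take() {
            // Upload `data` to the GPU; the real call depends on the backend.
            upload(&data);
            self.device = Some(DeviceBuffer);
            // `data` is dropped at the end of this scope, freeing the host
            // RAM, which is exactly the behavior this issue asks for.
        }
    }
}

// Hypothetical routine standing in for the backend-specific copy.
fn upload(_data: &[f32]) {}
```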
I'm currently facing an issue where generation on a GPU sometimes slows down, and it's very hard to determine why (see https://github.com/rustformers/llm/pull/325). It would be great if we could...
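One simple way to make such slowdowns visible, sketched here with plain `std::time::Instant` timing around a hypothetical per-token step, is to record the latency of every token and compare early against late tokens:

```rust
use std::time::Instant;

// Hypothetical per-token generation step.
fn generate_next_token(step: usize) -> u32 {
    step as u32
}

fn main() {
    let mut latencies_ms = Vec::new();

    for step in 0..32 {
        let start = Instant::now();
        let _token = generate_next_token(step);
        latencies_ms.push(start.elapsed().as_secs_f64() * 1e3);
    }

    // Compare early vs. late tokens to spot drift over the run.
    let half = latencies_ms.len() / 2;
    let avg = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    println!(
        "first half: {:.3} ms/token, second half: {:.3} ms/token",
        avg(&latencies_ms[..half]),
        avg(&latencies_ms[half..])
    );
}
```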
As pointed out in https://github.com/rustformers/llm/pull/291, the quality of embeddings produced by the models at present appears to be suboptimal. Our current approach uses the embedding of the final token as...
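One commonly used alternative to taking only the final token's embedding is mean pooling over all token embeddings (optionally masking padding tokens); a self-contained sketch:

```rust
// Mean pooling: average the per-token hidden states into one sentence embedding.
fn mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    for token in token_embeddings {
        for (acc, value) in pooled.iter_mut().zip(token) {
            *acc += value;
        }
    }
    let n = token_embeddings.len() as f32;
    pooled.iter_mut().for_each(|v| *v /= n);
    pooled
}

fn main() {
    // Three tokens with a 4-dimensional hidden state each.
    let tokens = vec![
        vec![1.0, 0.0, 2.0, 0.0],
        vec![3.0, 0.0, 0.0, 4.0],
        vec![2.0, 3.0, 1.0, 2.0],
    ];
    // Expected output: [2.0, 1.0, 1.0, 2.0]
    println!("{:?}", mean_pool(&tokens));
}
```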
**Please describe the feature you want** It would be a nice addition if the GitLab host were configurable, to easily point the Tabby server to a self-hosted GitLab instance. **Additional...