Pierrick Hymbert

Results: 98 comments of Pierrick Hymbert

Embedding models are different from generative ones: in a RAG setup you need two models. Prometheus is not required, but if it is present, metrics are exported.

Embeddings are meant to be stored in a vector DB for search. There is nothing related to completions except for RAG later on, and nothing to do with the server code.
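A minimal sketch of the point above: in a RAG setup, an embedding model fills a vector store that is searched at query time, and only then does a generative model see the retrieved text. The in-memory "vector db" and the toy vectors here are illustrative, not llama.cpp code.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class VectorStore:
    """Toy in-memory stand-in for a real vector database."""
    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text, embedding):
        self.items.append((text, embedding))

    def search(self, query_embedding, k=1):
        # Rank stored chunks by similarity to the query embedding.
        ranked = sorted(self.items,
                        key=lambda it: cosine(it[1], query_embedding),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
store.add("Beijing is the capital of China.", [0.9, 0.1, 0.0])
store.add("Paris is the capital of France.", [0.1, 0.9, 0.0])
# At query time the same embedding model encodes the question;
# the top hits become context for the separate generative model.
hits = store.search([0.85, 0.15, 0.0], k=1)
print(hits[0])
```

The key design point is the separation: the embedding model never generates, and the generative model never touches the vector store directly.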

> @phymbert From your experience in your k8s example, is the k8s Service load-balancing enough, or would you find it necessary to use a "slot aware" load-balancer?

Firstly, it's better...
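To illustrate the question being asked: a k8s Service round-robins across replicas regardless of load, while a "slot aware" balancer would route to the replica with the most free slots. The replica names and slot counts below are made up; a real implementation would read slot occupancy from each server's state rather than a static dict.

```python
import itertools

# Hypothetical snapshot of server replicas and their busy slot counts
# (the numbers and pod names are invented for illustration).
replicas = {"pod-a": 3, "pod-b": 0, "pod-c": 1}  # name -> busy slots
SLOTS_PER_REPLICA = 4

# k8s-Service-style round robin: cycles through replicas, ignoring load.
rr = itertools.cycle(replicas)

def pick_round_robin():
    return next(rr)

def pick_slot_aware():
    # Route to the replica with the most free slots right now.
    return max(replicas, key=lambda r: SLOTS_PER_REPLICA - replicas[r])

print(pick_slot_aware())
```

The trade-off: round robin needs no coordination, but a slot-aware picker avoids queueing a request on a replica whose slots are all busy while another replica sits idle.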

> ive made a pull request.

The PR is on my fork: https://github.com/phymbert/llama.cpp/pull/7. We need to bring it here somehow.

Not on k8s either, though :) Help welcomed.

> why has this not been merged?

It requires additional fixes to improve the Helm charts. You are welcome to improve the infra code mentioned above and finally submit a PR....

Cannot reproduce on a single GPU:

```shell
llama-cli --hf-repo Qwen/Qwen2-57B-A14B-Instruct-GGUF --hf-file qwen2-57b-a14b-instruct-q3_k_m.gguf -p "Beijing is the capital of" -n 64 -c 4096
```

Output:

```
/home/phymbert/workspaces/llama.cpp/cmake-build-debug/bin/llama-cli --hf-repo Qwen/Qwen2-57B-A14B-Instruct-GGUF --hf-file qwen2-57b-a14b-instruct-q3_k_m.gguf...
```

> It seems to be the same issue as the following issues: #2835

Not related... it's about a GPU split going from 2 to 4 H100s, with no Python stacktrace at all. But thanks...