Pierrick Hymbert

Results: 98 comments of Pierrick Hymbert

Embedding models are different from generative ones: in a RAG setup you need two models. Prometheus is not required, but if it is present, metrics are exported.

Embeddings are meant to be stored in a vector DB for search. There is nothing related to completions except for RAG later on, and nothing to do with the server code.
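A minimal sketch of the point above: in a RAG setup, an embedding model fills a vector store that is searched at query time, and only then does a generative model see the retrieved text. The in-memory "vector db" and the toy vectors here are illustrative, not llama.cpp code.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class VectorStore:
    """Toy in-memory stand-in for a real vector database."""
    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text, embedding):
        self.items.append((text, embedding))

    def search(self, query_embedding, k=1):
        # Rank stored chunks by similarity to the query embedding.
        ranked = sorted(self.items,
                        key=lambda it: cosine(it[1], query_embedding),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
store.add("Beijing is the capital of China.", [0.9, 0.1, 0.0])
store.add("Paris is the capital of France.", [0.1, 0.9, 0.0])
# At query time the same embedding model encodes the question;
# the top hits become context for the separate generative model.
hits = store.search([0.85, 0.15, 0.0], k=1)
print(hits[0])
```

The key design point is the separation: the embedding model never generates, and the generative model never touches the vector store directly.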

> @phymbert From your experience in your k8s example, is the k8s Service load-balancing enough, or would you find it necessary to use a "slot aware" load-balancer?

Firstly, it's better...
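To illustrate the question being asked: a k8s Service round-robins across replicas regardless of load, while a "slot aware" balancer would route to the replica with the most free slots. The replica names and slot counts below are made up; a real implementation would read slot occupancy from each server's state rather than a static dict.

```python
import itertools

# Hypothetical snapshot of server replicas and their busy slot counts
# (the numbers and pod names are invented for illustration).
replicas = {"pod-a": 3, "pod-b": 0, "pod-c": 1}  # name -> busy slots
SLOTS_PER_REPLICA = 4

# k8s-Service-style round robin: cycles through replicas, ignoring load.
rr = itertools.cycle(replicas)

def pick_round_robin():
    return next(rr)

def pick_slot_aware():
    # Route to the replica with the most free slots right now.
    return max(replicas, key=lambda r: SLOTS_PER_REPLICA - replicas[r])

print(pick_slot_aware())
```

The trade-off: round robin needs no coordination, but a slot-aware picker avoids queueing a request on a replica whose slots are all busy while another replica sits idle.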

> ive made a pull request.

The PR is on my fork: https://github.com/phymbert/llama.cpp/pull/7. We need to bring it here somehow.

Not on k8s either, though :) Help welcomed.

> why has this not been merged?

It requires additional fixes to improve the Helm charts. You are welcome to improve the infra code mentioned above and finally submit a PR....

Cannot reproduce on a single GPU:

```shell
llama-cli --hf-repo Qwen/Qwen2-57B-A14B-Instruct-GGUF --hf-file qwen2-57b-a14b-instruct-q3_k_m.gguf -p "Beijing is the capital of" -n 64 -c 4096
```

Output:

```
/home/phymbert/workspaces/llama.cpp/cmake-build-debug/bin/llama-cli --hf-repo Qwen/Qwen2-57B-A14B-Instruct-GGUF --hf-file qwen2-57b-a14b-instruct-q3_k_m.gguf...
```

> It seems to be the same issue as the following issues: #2835

Not related... it's about a GPU split going from 2 to 4 H100s, with no Python stacktrace at all. But thanks...