Kubernetes example
Motivation
Kubernetes is widely used in the industry to deploy products and applications at scale.
It would be useful for the community to have a llama.cpp Helm chart for the server.
I started this several weeks ago and will continue when I have more time; in the meantime, any help is welcome:
https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes
References
- #6545
Hi! I will take this up!
Great @OmegAshEnr01n, a few notes (a rough values.yaml sketch follows this list):
- I think we need two subcharts: one for embeddings, one for generation/completions
- the schema in my branch probably needs updating, as the model will now be downloaded by the server directly, and the related Job should be removed
- we need to support both HF URL parameters and a raw URL for an internal model repository like Artifactory
- metrics scraping must work for Prometheus community (with the PodMonitoring resource), enterprise, and ideally Dynatrace
- the PVC must persist after the Helm release is uninstalled
- auto-scaling can be done later on, but it is a must-have
- ideally the Helm chart should be built by the CI and installable from gh-pages
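A rough sketch of what the values.yaml could look like for these points; all key names and defaults below are illustrative, not the actual schema of the branch linked above:

```yaml
# Hypothetical values.yaml layout; key names and defaults are illustrative only.
completions:                      # subchart for the generation/completions server
  enabled: true
  model:
    # either Hugging Face parameters ...
    hfRepo: ""                    # e.g. "Qwen/Qwen2.5-7B-Instruct-GGUF"
    hfFile: ""                    # e.g. "qwen2.5-7b-instruct-q4_k_m.gguf"
    # ... or a raw URL to an internal repository such as Artifactory
    url: ""
  persistence:
    enabled: true
    keepOnUninstall: true         # render helm.sh/resource-policy: keep on the PVC
    size: 50Gi
  metrics:
    enabled: true                 # create a ServiceMonitor/PodMonitoring for /metrics
  autoscaling:
    enabled: false                # must-have, but can land later

embeddings:                       # subchart for the embeddings server
  enabled: true
  model:
    hfRepo: ""
    url: ""
```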
Ping here if you have questions. Good luck! Excited to use it.
Hi @OmegAshEnr01n, are you still working on this issue?
Yes, still am. Will share a pull request over the weekend when completed.
Hi @phymbert
What is the architectural reason for having embeddings live in a separate deployment from the model? Requiring that would mean we would need to make changes to the HTTP server. Instead, we could have an architecture where the model and embeddings are tightly coupled, something like this (with a values sketch after the template):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        {{- range $i, $container := .Values.containers }}
        - name: my-container-{{ $i }}
          image: {{ $container.image }}
          volumeMounts:
            - name: data-volume-{{ $i }}
              mountPath: /data
        {{- end }}
      volumes:
        {{- range $i, $container := .Values.containers }}
        - name: data-volume-{{ $i }}
          persistentVolumeClaim:
            claimName: pvc-{{ $i }}
        {{- end }}
```
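For reference, a values.yaml along these lines could drive the template above; the image reference is only an example, and a matching pvc-0, pvc-1, ... is assumed to already exist:

```yaml
# Hypothetical values for the template above; one PVC per entry is assumed to exist.
containers:
  - image: ghcr.io/ggerganov/llama.cpp:server   # generation/completions server
  - image: ghcr.io/ggerganov/llama.cpp:server   # embeddings server (e.g. started with --embedding)
```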
On another note, what is the intended use of Prometheus? Do you need it to live alongside the helm chart or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on it as well.
Embedding models are different from the generative ones. In a RAG setup you need two models.
Prometheus is not required, but if it is present, metrics are exported.
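For the Prometheus-operator case, a minimal sketch of such an optional scrape config could look like this; the name, label and port are assumptions about the chart, and the server itself exposes /metrics when started with --metrics:

```yaml
# Minimal sketch: scrape the llama.cpp server metrics with the Prometheus operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llamacpp-server            # hypothetical release name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: llamacpp-server
  endpoints:
    - port: http                   # Service port in front of the llama.cpp HTTP server
      path: /metrics               # exposed when the server is started with --metrics
      interval: 30s
```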
Ok, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include the option to send embeddings for completions. That would need to be written before the helm chart can be completed. Kindly correct me if I'm wrong.
Embeddings are meant to be stored in a vector DB for search. There is nothing related to completions, except RAG later on. Nothing needs to change in the server code.
@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀
Not yet. Currently testing it on a personal kube cluster with separate node selectors.
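For anyone following along, "separate node selectors" just means pinning each workload to its own node pool, e.g. something like the following in each subchart's values; the label key and values are hypothetical:

```yaml
# Hypothetical per-subchart node pinning so the two servers land on different nodes.
completions:
  nodeSelector:
    llamacpp/pool: generation     # hypothetical node label for the generation pool
embeddings:
  nodeSelector:
    llamacpp/pool: embeddings     # hypothetical node label for the embeddings pool
```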
@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:
> Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.
From your experience in your k8s example is the k8s Service load-balancing enough or would you find it necessary to use a "slot aware" load-balancer?
/cc @mcharytoniuk
> @phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:
> Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.
Thanks for the mention. I maintain that point. Of course round robin will work, and "least connections" will be better (though it does not have to reflect how many slots are being used), but the issue is that prompts can take a long, varying time to finish. With round robin it is very possible to distribute the load unevenly (for example if one of the servers was unlucky and is still processing a few huge prompts). To me the ideal is balancing based on slots, with a request queue on top of that (which I plan to add to paddler, btw :)). I love the slots idea because it makes the infra really predictable.
> @phymbert From your experience in your k8s example is the k8s Service load-balancing enough or would you find it necessary to use a "slot aware" load-balancer?
Firstly, it's better to use the native llama.cpp KV cache: if you have k8s nodes with 2-4 A/H100s, having one pod per node that uses all the VRAM, with as many slots and as much cache as possible for the server, will give you maximum performance, but not HA. Then, regarding load balancing, I tested IP affinity, round robin and least connections; no significant differences were found. I think it depends on the dataset/use case or client distribution.
Maybe an interesting approach would be to prioritize upfront based on input token size. Nonetheless, you cannot predict output token size.
I mainly faced issues with long-lived HTTP connections; IMHO we need a better architecture for this than SSE.
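One hedged way to approximate slot awareness without an external balancer is to autoscale on the server's busy-slot metric (llamacpp:requests_processing) surfaced through something like prometheus-adapter; the metric name as seen by the HPA, the deployment name and the thresholds below are all assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llamacpp-server                # hypothetical deployment/release name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llamacpp-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: llamacpp_requests_processing   # assumes prometheus-adapter exposes the server metric under this name
        target:
          type: AverageValue
          averageValue: "3"                    # scale out once pods average ~3 busy slots
```

This only smooths load at the replica level; it does not make the Service itself slot-aware, so a balancer like paddler still adds value for per-request routing.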
@phymbert I've made a pull request.
> I've made a pull request.
The PR is on my fork:
https://github.com/phymbert/llama.cpp/pull/7
We need to bring it here somehow
Hope to meet soon
Was this dropped?
Neither is k8s, though ;) Help welcomed.
why has this not been merged?
> why has this not been merged?
It requires additional fixes to improve the helm charts. You are welcome to improve the infra code mentioned above and finally submit a PR.
Probably, users with a Kubernetes setup are GPU-rich. I believe the llama.cpp spirit is to focus on on-device/edge deployment, AFAICT.
I have a (very opinionated) helm chart here. It requires KubeElasti, LiteLLM and the Prometheus operator to be installed.
You can add multiple models deployed with llama.cpp and it will update the LiteLLM configmap. It also allows 'scale to zero' with KubeElasti, and defaults to scaling to zero when the llamacpp:requests_processing Prometheus metric is less than 1. This is my attempt to make a Kubernetes alternative to llama-swap.
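For illustration, a generic LiteLLM model_list entry pointing at a llama.cpp Service might look like the sketch below (not the exact output of my chart; model name, Service host and port are hypothetical):

```yaml
# Sketch of a LiteLLM proxy model_list entry routing to a llama.cpp Service.
model_list:
  - model_name: qwen2.5-7b
    litellm_params:
      model: openai/qwen2.5-7b                                     # llama.cpp serves an OpenAI-compatible API
      api_base: http://llamacpp-qwen.models.svc.cluster.local:8080/v1
      api_key: "none"                                              # llama.cpp requires no key unless --api-key is set
```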
It's implemented here and designed specifically for my homelab cluster with two Strix Halo machines. It allows me to have a collection of models available and cached, but only scaled up when requests come to the underlying llama.cpp containers.
In theory, it's agnostic to llama.cpp, but I haven't tried any other LLM inference runtimes.
I am currently running into a problem where, if I force-restart a deployment while an inference job is running, the inference continues. In most cases this is desired behavior, but I want to force-stop those jobs. I'll keep searching for a true 'kill switch'... This is similar to this.
Hopefully this provides some inspiration; I think my chart is too opinionated for the llama.cpp repo, but if there's interest, I can create a separate repo for it.