kubernetes example

Open · phymbert opened this issue 1 year ago • 16 comments

Motivation

Kubernetes is widely used in the industry to deploy products and applications at scale.

It would be useful for the community to have a llama.cpp Helm chart for the server.

I started this several weeks ago and will continue when I have more time; meanwhile, any help is welcome:

https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes

References

  • #6545

phymbert · Apr 08 '24

Hi! I will take this up!

OmegAshEnr01n · Apr 10 '24

Great @OmegAshEnr01n, a few notes (rough values.yaml sketch at the end):

  • I think we need 2 subcharts: one for embeddings, one for generation/completions
  • we probably need to update the schema in my branch, as the model is now downloaded by the server directly, so the related Job should be removed
  • we need to support both HF URL parameters and a raw URL for internal model repositories like Artifactory
  • metrics scraping must work for the Prometheus community operator (with the PodMonitoring resource), the enterprise one, and ideally Dynatrace
  • the PVC must be kept after the helm release is uninstalled
  • auto scaling can be done later on, but it is a must-have
  • ideally the chart should be built by the CI and installable from gh-pages

Ping here if you have questions. Good luck! Excited to use it.
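
A rough sketch of how these points could map onto a values.yaml; all key names here are only a proposal, not the final schema:

# Hypothetical values.yaml layout for the notes above
generation:
  enabled: true
  model:
    # either a Hugging Face repo/file (downloaded by the server itself) ...
    hf:
      repo: ""
      file: ""
    # ... or a raw URL for an internal repository such as Artifactory
    url: ""
  extraArgs: []
embeddings:
  enabled: true
  extraArgs: ["--embedding"]   # llama.cpp server flag enabling the embeddings endpoint
persistence:
  keepOnUninstall: true        # e.g. via the helm.sh/resource-policy: keep annotation on the PVC
metrics:
  enabled: true                # expose /metrics (server started with --metrics) for scraping
autoscaling:
  enabled: false               # to be added later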

phymbert · Apr 10 '24

Hi @OmegAshEnr01n, are you still working on this issue?

phymbert · Apr 16 '24

Yes, still am. Will share a pull request over the weekend when completed.

OmegAshEnr01n · Apr 17 '24

Hi @phymbert

What is the architectural reason for having embeddings live in a separate deployment from the model? Requiring that would mean making changes to the HTTP server. Instead, we could have an architecture where the model and embeddings are tightly coupled, something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      # render one container per entry in .Values.containers
      {{- range $i, $container := .Values.containers }}
      - name: my-container-{{ $i }}
        image: {{ $container.image }}
        volumeMounts:
        - name: data-volume-{{ $i }}
          mountPath: /data
      {{- end }}
      volumes:
      # matching PVC-backed volume for each container
      {{- range $i, $container := .Values.containers }}
      - name: data-volume-{{ $i }}
        persistentVolumeClaim:
          claimName: pvc-{{ $i }}
      {{- end }}

On another note, what is the intended use of Prometheus? Do you need it to live alongside the Helm chart, or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on that as well.

OmegAshEnr01n · Apr 25 '24

Embedding models are different from generative ones. In a RAG setup you need two models.

Prometheus is not required, but if it is present, metrics are exported.
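
For example, with the prometheus-operator CRDs the chart could ship a ServiceMonitor gated on a values flag; the template helper names below are placeholders, and only the /metrics path (enabled with the server's --metrics flag) comes from the server itself:

{{- if .Values.metrics.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "llama-cpp.fullname" . }}
spec:
  selector:
    matchLabels:
      {{- include "llama-cpp.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http          # the service port exposing the llama.cpp server
      path: /metrics      # available when the server runs with --metrics
      interval: 15s
{{- end }}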

phymbert · Apr 25 '24

OK, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include the option to send embeddings for completions. That would need to be written before the Helm chart can be completed. Kindly correct me if I'm wrong.

OmegAshEnr01n · Apr 27 '24

Embeddings are meant to be stored in a vector DB for search. There is nothing related to completions except RAG later on; nothing needs to change in the server code.

phymbert · Apr 27 '24

@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀

ceddybi · May 03 '24

Not yet. Currently testing it on a personal kube cluster with separate node selectors.

OmegAshEnr01n · May 05 '24

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

From your experience with your k8s example, is the k8s Service load balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

/cc @mcharytoniuk

Perdjesk · Jun 20 '24

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

Thanks for the mention. I maintain that point. Of course round robin will work, and "least connections" will be better (though it does not necessarily reflect how many slots are being used), but the issue is that prompts can take a long and varying time to finish. With round robin it is very possible to distribute the load unevenly (for example, if one of the servers was unlucky and is still processing a few huge prompts). To me the ideal is balancing based on slots, with a request queue on top of that (which I plan to add to paddler, btw :)). I love the slots idea because it makes the infra really predictable.

mcharytoniuk · Jun 20 '24

@phymbert From your experience with your k8s example, is the k8s Service load balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

Firstly, it's better to use the native llama.cpp KV cache, so if you have k8s nodes with 2-4 A/H100s, having one pod per node using all the VRAM, with as many slots and as much cache as possible for the server, will give you maximum performance, but not HA. Then, regarding load balancing, I tested IP affinity, round robin, and least connections, and found no significant differences. I think it depends on the dataset/use case and client distribution.

Maybe an interesting approach would be to prioritize upfront based on input token size. Nonetheless, you cannot predict output token size.

I mainly faced issues with long-lived HTTP connections; IMHO we need a better architecture for this than SSE.
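
To make the first point concrete, here is a minimal sketch of that layout (image tag, model path, labels, slot count, and GPU count are illustrative assumptions, not taken from the chart):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-server
spec:
  replicas: 2                          # one pod per GPU node
  selector:
    matchLabels:
      app: llama-cpp-server
  template:
    metadata:
      labels:
        app: llama-cpp-server
    spec:
      affinity:
        podAntiAffinity:               # spread the pods across nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: llama-cpp-server
      containers:
        - name: server
          image: ghcr.io/ggerganov/llama.cpp:server-cuda   # illustrative tag
          args: ["-m", "/models/model.gguf", "--host", "0.0.0.0",
                 "-ngl", "999", "--parallel", "8", "--metrics"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 2        # claim all GPUs on the node
---
apiVersion: v1
kind: Service
metadata:
  name: llama-cpp-server
spec:
  selector:
    app: llama-cpp-server
  sessionAffinity: ClientIP            # the "IP affinity" variant; drop for plain round robin
  ports:
    - port: 8080
      targetPort: 8080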

phymbert · Jul 07 '24

@phymbert I've made a pull request.

OmegAshEnr01n · Jul 21 '24

I've made a pull request.

The PR is on my fork:

https://github.com/phymbert/llama.cpp/pull/7

We need to bring it here somehow

phymbert · Jul 22 '24

Hope to meet soon

anencore94 · Aug 08 '24

Was this dropped?

Lutherwaves · Mar 28 '25

Neither k8s, though :) Help welcome.

phymbert · Mar 29 '25

why has this not been merged?

nmwael · Aug 27 '25

why has this not been merged?

It requires additional fixes to improve the Helm charts. You are welcome to improve the infra code mentioned above and finally submit a PR.

Probably, users with a Kubernetes setup are GPU-rich. I believe the spirit of llama.cpp is to focus on on-device/edge deployment, AFAICT.

phymbert · Aug 30 '25

I have a (very opinionated) Helm chart here. It requires KubeElasti, LiteLLM, and the Prometheus operator to be installed.

You can add multiple models deployed with llama.cpp, and it will update the LiteLLM ConfigMap. It also allows "scale to zero" with KubeElasti, and defaults to scaling to zero if the llamacpp:requests_processing Prometheus metric is less than 1. This is my attempt at a Kubernetes alternative to llama-swap.

It's implemented here and designed specifically for my homelab cluster with two Strix Halo machines. It lets me keep a collection of models available and cached, but only scaled up when requests come in to the underlying llama.cpp containers.

In theory, it's agnostic to llama.cpp, but I haven't tried any other LLM inference runtimes.

I am currently running into a problem where, if I force-restart a deployment while an inference job is running, the inference continues. In most cases this is desired behavior; however, I want to be able to force-stop those jobs. I'll keep searching for a true "kill switch"... This is similar to this.

Hopefully this provides some inspiration; I think my chart is too opinionated for the llama.cpp repo, but if there's interest, I can create a separate repo for it.

blake-hamm · Oct 27 '25