LLM Scaling and Load Balancing
I haven't seen the talks yet, but it's an interesting topic. Have you seen Paddler before? It's an AI app builder and load balancer for hosting LLMs locally. It uses llama.cpp under the hood and proxies requests based on processing slots. It also has HA, service discovery, and recovery via an agent that supervises instances and changes models and parameters in real time. Nowadays I see it as a base for building conversational apps, and it's very promising on the infra side because it can scale from zero nodes thanks to request buffering. I'm an avid contributor there, so feel free to reach out in the community, or ask me directly if you have questions about the project or want to experiment with it. I would be happy to help.
@Propfend Thanks for the recommendation—it’s interesting. I don’t host any LLM infrastructure in-house yet, so I haven’t had a chance to try things like this.