Server reports no available agents despite CPU utilization being under 20%
I've deployed a LiveKit agent server to Render.com, running in Docker. The server is configured with 4GB of RAM and 2 CPUs.
The LiveKit server stopped spawning new agents and reported "worker is at full capacity, marking as unavailable".
During that time CPU utilization peaked at 17%, well under the default load threshold of 0.75.
Could it be because we're not correctly reading the cgroup CPU usage? https://github.com/livekit/agents/blob/0d608925d3981f9c3734b922ea8053063e6b35ef/livekit-agents/livekit/agents/utils/hw/cpu.py#L77
It's possible if this is cgroups v1, but in that case it'd read the entire machine's CPU usage, so it would overcommit rather than undercommit.
@toonverbeek what is the docker base image and platform?
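One way to answer the cgroups v1 vs v2 question from inside the container is to look at the files under the cgroup mount. The sketch below is a minimal check, not code from livekit-agents. It assumes the usual mount point `/sys/fs/cgroup`, that a v2 (unified) hierarchy exposes `cgroup.controllers` at its root, and that a v1 hierarchy exposes `cpu/cpu.cfs_quota_us`. The function name `cgroup_version` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def cgroup_version(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Guess the cgroup version visible under `root` (hypothetical helper).

    Returns 2 for a unified (v2) hierarchy, 1 for a legacy (v1) CPU
    controller, or None when neither marker file is present.
    """
    base = Path(root)
    # Assumption: cgroup v2 exposes cgroup.controllers at the hierarchy root.
    if (base / "cgroup.controllers").is_file():
        return 2
    # Assumption: cgroup v1 mounts the CPU controller under cpu/.
    if (base / "cpu" / "cpu.cfs_quota_us").is_file():
        return 1
    return None


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
```

Running this inside the deployed container, alongside the base image name, should show whether the `/sys/fs/cgroup/cpu.max` path can exist at all.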
@davidzhao Thanks for getting back to me. We're running on python:3.11.6-slim: https://hub.docker.com/layers/library/python/3.11.6-slim/images/sha256-f07bce5332359289dbfb906c9f8007a08f59a411404dcef9be4580b77e0f6951
Hey @toonverbeek, did you find any leads on this issue? I'm facing the same problem.
No, nothing new on my side unfortunately.
Hi, when the /sys/fs/cgroup/cpu.max path is not available, the code incorrectly assumes 1 CPU core when calculating load via the cgroup method.
This is a bug. It would possibly be better to just fall back to psutil in that case, or to check the cpuset.cpus.effective file instead:
cat /sys/fs/cgroup/cpuset.cpus.effective
0-15
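The fallback order suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not the livekit-agents implementation: it assumes the cgroup v2 `cpu.max` format (`<quota> <period>`, or `max <period>` when no quota is set) and the kernel's CPU list format in `cpuset.cpus.effective` (for example `0-15` or `0-3,8`). The helper names `parse_cpu_max`, `parse_cpuset`, and `effective_cpus` are hypothetical, and `os.cpu_count()` stands in for psutil as the last resort to keep the sketch dependency-free.

```python
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content into a CPU count, or None if unlimited."""
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # "max" means no quota is set
    quota, period = int(parts[0]), int(parts[1])
    return quota / period if period > 0 else None


def parse_cpuset(text: str) -> int:
    """Count the CPUs in a list such as "0-15" or "0-3,8,10-11"."""
    count = 0
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        # A single CPU ("8") has no range end, so it counts as one.
        count += int(last or first) - int(first) + 1
    return count


def effective_cpus(root: str = "/sys/fs/cgroup") -> float:
    """Try cpu.max, then cpuset.cpus.effective, then the host CPU count."""
    try:
        limit = parse_cpu_max(Path(root, "cpu.max").read_text())
        if limit:
            return limit
    except (OSError, ValueError):
        pass  # file missing or malformed: try the next source
    try:
        cpus = parse_cpuset(Path(root, "cpuset.cpus.effective").read_text())
        if cpus > 0:
            return float(cpus)
    except (OSError, ValueError):
        pass
    return float(os.cpu_count() or 1)


if __name__ == "__main__":
    print("effective CPUs:", effective_cpus())
```

With the `0-15` output shown above, `parse_cpuset` counts 16 CPUs, so the load calculation would divide by 16 rather than by a hard-coded 1.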
@michael-salient would you like to open a PR with the proposed change?
@davidzhao, I was mistaken. It doesn't read 1 CPU, it reads the max quota, which will overcommit, not undercommit.
Still, if you're interested in finding the other places where cpu.max could be located, there is this PR: https://github.com/livekit/agents/pull/3239
Hello, I'm having the same problem. When I have more than four simultaneous rooms, I get the message "Worker is at full capacity, marking as unavailable."
In production we would like to reach 50 users without disconnection or lag problems.
Does anyone have any leads?
Thanks