Server reports no available agents despite CPU utilization being under 20%

Open toonverbeek opened this issue 7 months ago • 3 comments

I've deployed a LiveKit agent server to Render.com, running in Docker. The server is configured with 4 GB of RAM and 2 CPUs.

The LiveKit server was not spawning new agents and was reporting "worker is at full capacity, marking as unavailable".

During that time, CPU utilization peaked at 17%, well under the default load threshold of 0.75 (75%).

toonverbeek avatar Apr 25 '25 15:04 toonverbeek

Could it be because we're not correctly reading the cgroup CPU usage? https://github.com/livekit/agents/blob/0d608925d3981f9c3734b922ea8053063e6b35ef/livekit-agents/livekit/agents/utils/hw/cpu.py#L77

theomonnom avatar May 03 '25 23:05 theomonnom

> Could it be because we're not correctly reading the cgroup CPU usage?

It's possible if this is cgroups v1, but in that case it'd read the entire machine's CPU, so it would overcommit instead of undercommit.

@toonverbeek what are the Docker base image and platform?
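
To answer the cgroups v1 versus v2 question from inside the container, a small diagnostic along these lines could help. This is only a sketch, not code from the agents library: the helper names `detect_cgroup_version` and `read_text_or_none` are hypothetical, and it assumes the cgroup filesystem is mounted at the usual `/sys/fs/cgroup` location.

```python
import os
from pathlib import Path
from typing import Optional


def detect_cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1", or "unknown" based on which control files exist under root."""
    base = Path(root)
    # cgroups v2 (unified hierarchy) exposes cgroup.controllers at the mount root.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # cgroups v1 mounts one directory per controller, commonly cpu/ or cpu,cpuacct/.
    for controller_dir in ("cpu", "cpu,cpuacct"):
        if (base / controller_dir / "cpu.cfs_quota_us").is_file():
            return "v1"
    return "unknown"


def read_text_or_none(path: str) -> Optional[str]:
    """Read a small control file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    root = "/sys/fs/cgroup"
    print("cgroup version:", detect_cgroup_version(root))
    print("cpu.max (v2):", read_text_or_none(f"{root}/cpu.max"))
    print("cpu.cfs_quota_us (v1):", read_text_or_none(f"{root}/cpu/cpu.cfs_quota_us"))
    print("os.cpu_count():", os.cpu_count())
```

Running this inside the deployed container and comparing the reported quota with the 2 CPUs configured on the host plan would show whether the limit is visible to the process at all.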

davidzhao avatar May 04 '25 05:05 davidzhao

@davidzhao Thanks for getting back to me. We're running on python:3.11.6-slim: https://hub.docker.com/layers/library/python/3.11.6-slim/images/sha256-f07bce5332359289dbfb906c9f8007a08f59a411404dcef9be4580b77e0f6951

toonverbeek avatar May 05 '25 06:05 toonverbeek

Hey @toonverbeek, did you find any leads on this issue? I am also facing the same problem.

jitendrakumar025 avatar Jun 30 '25 20:06 jitendrakumar025

> Hey @toonverbeek, did you find any leads on this issue? I am also facing the same problem.

No, nothing new on my side unfortunately.

toonverbeek avatar Jul 01 '25 11:07 toonverbeek

Hi, when the /sys/fs/cgroup/cpu.max path is not available, it incorrectly assumes 1 CPU core when calculating load via the cgroup method.

This is a bug. It would probably be better to fall back to psutil in that case, or to check the cpuset.cpus.effective file instead:

cat /sys/fs/cgroup/cpuset.cpus.effective 
0-15
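
As a rough illustration of that fallback order, here is a minimal sketch, not the library's actual code. The function names are hypothetical, and `os.cpu_count()` stands in for the psutil fallback so that the example needs only the standard library. It assumes the cgroups v2 `cpu.max` format of `<quota> <period>`, where a quota of `max` means no limit.

```python
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroups v2 cpu.max ("<quota> <period>"); None means no quota is set."""
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None
    quota, period = int(parts[0]), int(parts[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def parse_cpuset(text: str) -> int:
    """Count the CPUs in a cpuset list such as "0-15" or "0-3,8,10-11"."""
    count = 0
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            low, high = chunk.split("-")
            count += int(high) - int(low) + 1
        else:
            count += 1
    return count


def effective_cpu_count(root: str = "/sys/fs/cgroup") -> float:
    """Try cpu.max first, then cpuset.cpus.effective, then the host CPU count."""
    base = Path(root)
    try:
        limit = parse_cpu_max((base / "cpu.max").read_text())
        if limit is not None:
            return limit
    except (OSError, ValueError):
        pass  # file missing or malformed: try the next source
    try:
        cpus = parse_cpuset((base / "cpuset.cpus.effective").read_text())
        if cpus > 0:
            return float(cpus)
    except (OSError, ValueError):
        pass
    # Last resort: the host-wide count (stand-in for the psutil fallback).
    return float(os.cpu_count() or 1)
```

With the `0-15` output shown above and no readable `cpu.max`, this sketch would report 16 CPUs rather than assuming a single core.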

michael-salient avatar Aug 19 '25 23:08 michael-salient

@michael-salient would you like to open a PR with the proposed change?

davidzhao avatar Aug 22 '25 04:08 davidzhao

@davidzhao, I was mistaken. It doesn't read 1 CPU; it reads the max quota, which will overcommit, not undercommit.

Still, if you're interested in finding the other places where cpu.max could be, there is this PR: https://github.com/livekit/agents/pull/3239
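
To make the overcommit versus undercommit point concrete, here is a small worked example. It is an illustrative sketch under stated assumptions, not the library's load calculation: the function names are hypothetical, load is assumed to be busy cores divided by the assumed CPU count, and a worker is assumed to be available while load stays below the 0.75 threshold mentioned at the top of the thread. The 16-CPU figure is taken from the `0-15` cpuset example above, and the 1.8 busy cores is a made-up value.

```python
def normalized_load(cores_busy: float, cpu_count: float) -> float:
    """Busy CPU time expressed as a fraction of the CPUs assumed to be available."""
    return cores_busy / cpu_count


def is_available(load: float, threshold: float = 0.75) -> bool:
    """Assumed rule: the worker accepts jobs while load is below the threshold."""
    return load < threshold


# Hypothetical container limited to 2 CPUs on a 16-CPU host, with 1.8 cores busy.
load_with_quota = normalized_load(1.8, cpu_count=2)   # judged against the real limit
load_with_host = normalized_load(1.8, cpu_count=16)   # judged against the host count

print(is_available(load_with_quota))  # False: correctly marked as full
print(is_available(load_with_host))   # True: keeps accepting jobs (overcommit)

# The reported case: 17% utilization of 2 CPUs is about 0.34 busy cores.
reported = normalized_load(0.34, cpu_count=2)
print(is_available(reported))                                  # True
print(is_available(normalized_load(0.34, cpu_count=1)))        # True even if 1 CPU is assumed
```

Under these assumptions, overestimating the CPU count only makes the worker look less loaded, and even a single-core assumption would not push the reported 17% past 0.75, so a miscounted CPU limit alone does not explain the "full capacity" message.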

michael-salient avatar Aug 22 '25 23:08 michael-salient

Hello, I'm having the same problem. When I have more than four simultaneous rooms, I get the message "Worker is at full capacity, marking as unavailable."

In production, we would like to reach 50 users without disconnections or lag.

Does anyone have any leads?

Thanks

ANTONF31 avatar Aug 28 '25 08:08 ANTONF31