
Improve cpu_count by os.sched_getaffinity

Open PMExtra opened this issue 2 years ago • 10 comments

Problem Statement

I deployed self-hosted Sentry on a large-scale server. We split the server into many LXC containers, and then run Docker nested inside those LXC containers.

For example, I have a physical server with 112 cores, and on it an LXC container with an 8-core CPU limit. I deployed self-hosted Sentry inside that LXC container.

The problem is that Sentry started so many workers that they exhausted the available CPU and memory.

I found that:

  • Sentry thinks the CPU count is 112 rather than 8, so it started 112 worker processes.
  • docker exec sentry-worker-1 python3 -c "from multiprocessing import cpu_count; print(cpu_count())" returns 112.
  • docker exec sentry-worker-1 python3 -c "import os; print(os.cpu_count())" returns 112.
  • docker exec sentry-worker-1 python3 -c "import os; print(len(os.sched_getaffinity(0)))" returns 8.
  • docker exec sentry-worker-1 nproc returns 8.

References:

  • https://bugs.python.org/issue36054

Solution Brainstorm

I think we should replace all uses of cpu_count() with len(os.sched_getaffinity(0)).

Product Area

Performance

PMExtra avatar May 05 '23 03:05 PMExtra

Assigning to @getsentry/support for routing, due by (sfo). ⏲️

getsantry[bot] avatar May 05 '23 03:05 getsantry[bot]

Quick lookup here: it looks like we may run into issues, since os.sched_getaffinity(0) is not available on some Unix platforms such as macOS.

hubertdeng123 avatar May 08 '23 16:05 hubertdeng123

https://stackoverflow.com/a/55423170/6401252

Can we check whether sched_getaffinity is available and otherwise fall back to cpu_count?
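For illustration, a minimal sketch of such a check-and-fallback (the helper name available_cpu_count is hypothetical, not an existing Sentry function):

import multiprocessing
import os

def available_cpu_count():
    # sched_getaffinity reports the CPUs this process may actually run on
    # (it respects cpusets applied by LXC/Docker), but it is not available
    # on every platform, e.g. macOS.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total CPU count of the machine.
    return multiprocessing.cpu_count()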

PMExtra avatar May 09 '23 02:05 PMExtra

As a workaround, could you add an option in your docker-compose.yml or docker-compose.override.yml file to set the concurrency for your worker? Your use case seems pretty specific, so I'm wondering whether this will work for you.

https://github.com/getsentry/sentry/blob/master/src/sentry/runner/commands/run.py#L290
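Something along these lines in docker-compose.override.yml should work; the worker service name and the run worker command mirror the existing compose file, and the value 8 is just an example matching the container's CPU limit:

services:
  worker:
    command: run worker --concurrency 8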

hubertdeng123 avatar May 09 '23 19:05 hubertdeng123

@hubertdeng123 Sure, it's my current solution. I've modified the docker-compose.yml and it works.

But I think it would be better if we could make it determine the CPU count correctly, or at least provide an environment variable to configure the worker count.

Maybe we can do something like:

services:
  worker:
    command: run worker --concurrency ${SENTRY_WORKER_COUNT:-$(nproc)}

PMExtra avatar May 10 '23 01:05 PMExtra

I found that ${SENTRY_WORKER_COUNT:-$(nproc)} only works in a shell (e.g. entrypoint.sh). It does not work in the command of docker-compose.yml. That's why I said do something LIKE that.

PMExtra avatar May 10 '23 01:05 PMExtra

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

github-actions[bot] avatar Jun 01 '23 00:06 github-actions[bot]

I revisited this question. The core issue is that Python's cpu_count() does not take the CPU limits from cgroups into account, while nproc does.

> cat /sys/fs/cgroup/cpuset.cpus.effective 
1,3-4,8,13-14,16,19
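For what it's worth, a rough sketch of how a cpuset list like the one above expands into a CPU count (parse_cpuset is just an illustrative helper, not something Python or nproc actually uses):

def parse_cpuset(spec):
    # Expand a cpuset list such as "1,3-4,8,13-14,16,19" into individual CPU ids.
    cpus = set()
    for part in spec.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# {1, 3, 4, 8, 13, 14, 16, 19} -> 8 CPUs, matching nproc and sched_getaffinity
print(len(parse_cpuset("1,3-4,8,13-14,16,19")))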

PMExtra avatar Jun 01 '23 01:06 PMExtra

I think we should label it Status: Backlog at least. This is indeed a problem, although limited in scope.

PMExtra avatar Jun 01 '23 01:06 PMExtra

> @hubertdeng123 Sure, it's my current solution. I've modified the docker-compose.yml and it works.
>
> But I think it would be better if we could make it determine the CPU count correctly, or at least provide an environment variable to configure the worker count.
>
> Maybe we can do something like:
>
> services:
>   worker:
>     command: run worker --concurrency ${SENTRY_WORKER_COUNT:-$(nproc)}

@PMExtra could you please explain what exactly you have added to docker-compose.yml to explicitly tell Sentry the --concurrency value? We're also on LXC and experiencing the same issue.

manyjuice avatar Mar 18 '24 12:03 manyjuice