pylint icon indicating copy to clipboard operation
pylint copied to clipboard

pylint -j0 always uses 1 core on real hardware in Linux

Open dsilakov opened this issue 3 years ago • 7 comments
trafficstars

Bug description

-j0 should automatically detect number of processors available. However, after the changes that introduce _cpu_count() function for calculation available resources in clouds, this works incorrectly on real hardware.

A potential problem lies here: https://github.com/PyCQA/pylint/blob/main/pylint/lint/run.py#L59 If we reach this code (we don't have cpu.cfs_quota_us set) then we always divide /sys/fs/cgroup/cpu/cpu.shares by 1024. Maybe this is correct for AWS, but e.g. in ubuntu 20.04 cpu.shares is set to 1024 by default. So it doesn't matter how many cpu cores do you have, pylint -j0 will always use only one of them.

The whole idea of using cpu.shares in such a way looks strange, since it just declares an allowed "share" of CPU resources for the particular process.

Configuration

Ubuntu 20.04 x86_64, real hw, 12 cores

Command used

$ time pylint -j0 pylint/*
$ time pylint -j1 pylint/*
$ time pylint -j12 pylint/*

Pylint output

$ time pylint -j0 pylint/*
30 sec

$ time pylint -j1 pylint/*
30 sec

$ time pylint -j12 pylint/*
8 sec

Expected behavior

$ time pylint -j0 pylint/* 8 sec

$ time pylint -j1 pylint/* 30 sec

$ time pylint -j12 pylint/* 8 sec

Pylint version

pylint 2.14.5
astroid 2.11.7
Python 3.8.10

OS / Environment

Ubuntu 20.04 x86_64

Additional dependencies

No response

dsilakov avatar Aug 22 '22 15:08 dsilakov

@dsilakov I must say I'm no expert in containers and CPU's so I'll likely need some help fixing this.

Based on discussion in https://bugs.python.org/issue36054 and the implementation of CPU counting in JVM in https://github.com/openjdk/jdk/blob/d5686b87f31d6c57ec6b3e5e9c85a04209dbac7a/src/hotspot/os/linux/os_linux.cpp#L5304-L5336 https://github.com/openjdk/jdk/blob/2d5137e403e16b694800b2ffe18c3640396b757e/src/hotspot/os/linux/osContainer_linux.cpp#L517-L591

Am I correct in thinking that we incorrectly determine whether cgroups is activated? The third step in the JVM process is what we mimic with _cpu_count but I guess in your case we shouldn't even end up there?

DanielNoord avatar Aug 23 '22 07:08 DanielNoord

Related to #6098 and https://github.com/PyCQA/pylint/pull/6903

Pierre-Sassoulas avatar Aug 23 '22 12:08 Pierre-Sassoulas

To be more precise - in JVM, the magic you are referencing to happens inside the OSContainer class and JVM seems to have some trickier logic when detecting if it is running "containerized" (under cgroup control) or not.

Having /sys/fs/cgroup/cpu/cpu.shares and other cpu.shares for particular processes on the host system is completely legit - for example, systemd uses them to limit CPU resources. However I don't see a generic way to transform these "shares" to real cores since this depends on how many other processes do you have. Maybe division by 1024 indeed works inside AWS, but this still looks like some kind of heuristic.

A good explanation is provided for example in https://www.batey.info/cgroup-cpu-shares-for-docker.html:

The cpu_share has no meaning in isolation. For example, setting a cpu_share of 512 gives you no information about how much CPU time the container will get. Setting one container’s cpu_share to 512 and another container’s to 1024 means that the second container will get double the amount of CPU time as the first. That said, it still gives no information about the actual amount of CPU time each container will get, only the relative amount between them.

dsilakov avatar Aug 23 '22 12:08 dsilakov

Heh, after a closer look, JVM also contains a set of hacks for treating cpu.shares equal to 1024:

https://github.com/openjdk/jdk/blob/master/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L278

https://github.com/openjdk/jdk/blob/master/src/java.base/linux/classes/jdk/internal/platform/cgroupv1/CgroupV1Subsystem.java#L247

Honestly I don't know what should be a proper fix for pylint here. For our scenarios, I've just commented out "elif os.path.isfile('/sys/fs/cgroup/cpu/cpu.shares')" part. In our K8s cluster and dockers we do see /sys/fs/cgroup/cpu/cpu.cfs_quota_us so the code still works fine inside the containers. But I can't check AWS for which the problematic elif was intended for.

dsilakov avatar Aug 23 '22 12:08 dsilakov

Yeah, I'm not an user of AWS myself but I'm wondering if there is some other way we can use to detect if we are in an AWS environment instead of what we currently use in the elif.

DanielNoord avatar Aug 23 '22 12:08 DanielNoord

We had a duplicate opened in #9537 and the op said pytest does things correctly. So here's how pytest-xdist does it:

https://github.com/pytest-dev/pytest-xdist/blob/470bc83a253d6a05d2f282e981c61e438dbd0302/src/xdist/plugin.py#L14-L52

Pierre-Sassoulas avatar Apr 04 '24 20:04 Pierre-Sassoulas

We had a duplicate opened in #9537 and the op said pytest does things correctly. So here's how pytest-xdist does it:

https://github.com/pytest-dev/pytest-xdist/blob/470bc83a253d6a05d2f282e981c61e438dbd0302/src/xdist/plugin.py#L14-L52

That solution is still not perfect, if you run it inside docker container than it won't take container limitations into account.

E.g., inside a container launched via docker --cpus 2 ... all functions used in that patch (psutil.cpu_count(), len(sched_getaffinity(0)), multiprocess.cpu_count() will all return the actual number of CPU cores on the host, not 2. So this brings us to the original issue - it was an attempt to fix issue with container limitations that break -j0.

dsilakov avatar Apr 05 '24 20:04 dsilakov