jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected

Open jankratochvil opened this issue 1 year ago • 30 comments

The testcase requires root permissions.

Designed by Severin Gehwolf, implemented by Jan Kratochvil.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [ ] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17198/head:pull/17198
$ git checkout pull/17198

Update a local copy of the PR:
$ git checkout pull/17198
$ git pull https://git.openjdk.org/jdk.git pull/17198/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17198

View PR using the GUI difftool:
$ git pr show -t 17198

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17198.diff

Webrev

Link to Webrev Comment

jankratochvil avatar Dec 28 '23 12:12 jankratochvil

:wave: Welcome back jkratochvil! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Dec 28 '23 12:12 bridgekeeper[bot]

@jankratochvil The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Dec 28 '23 12:12 openjdk[bot]

ping, are there some concerns during review of this patch? If you don't like too much some part just point at it, I will try to improve it. Thanks.

jankratochvil avatar Jan 11 '24 09:01 jankratochvil

I'm looking at it. Review should be ready this or early next week.

jerboaa avatar Jan 11 '24 14:01 jerboaa

@jankratochvil this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout master-cgroup
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Jan 31 '24 15:01 openjdk[bot]

Walking the directory hierarchy (and interface files) on every call of physical_memory() (or active_processor_count() for CPU) seems concerning.

I have therefore implemented a hierarchical memory.stat extension for cgroup2 which already exists for cgroup1 - if the patch is agreed upon I will submit it to Linux kernel: cgroup2-hierarchical_memory_limit.patch.txt

Since it seems unlikely for cgroup interface file paths changing during runtime, walking the hierarchy for every look-up seems overkill. Note that the prime users of this feature are in-container use, where most runtimes have the limit settings at the leaf.

While I understand the most common change of the limits is at the leaf technically any parent group limit can change. Which is also what this patch is implementing. For the performance issue I have implemented the Linux kernel extension above.

N files per metric to look at where it used to be one file (i.e. constant).

The problem is any of the files can change. But to fix the performance I have implemented the Linux kernel patch above.

I wonder if it would be sufficient to "lock" into the cgroup path where the lowest limits are set in the hierarchy and only use the interface files at that path of a specific controller.

That mostly disables the functionality of this patch.

* There seems to be an inconsistency between cgroups v1 (no recursive look-up) and cgroups v2  (recursive look-up). Why? I think we should employ the same semantics to both  versions (and drop the hierarchical work-around - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg v1).

I did not much expect anyone still uses cgroup1. Plus customer was also interested in cgroup2. AFAIK a last OS using cgroup1 was RHEL-7 which will have EOL in a few months. But then I tried to implement cgroup1 but there it sort of already worked thanks for your JDK-8217338. I just fixed there to prefer the hierarchical limit over the leaf limit (and the testcase does test it now) as that is what is used by Linux kernel.

And then my Linux kernel patch implements the same memory.stat way even for cgroup2.

* There also seems to be an inconsistency between metrics being iterated. `memory.max` and `memory.swap.max`  and `memory.swap.current` are being iterated, others, like CPU quotas (`cpu.max` or
  `cpu.weight`) not. Why? Both limits could potentially be one path up the hierarchy, meaning that cpu limits imposed that way wouldn't be detected.

That is a correct comment. Customer asked only for memory. I can implement cpu but I would like to find out the approved approach first before implementing the same also for cpu.

* What testing have you done on this? Did this run through cgroups v1 and cgroups v2  testing? I see failures in `jdk/internal/platform/cgroup` tests (compilation fails).

Thanks for the catch, I have fixed the failure now. I rely on GHA (GitHub Actions) as running the testsuite on my local machine takes 24 hours with many fuzzy results so it is very time consuming to reliably test a patch.

jankratochvil avatar Jan 31 '24 15:01 jankratochvil

@jankratochvil The "use hierarchical limit" was a work-around that bites us now. So instead we should attempt to unify cgv1 and cgv2 by walking the hierarchy and find the place in the hierarchy where the interface files - with actual limits imposed - are to be found. Since this patch attempts to solve a similar goal to what the old work-around solved for cg v1, we should re-think how to solve it properly. My suggestion would be to go with walking the hierarchy for both: cg v1 and cg v2.

As to the walking the hierarchy on every invocation problem, I don't think fixing it by relying on a kernel patch is the way to go. It'll take time for the ecosystem to pick those up and existing cg v2 systems won't be fixed (consider RHEL 8 or RHEL 9 systems and their derivatives). Therefore, the idea would be to do the iteration once when the OSContainer code initializes. Specifically, in OSContainer::init() or CgroupSubsystemFactory::create(). There is the off-chance that somebody "migrating" a JVM process from one cgroup hierarchy to another, but I'm not sure this is something we need to support. Restarting the JVM seems OK. After all, the GC and other subsystems aren't really ready for dynamic updates anyway. We will address that problem when it actually becomes one.

Once CgroupSubsystemFactory::create() is done, the individual controllers would return the correct path to the interface file by asking CgroupController::subsystem_path() where it is. This also has a nice symmetry with memory vs. cpu and other controllers. All of them could have their own subsystem_path().

This would shift the iterating-over-paths to init time and would then keep them fixed, addressing the performance concern.

Overall this would reduce complexity IMO and would also fit well with a solution I'm working on for JDK-8261242 which we need to solve for other reasons.

jerboaa avatar Feb 06 '24 11:02 jerboaa

@jankratochvil The "use hierarchical limit" was a work-around that bites us now.

What is the current problem? I see one problem of the cgroup1 hierarchical implementation in OpenJDK which is fixed by this patch incl. a testcase.

So instead we should attempt to unify cgv1 and cgv2 by walking the hierarchy

When you are concerned about the performance I have rather implemented the hierarchical interface also for Linux kernel. https://lore.kernel.org/linux-mm/[email protected]/T/ It has been welcomed by SuSE: https://lore.kernel.org/linux-mm/[email protected]/T/#m1238300cf818badebed0183f1b5ab798bf1f2e9f

and find the place in the hierarchy where the interface files - with actual limits imposed - are to be found.

That is being done if the patch above is not found on the running kernel.

Since this patch attempts to solve a similar goal to what the old work-around solved for cg v1, we should re-think how to solve it properly. My suggestion would be to go with walking the hierarchy for both: cg v1 and cg v2.

I hope I have fixed the same problem in a more performance wise way.

As to the walking the hierarchy on every invocation problem, I don't think fixing it by relying on a kernel patch is the way to go.

Why not to support both ways. Time flies like wind.

It'll take time for the ecosystem to pick those up and existing cg v2 systems won't be fixed (consider RHEL 8 or RHEL 9 systems and their derivatives).

Particuarly for RHEL it depends whether any of the paying customers is interested in the performance wise solution (I doubt so as the performance difference is small). Backport for a hotfix delivery for the specific customer will be done in a few days. Included in the next RHEL y-release it will be in a few months (and also released then in CentOS).

Therefore, the idea would be to do the iteration once when the OSContainer code initializes.

Done that in this patch. I think it could also occasionally re-read the hierarchical structure but that hasn't been implemented.

I expect I would remove the "memory.max.effective" dependency from this patch for the OpenJDK commit as the kernel patch has not been approved yet and so the kernel interface ("memory.max.effective") still may change.

jankratochvil avatar Feb 20 '24 15:02 jankratochvil

After internal discussion I have realized the patch has overgrown its intended scope. And one should also consider how easy it would be for a backport down to JDK-8. I am going to split it into:

  1. cgroup1 bugfix to always use kernel-side computed minimum leaf/memory.stat even if leaf/memory.max is not max - present in the current patch - to be split out under a new JDK Bug ID
  2. cgroup1 change from reading leaf/memory.stat to traversal of all-depths/memory.max. - not present in the current patch, you have requested it, its only advantage is it does not reset some of the stats. Is it really needed? Also because cgroup1 is becoming outdated, RHEL7 ends this summer.
  3. cgroup2 bugfix to implement traversal of all-depths/memory.max - present in the current patch - to be tracked under this JDK Bug ID.
  4. cgroup2 kernel-side computed minimum in leaf/memory.max.effective - present in the current patch but not planned to but checked into OpenJDK before the kernel patch gets accepted.

Do you agree?

jankratochvil avatar Feb 21 '24 11:02 jankratochvil

After internal discussion I have realized the patch has overgrown its intended scope. And one should also consider how easy it would be for a backport down to JDK-8. I am going to split it into:

1. cgroup1 bugfix to always use kernel-side computed minimum `leaf/memory.stat` even if `leaf/memory.max` is not `max` - present in the current patch - to be split out under a new JDK Bug ID

2. cgroup1 change from reading `leaf/memory.stat` to traversal of `all-depths/memory.max`. - not present in the current patch, you have requested it, its only advantage is it does not reset some of the stats. Is it really needed? Also because cgroup1 is becoming outdated, RHEL7 ends this summer.

3. cgroup2 bugfix to implement traversal of `all-depths/memory.max` - present in the current patch - to be tracked under this JDK Bug ID.

4. cgroup2 kernel-side computed minimum in `leaf/memory.max.effective` - present in the current patch but not planned to but checked into OpenJDK before the kernel patch gets accepted.

Do you agree?

I won't have time to properly review this this week. The latest version seemed to go into the right direction, though. Point 4.) above can only go in after any released kernel supports it to begin with.

One goal of this patch would be to unify how this works for cgroup v1 and cgroup v2. The on-init traversal for both versions makes sense as it solves the same problem (systemd slice) without using the hierarchical limit work-around (covers point 2 and 3). Point 1.) seems orthogonal to this patch, but I'm not sure I understand what it's about to begin with...

jerboaa avatar Feb 21 '24 13:02 jerboaa

One goal of this patch would be to unify how this works for cgroup v1 and cgroup v2.

That is not much possible anyway as currently cgroup2 has to traverse the directories while after the kernel patch gets accepted it will be faster using the new kernel interface.

Point 1.) seems orthogonal to this patch, but I'm not sure I understand what it's about to begin with...

With current JDK if you set limits 1GB for memory:foo, 2GB for memory:foo/bar and you run a process in memory:foo/bar JDK will think the limit is 2GB despite the kernel will enforce a limit of 1GB.

jankratochvil avatar Feb 21 '24 13:02 jankratochvil

One goal of this patch would be to unify how this works for cgroup v1 and cgroup v2.

That is not much possible anyway as currently cgroup2 has to traverse the directories while after the kernel patch gets accepted it will be faster using the new kernel interface.

I think it should be using the "on-init traversal approach" for both.

Point 1.) seems orthogonal to this patch, but I'm not sure I understand what it's about to begin with...

With current JDK if you set limits 1GB for memory:foo, 2GB for memory:foo/bar and you run a process in memory:foo/bar JDK will think the limit is 2GB despite the kernel will enforce a limit of 1GB.

I'm surprised that the hierarchical memory limit look-up wouldn't see it for v1. Either way, using the traversal approach you'd see that foo/bar has 2 GB, when traversing up the hierarchy you'd see 1 GB for foo and use that as it's less than the earlier value. So would be covered by this fix? What am I missing?

jerboaa avatar Feb 21 '24 14:02 jerboaa

I'm surprised that the hierarchical memory limit look-up wouldn't see it for v1. Either way, using the traversal approach you'd see that foo/bar has 2 GB, when traversing up the hierarchy you'd see 1 GB for foo and use that as it's less than the earlier value. So would be covered by this fix? What am I missing?

It will be needlessly slower and it will not notice during execution changes of limits on other levels of the cgroup hierarchy than the initial chosen one.

jankratochvil avatar Feb 21 '24 14:02 jankratochvil

⚠️ @jankratochvil This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar Mar 10 '24 08:03 openjdk[bot]

I have tried a backport to JDK-8 and it looks backportable. jdk8-cgroup.patch.txt

jankratochvil avatar Mar 10 '24 12:03 jankratochvil

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Mar 13 '24 20:03 openjdk[bot]

@jankratochvil This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Apr 11 '24 01:04 bridgekeeper[bot]

ping, it is already 6+ weeks since last change

jankratochvil avatar Apr 25 '24 03:04 jankratochvil

Thanks for the updates. I like that we have consistency between cgv1 and cgv2 in the latest version in terms of hierarchical limit. What would be even better is to get consistency between CPU and memory lookup (if the restriction is enforced higher up the hierarchy). That is, it would be ideal to make initialize_hierarchy() controller specific.

Meanwhile I've been working on some refactoring which builds on top of JDK-8302744 so as to make the code a bit nicer once this integrates. Then, the idea would be to use scratch controllers (CgroupCpuController and CgroupMemoryController) to determine whether or not there is a limit and figure out the actual path on a per-controller specific way - (use CgroupMemoryController->read_memory_limit_in_bytes(phys_mem) and CgroupUtil::processor_count(CgroupCpuController* cpu_ctrl, int host_cpus) in the process). Does that make sense?

A few other observations:

  • The common case is when the JVM runs in a container. Then, the cgroup path is / on cgv2 and the and root_mount == cgroup_path on cgv1. We don't need to do the extra processing on those systems as the limit will be at the leaf.
  • The (fairly) uncommon case is the host case where the cgroup limit is applied elsewhere (e.g. systemd slice). This is where we'd need the hierarchy walk.
  • When we need to walk the hierarchy, we start at the longest path and only traverse if there is NO limit. A system which sets a higher, limit (that isn't max), seems ill-defined and I've not come across one. As soon as we've found a lower value than unlimited (-1), we stop. Since cg2 is hierarchical, the lowest limit will affect the entire tree (corollary: higher values further down from that point won't have an effect):
    /a/b --> memory.max 300
       `- /c --> memory.max max (this wouldn't have any effect, therefore can be ignored).
    

jerboaa avatar Apr 25 '24 13:04 jerboaa

  • Will the patch be accepted only for memory or it has to support also CPU?
  • Should I code it on top of OpenJDK trunk or on top of https://github.com/jerboaa/jdk/commit/92aaa6fd7e3ff8b64de064fecfcd725a157cb5bb ?

jankratochvil avatar Apr 30 '24 07:04 jankratochvil

  • Will the patch be accepted only for memory or it has to support also CPU?

It should be fine for memory only for a start, but we should allow for on-boarding of other controllers as well.

Since this patch solves a similar problem than https://bugs.openjdk.java.net/browse/JDK-8217338 did at the time when only cg v1 was supported I'd feel more comfortable in paying the technical debt and recode it so that cg v1 and cg v2 behaves the same or at least similar. That should make the code cleaner, easier to test and maintain in the long run. So I'm thinking on top of the refactoring. I'll try to prioritize those work items accordingly. Does that sound OK?

jerboaa avatar Apr 30 '24 12:04 jerboaa

Should I rebase it on top of the commit https://github.com/jerboaa/jdk/commit/92aaa6fd7e3ff8b64de064fecfcd725a157cb5bb or wait until you finish it?

jankratochvil avatar Apr 30 '24 12:04 jankratochvil

Should I rebase it on top of the commit jerboaa@92aaa6f or wait until you finish it?

Up to you. I'll keep you posted once the PRs are out.

jerboaa avatar Apr 30 '24 13:04 jerboaa

Should JDK still support memory.use_hierarchy == 0? That has been already removed from Linux kernel: https://github.com/torvalds/linux/commit/bef8620cd8e0a117c1a0719604052e424eb418f9 This patch is apparently present even in kernel-3.10.0-1160.118.1.el7.x86_64 (CentOS-7.9 updates) It is not present in kernel-3.10.0-1160.el7.x86_64 (CentOS-7.9 release) Still CentOS-7 is almost EOLed, is there any other distro for cgroup1?

jankratochvil avatar May 07 '24 09:05 jankratochvil

Should JDK still support memory.use_hierarchy == 0?

IMO, no. Just get rid of it and assume hierarchical everywhere. We'd be walking the hierarchy for other (lower limits), which should cover this case on those legacy systems.

jerboaa avatar May 07 '24 09:05 jerboaa

Your patch https://github.com/jerboaa/jdk/commit/92aaa6fd7e3ff8b64de064fecfcd725a157cb5bb#diff-1c49a6b40a810aef82b7da9bfea8f03e07a43062977ba65f75df63c4b7c5b149R89 contains a tab.

jankratochvil avatar May 08 '24 03:05 jankratochvil

@jankratochvil This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jun 19 '24 18:06 bridgekeeper[bot]

Keep open.

jerboaa avatar Jun 19 '24 19:06 jerboaa

@jankratochvil Please merge master and try to use the new code from https://bugs.openjdk.org/browse/JDK-8331560 to query the limits. Thanks!

jerboaa avatar Jul 02 '24 15:07 jerboaa