jdk 8292083: Detected container memory limit may exceed physical machine memory

We discovered some systems configured with cgroups v1 which report a bogus container memory limit value which is above the physical memory of the host. OpenJDK then calculates flags such as InitialHeapSize based on this invalid value; this can be larger than the available memory which can result in the OS terminating the process due to OOM.

hotspot's container awareness attempts to sanity check the limit value by ensuring it's below _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), but that still leaves a large range of potential invalid values between physical RAM and that ceiling value.

Cgroups V1 in particular returns an uninitialised value for the memory limit when one has not been explicitly set. Cgroups v2 does not suffer the same problem: however, it's possible for any value to be set for the max memory, including values exceeding the available physical memory, in either v1 or v2.

This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans. I'd also look again at whether the existing ceiling value _unlimited_memory serves any useful purpose. I personally don't feel those improvements should hold up this fix.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8292083: Detected container memory limit may exceed physical machine memory

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9880/head:pull/9880
$ git checkout pull/9880

Update a local copy of the PR:
$ git checkout pull/9880
$ git pull https://git.openjdk.org/jdk pull/9880/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9880

View PR using the GUI difftool:
$ git pr show -t 9880

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9880.diff

Aug 15 '22 14:08 jmtd

:wave: Welcome back jdowland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Aug 15 '22 14:08 bridgekeeper[bot]

@jmtd The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Aug 15 '22 14:08 openjdk[bot]

Webrevs

Aug 15 '22 15:08 mlbridge[bot]

Thanks for the review! I agree with all your points.

Aug 17 '22 09:08 jmtd

This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

Hmm I see what you mean, yes. I liked checking a Flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.

Aug 17 '22 09:08 jmtd

This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans.

I've filed https://bugs.openjdk.org/browse/JDK-8292541 for the Java side.

Aug 17 '22 09:08 jerboaa

This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

Hmm I see what you mean, yes. I liked checking a Flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.

Hence my suggestion to check what os::physical_memory() returns via WhiteBox. Many GC heap settings are derived from physical memory. So checking that it never exceeds actual physical host memory should be sufficient.

Aug 17 '22 09:08 jerboaa

@jmtd Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.

Aug 17 '22 13:08 openjdk-notifier[bot]

It appears os::Linux::available_memory() would need similar treatment, still.

Aug 18 '22 18:08 jerboaa

It appears os::Linux::available_memory() would need similar treatment, still.

I've pushed a simple fix for this as caa7913c8a1, but unfortunately this makes the logging noisiness far worse.

Deeper in the call stack of the cgroups-specific code there is some caching to avoid re-reading and parsing the cgroups filesystem too frequently. I wonder if it would be better to have the sanity checking logic behind the caching. Since I think it's better to have the sanity-checking logic at the level of os::physical_memory, I might see whether the caching can be brought up to that level.

Aug 19 '22 16:08 jmtd

@jmtd this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout 8292083-cgroups-badmaxmem
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

Aug 22 '22 14:08 openjdk[bot]

That last fix (6086505) has resolved the increase in logging output too.

Aug 23 '22 09:08 jmtd

@jmtd This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8292083: Detected container memory limit may exceed physical machine memory

Reviewed-by: sgehwolf, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 59 new commits pushed to the master branch:

f91943c19fc0b060684a437d2c768461d54c088e: 8292868: Explicitly pass a third temp register to MacroAssembler::store_heap_oop for aarch64
6354a57b5cb85d31ea70a998202470467402b669: 8290711: assert(false) failed: infinite loop in PhaseIterGVN::optimize
3e187730162965981a5e6d238935e46d1015708e: 8292880: Improve debuggee logging for com/sun/jdi/ClassUnloadEventTest.java
909e1edb188ead748bd452067ca06d6e91aee4c6: 8292919: Build failure due to UseJVMCICompiler was not declared when C2 is disabled after JDK-8292691
55f5a83b88d7259bf7965ff12abd8dff4f79315f: 8282410: Remove SA ProcDebugger support
d83faeaf9ab3584de6af23de16aad3738d179c86: 8292250: Create test for co-located JDI MethodEntry, Step, and Breakpoint events
e353b572a54edbbf0df1f01afa36067500157603: 8292890: Remove PrintTouchedMethodsAtExit and LogTouchedMethods
95a33fe1502b6f0db2c60fa92b389fda74d94407: 8292314: Cleanup legacy address handling
5d799d80e638b85fa3881904e7330ffb5100764a: 8292304: [REDO] JDK-8289208 Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS
4f50316a1a985cd06af7eed158d7e1917b86d159: 8292680: Convert Dictionary to ConcurrentHashTable
... and 49 more: https://git.openjdk.org/jdk/compare/e561933907bbab0a42f1796fa12f582b3a347312...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jerboaa, @iklam, @tstuefe) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Aug 24 '22 09:08 openjdk[bot]

Looks good to me. Please get a second review for this, though, if you can.

Thanks @jerboaa much appreciated.

@tstuefe, you provided some valuable feedback on this PR in an earlier incarnation. Would you be prepared to take another look?

Aug 24 '22 10:08 jmtd

GHA reports that 1/9 tests in linux-x86 / test (hs/tier1 gc) failed. Digging into the logs, this was gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-32, which seems unrelated to these changes. I've just ran this one locally and it passed for me.

Aug 24 '22 15:08 jmtd

Hmm, I get uncomfortable if APIs mix layers and try to be smart.

As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.

In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.

That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.

Aug 25 '22 16:08 tstuefe

Hmm, I get uncomfortable if APIs mix layers and try to be smart.

As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.

In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.

All fair points, but this patch doesn't change the status quo. For example, CgroupSubsystem::active_processor_count() uses os::Linux::active_processor_count() to bound the cores reported to the JVM above by the physical cores (or virtualized cores in a VM). Similarly, CgroupV1Subsystem uses _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(); Again, OS code. The latter, I believe, since day one of the container support code.

I guess it would be conceivable to move the "arbitration" to OSContainer dealing with limits exceeding physical memory. It could either pass upper bounds down to the CG layer or let it do it's thing first and do the sanity checking after. Would something like that be more amenable?

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.

That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.

OK.

Aug 25 '22 17:08 jerboaa

So, before we had:

os::physical_memory()  
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()

Made sense if one thinks of the os layer as the arbiter that has to lie to the caller to abstract aways platform stuff, and the Linux layer as the real truth source.

Similar for available memory:

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host available memory

Already a bit crooked, since here the splicing is done at the ::Linux layer.

With this patch we have:

os::physical_memory()  
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

and

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host physical memory
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

This is getting a bit hard to understand. The only one behaving as advertised is os::Linux::physical_memory(), everyone else is somehow splicing other information in. Of course a lot of this is preexisting and has nothing to do with this patch. But it deepens the confusion.

One thing I don't understand is, why does this calculation have to be done at every invocation to os::available_memory()/os::physical_memory()/OSContainer::memory_limit_in_bytes() etc? Yes, cgroup limits can change, but AFAIK there is nothing in the VM that can react to these changes anyway. Heap geometry etc. gets sorted out at VM start.

So why do we do not just do this:

  // We need to update the amount of physical memory now that
  // cgroup subsystem files have been processed.
-  if ((mem_limit = cgroup_subsystem->memory_limit_in_bytes()) > 0) {
+  size_t mem_limit = cgroup_subsystem->memory_limit_in_bytes();
+  if (mem_limit > 0 && mem_limit < os::Linux::physical_limit()) {
    os::Linux::set_physical_memory(mem_limit);
    log_info(os, container)("Memory Limit is: " JLONG_FORMAT, mem_limit);
  }

and be done with it?

Aug 25 '22 17:08 tstuefe

The gist of this patch is code like this:
jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }
... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.

Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?

Aug 25 '22 17:08 tstuefe

I think the gist of my remark is that I would like the layers to behave consistently.

I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

let os::Linux::available_memory() and os::Linux::physical_memory() return the pure host values
let the cgroup system return the pure cgroup values
let os::available_memory() and os::physical_memory() return either one or the other depending on what makes sense.

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

Aug 25 '22 17:08 tstuefe

The gist of this patch is code like this:
jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }
... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.
Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?

Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).

Aug 25 '22 17:08 jerboaa

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).

Oh, ok. Fair enough. Then my only question is at what layer we want the heuristic to happen.

Aug 25 '22 17:08 tstuefe

I think the gist of my remark is that I would like the layers to behave consistently.

I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

You mean OSContainer::memory_limit_in_bytes() right?

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

There is also this OSContainer hybrid ;-)

* let `os::Linux::available_memory()` and `os::Linux::physical_memory()` return the pure host values
* let the cgroup system return the pure cgroup values
* let `os::available_memory()` and `os::physical_memory()` return either one or the other depending on what makes sense.

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sounds reasonable to me.

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

+1

Aug 25 '22 17:08 jerboaa

I think the gist of my remark is that I would like the layers to behave consistently. I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

You mean OSContainer::memory_limit_in_bytes() right?

Right.

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

There is also this OSContainer hybrid ;-)

Yes, that is what I meant with "CgroupSystem". Sorry, I collapsed that with its CGV1/V2 implementations for less confusion :)

Aug 25 '22 18:08 tstuefe

Hi @tstuefe ,

Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.

The structuring you lay out in this comment sounds good to me, with some caveats:

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, if the values we are reading exceed physical memory, because that's used to decide whether to check for a "heirarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.

@jerboaa :

There is also this OSContainer hybrid ;-)

I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.

@tstuefe:

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.

Aug 26 '22 09:08 jmtd

Hi @tstuefe ,

Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.

The structuring you lay out in this comment sounds good to me, with some caveats:

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, if the values we are reading exceed physical memory, because that's used to decide whether to check for a "heirarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.

Not necessarily. You could hand in an upper reasonable bound as argument.

But I'd be happy if we can get at least to a consistent layering like this:

os::Linux::get_xxx_memory -> the bottom layer, looks at host values
Cgroup subsystem -> middle layer, has access to os::Linux::get_xxx_memory
os::get_xxxx_memory() -> top, uses both Cgroup subsystem and os::Linux::get_xxx_memory.

@jerboaa :

There is also this OSContainer hybrid ;-)

I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.

+1

@tstuefe:

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.

Okay, lets ship this.

Aug 26 '22 10:08 tstuefe

/integrate

Aug 26 '22 11:08 jmtd

@jmtd Your change (at version af509e152353d40d40bd2107454bc87f77c2e775) is now ready to be sponsored by a Committer.

Aug 26 '22 11:08 openjdk[bot]

/sponsor

Aug 26 '22 12:08 jerboaa

@jmtd Please file an RFE ticket for the proposed API cleanup. Thanks!

Aug 26 '22 12:08 jerboaa

jdk jdk copied to clipboard

8292083: Detected container memory limit may exceed physical machine memory

Progress

Issue

Reviewing

Webrevs

jdk
jdk copied to clipboard