8292083: Detected container memory limit may exceed physical machine memory
We discovered some systems configured with cgroups v1 that report a bogus container memory limit value above the physical memory of the host. OpenJDK then calculates flags such as InitialHeapSize based on this invalid value; the result can be larger than the available memory, which can lead to the OS terminating the process due to OOM.
HotSpot's container awareness attempts to sanity-check the limit value by ensuring it's below _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), but that still leaves a large range of potentially invalid values between physical RAM and that ceiling.
Cgroups v1 in particular returns an uninitialised value for the memory limit when one has not been explicitly set. Cgroups v2 does not suffer the same problem; however, in either v1 or v2 it is possible to set any value for the maximum memory, including values exceeding the available physical memory.
This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans. I'd also look again at whether the existing ceiling value _unlimited_memory serves any useful purpose. I personally don't feel those improvements should hold up this fix.
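For illustration, the shape of the fix amounts to clamping whatever limit the container layer reports against the host's physical memory before it feeds into ergonomics. A minimal sketch, with hypothetical names and simplified types (not the actual HotSpot code):

// Sketch only: clamp a detected container memory limit to host physical
// memory so that bogus values (e.g. an uninitialised cgroup v1 limit)
// never drive flag ergonomics such as InitialHeapSize.
static jlong sane_container_memory_limit(jlong cgroup_limit, jlong host_phys_mem) {
  if (cgroup_limit <= 0 || cgroup_limit > host_phys_mem) {
    // Unset, unlimited or bogus: fall back to the host value.
    return host_phys_mem;
  }
  return cgroup_limit;
}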
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8292083: Detected container memory limit may exceed physical machine memory
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9880/head:pull/9880
$ git checkout pull/9880
Update a local copy of the PR:
$ git checkout pull/9880
$ git pull https://git.openjdk.org/jdk pull/9880/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9880
View PR using the GUI difftool:
$ git pr show -t 9880
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9880.diff
:wave: Welcome back jdowland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@jmtd The following label will be automatically applied to this pull request:
- hotspot-runtime
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 18: Full - Incremental (af509e15)
- 17: Full - Incremental (bd24938d)
- 16: Full - Incremental (ac646931)
- 15: Full - Incremental (d13fae18)
- 14: Full - Incremental (40930a81)
- 13: Full - Incremental (bc493c20)
- 12: Full - Incremental (60865050)
- 11: Full - Incremental (4984ddd9)
- 10: Full - Incremental (9de831d5)
- 09: Full (a88bb620)
- 08: Full - Incremental (66bb149d)
- 07: Full - Incremental (caa7913c)
- 06: Full - Incremental (7e64194b)
- 05: Full - Incremental (ff57cf41)
- 04: Full - Incremental (fc2ae1b9)
- 03: Full - Incremental (98664643)
- 02: Full - Incremental (7f5307d6)
- 01: Full - Incremental (8d7e80c6)
- 00: Full (2de864c8)
Thanks for the review! I agree with all your points.
This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

Hmm I see what you mean, yes. I liked checking a flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.
> This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans.
I've filed https://bugs.openjdk.org/browse/JDK-8292541 for the Java side.
> This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.
>
> Hmm I see what you mean, yes. I liked checking a flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.

Hence my suggestion to check what os::physical_memory() returns via WhiteBox. Many GC heap settings are derived from physical memory. So checking that it never exceeds actual physical host memory should be sufficient.
@jmtd Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.
It appears os::Linux::available_memory() would need similar treatment, still.
> It appears os::Linux::available_memory() would need similar treatment, still.

I've pushed a simple fix for this as caa7913c8a1, but unfortunately this makes the logging noisiness far worse.

Deeper in the call stack of the cgroups-specific code there is some caching to avoid re-reading and parsing the cgroups filesystem too frequently. I wonder if it would be better to have the sanity-checking logic behind the caching. Since I think it's better to have the sanity-checking logic at the level of os::physical_memory(), I might see whether the caching can be brought up to that level.
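For context, a rough sketch of the kind of short-lived caching being referred to, with illustrative names and refresh interval only (the real cgroup code differs):

// Illustrative sketch: keep the last sanity-checked limit for a short
// interval so that frequent calls (e.g. from os::physical_memory()) do
// not re-read and re-parse the cgroup filesystem every time.
class CachedLimit {
  jlong _value;
  jlong _timestamp_ms;   // when _value was last refreshed; 0 = never
 public:
  CachedLimit() : _value(-1), _timestamp_ms(0) {}
  jlong get(jlong now_ms, jlong (*read_limit)()) {
    const jlong max_age_ms = 20;   // illustrative refresh interval
    if (_timestamp_ms == 0 || now_ms - _timestamp_ms > max_age_ms) {
      _value = read_limit();       // re-read and sanity-check the limit
      _timestamp_ms = now_ms;
    }
    return _value;
  }
};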
@jmtd this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout 8292083-cgroups-badmaxmem
git fetch https://git.openjdk.org/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
That last fix (6086505) has resolved the increase in logging output too.
@jmtd This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8292083: Detected container memory limit may exceed physical machine memory
Reviewed-by: sgehwolf, stuefe
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 59 new commits pushed to the master branch:
- f91943c19fc0b060684a437d2c768461d54c088e: 8292868: Explicitly pass a third temp register to MacroAssembler::store_heap_oop for aarch64
- 6354a57b5cb85d31ea70a998202470467402b669: 8290711: assert(false) failed: infinite loop in PhaseIterGVN::optimize
- 3e187730162965981a5e6d238935e46d1015708e: 8292880: Improve debuggee logging for com/sun/jdi/ClassUnloadEventTest.java
- 909e1edb188ead748bd452067ca06d6e91aee4c6: 8292919: Build failure due to UseJVMCICompiler was not declared when C2 is disabled after JDK-8292691
- 55f5a83b88d7259bf7965ff12abd8dff4f79315f: 8282410: Remove SA ProcDebugger support
- d83faeaf9ab3584de6af23de16aad3738d179c86: 8292250: Create test for co-located JDI MethodEntry, Step, and Breakpoint events
- e353b572a54edbbf0df1f01afa36067500157603: 8292890: Remove PrintTouchedMethodsAtExit and LogTouchedMethods
- 95a33fe1502b6f0db2c60fa92b389fda74d94407: 8292314: Cleanup legacy address handling
- 5d799d80e638b85fa3881904e7330ffb5100764a: 8292304: [REDO] JDK-8289208 Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS
- 4f50316a1a985cd06af7eed158d7e1917b86d159: 8292680: Convert Dictionary to ConcurrentHashTable
- ... and 49 more: https://git.openjdk.org/jdk/compare/e561933907bbab0a42f1796fa12f582b3a347312...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jerboaa, @iklam, @tstuefe) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
Looks good to me. Please get a second review for this, though, if you can.
Thanks @jerboaa, much appreciated.
@tstuefe, you provided some valuable feedback on this PR in an earlier incarnation. Would you be prepared to take another look?
GHA reports that 1/9 tests in linux-x86 / test (hs/tier1 gc) failed. Digging into the logs, this was gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-32, which seems unrelated to these changes. I've just run this one locally and it passed for me.
Hmm, I get uncomfortable if APIs mix layers and try to be smart.
As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.
In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.
That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.
> Hmm, I get uncomfortable if APIs mix layers and try to be smart.
> As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.
> In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.

All fair points, but this patch doesn't change the status quo. For example, CgroupSubsystem::active_processor_count() uses os::Linux::active_processor_count() to bound the cores reported to the JVM above by the physical cores (or virtualized cores in a VM). Similarly, CgroupV1Subsystem uses _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(); again, OS code. The latter, I believe, since day one of the container support code.

I guess it would be conceivable to move the "arbitration" to OSContainer, dealing with limits exceeding physical memory. It could either pass upper bounds down to the CG layer or let it do its thing first and do the sanity checking after. Would something like that be more amenable?
The gist of this patch is code like this:
jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
    ...
  } else {
    return (jlong)memlimit;
  }
... might return arbitrarily large values on some systems. Note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), which can be fairly large; we've observed systems where the cgroup interface files contain 92233720365056 whereas the _unlimited_memory value was 9223372036854771712. That was on a system with 8GB of physical memory.
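To make the numbers concrete, a small standalone illustration (values from the comment above; the 4096-byte page size is an assumption) of why the existing >= _unlimited_memory check lets the bogus value through:

#include <cstdint>
#include <cstdio>

int main() {
  const int64_t page_size          = 4096;                                 // assumed os::vm_page_size()
  const int64_t unlimited_memory   = (INT64_MAX / page_size) * page_size;  // 9223372036854771712
  const int64_t bogus_cgroup_limit = 92233720365056LL;                     // ~84TB read from memory.limit_in_bytes
  const int64_t host_phys_mem      = 8LL * 1024 * 1024 * 1024;             // 8GB host

  // The existing check only rejects values >= _unlimited_memory, so the
  // bogus limit is accepted even though it far exceeds physical memory.
  printf("accepted as a real limit: %s\n", bogus_cgroup_limit >= unlimited_memory ? "no" : "yes");
  printf("exceeds physical memory:  %s\n", bogus_cgroup_limit > host_phys_mem ? "yes" : "no");
  return 0;
}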
> That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.
OK.
So, before we had:
os::physical_memory()
   |-------------------------------------
   |                                    |
   v                                    v
OSContainer::memory_limit_in_bytes()   os::Linux::physical_memory()
Made sense if one thinks of the os layer as the arbiter that has to lie to the caller to abstract away platform stuff, and the Linux layer as the real truth source.
Similar for available memory:
os::available_memory()
   |
   v
os::Linux::available_memory()
   |
   |-------------------------------------
   |                                    |
   v                                    v
OSContainer::memory_limit_in_bytes()   or returns host available memory
Already a bit crooked, since here the splicing is done at the ::Linux layer.
With this patch we have:
os::physical_memory()
   |
   |-------------------------------------
   |                                    |
   v                                    v
OSContainer::memory_limit_in_bytes()   os::Linux::physical_memory()
   |
   |-------------------------------------
   |                                    |
   v                                    v
os::Linux::physical_memory()           or returns cgroup limit
and
os::available_memory()
   |
   v
os::Linux::available_memory()
   |
   |-------------------------------------
   |                                    |
   v                                    v
OSContainer::memory_limit_in_bytes()   or returns host physical memory
   |
   |-------------------------------------
   |                                    |
   v                                    v
os::Linux::physical_memory()           or returns cgroup limit
This is getting a bit hard to understand. The only one behaving as advertised is os::Linux::physical_memory(); everyone else is somehow splicing other information in. Of course a lot of this is preexisting and has nothing to do with this patch. But it deepens the confusion.
One thing I don't understand is, why does this calculation have to be done at every invocation to os::available_memory()/os::physical_memory()/OSContainer::memory_limit_in_bytes() etc? Yes, cgroup limits can change, but AFAIK there is nothing in the VM that can react to these changes anyway. Heap geometry etc. gets sorted out at VM start.
So why do we not just do this:
// We need to update the amount of physical memory now that
// cgroup subsystem files have been processed.
- if ((mem_limit = cgroup_subsystem->memory_limit_in_bytes()) > 0) {
+ size_t mem_limit = cgroup_subsystem->memory_limit_in_bytes();
+ if (mem_limit > 0 && mem_limit < os::Linux::physical_limit()) {
os::Linux::set_physical_memory(mem_limit);
log_info(os, container)("Memory Limit is: " JLONG_FORMAT, mem_limit);
}
and be done with it?
> The gist of this patch is code like this:
> jlong CgroupV1Subsystem::read_memory_limit_in_bytes() { GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes", "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit); if (memlimit >= _unlimited_memory) { ... } else { return (jlong)memlimit; }
> ... might return arbitrarily large values on some systems. Note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), which can be fairly large; we've observed systems where the cgroup interface files contain 92233720365056 whereas the _unlimited_memory value was 9223372036854771712. That was on a system with 8GB of physical memory.
Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?
I think the gist of my remark is that I would like the layers to behave consistently.
I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory().

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

- let os::Linux::available_memory() and os::Linux::physical_memory() return the pure host values
- let the cgroup system return the pure cgroup values
- let os::available_memory() and os::physical_memory() return either one or the other depending on what makes sense.
In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).
Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.
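A minimal sketch of the layering proposed in the list above, using hypothetical names rather than the actual HotSpot functions; the point is that only the top layer arbitrates:

// Sketch only (hypothetical names): bottom layers report plain facts,
// the top layer decides which value wins.
julong host_physical_memory();     // bottom: pure host value (e.g. sysinfo)
jlong  cgroup_memory_limit();      // middle: pure cgroup value, -1 if unset/invalid

julong physical_memory() {         // top: the only place that arbitrates
  jlong  limit = cgroup_memory_limit();
  julong host  = host_physical_memory();
  if (limit > 0 && (julong)limit < host) {
    return (julong)limit;          // a sane container limit applies
  }
  return host;                     // otherwise report the host value
}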
> The gist of this patch is code like this:
> jlong CgroupV1Subsystem::read_memory_limit_in_bytes() { GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes", "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit); if (memlimit >= _unlimited_memory) { ... } else { return (jlong)memlimit; }
> ... might return arbitrarily large values on some systems. Note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), which can be fairly large; we've observed systems where the cgroup interface files contain 92233720365056 whereas the _unlimited_memory value was 9223372036854771712. That was on a system with 8GB of physical memory.
>
> Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?

Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).
> The gist of this patch is code like this:
> jlong CgroupV1Subsystem::read_memory_limit_in_bytes() { GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes", "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit); if (memlimit >= _unlimited_memory) { ... } else { return (jlong)memlimit; }
>
> Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).
Oh, ok. Fair enough. Then my only question is at what layer we want the heuristic to happen.
> I think the gist of my remark is that I would like the layers to behave consistently.
> I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory().

You mean OSContainer::memory_limit_in_bytes(), right?
> I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

There is also this OSContainer hybrid ;-)
> * let `os::Linux::available_memory()` and `os::Linux::physical_memory()` return the pure host values
> * let the cgroup system return the pure cgroup values
> * let `os::available_memory()` and `os::physical_memory()` return either one or the other depending on what makes sense.
> In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sounds reasonable to me.

> Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.
+1
> I think the gist of my remark is that I would like the layers to behave consistently. I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory().
>
> You mean OSContainer::memory_limit_in_bytes(), right?
Right.
> I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:
>
> There is also this OSContainer hybrid ;-)
Yes, that is what I meant with "CgroupSystem". Sorry, I collapsed that with its CGV1/V2 implementations for less confusion :)
Hi @tstuefe,
Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.
The structuring you lay out in this comment sounds good to me, with some caveats:
> In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).
Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, whether the values we are reading exceed physical memory, because that's used to decide whether to check for a "hierarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.
@jerboaa:
> There is also this OSContainer hybrid ;-)
I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.
@tstuefe:
> Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.
It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.
> Hi @tstuefe,
> Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.
> The structuring you lay out in this comment sounds good to me, with some caveats:
> In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).
> Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, whether the values we are reading exceed physical memory, because that's used to decide whether to check for a "hierarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.
Not necessarily. You could hand in an upper reasonable bound as argument.
But I'd be happy if we can get at least to a consistent layering like this:
- os::Linux::get_xxx_memory -> the bottom layer, looks at host values
- Cgroup subsystem -> middle layer, has access to os::Linux::get_xxx_memory
- os::get_xxxx_memory() -> top, uses both Cgroup subsystem and os::Linux::get_xxx_memory.
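As a rough sketch of the "hand in an upper bound" idea, with a hypothetical signature and a placeholder for the GET_CONTAINER_INFO plumbing:

// Hypothetical sketch: the caller supplies an upper bound (e.g. host
// physical memory), so the cgroup layer never has to call back up into
// os::Linux:: itself. read_raw_limit() stands in for GET_CONTAINER_INFO.
jlong CgroupV1Subsystem::read_memory_limit_in_bytes(julong upper_bound) {
  julong memlimit = read_raw_limit();   // placeholder for the real read
  if (memlimit >= upper_bound) {
    return -1;                          // unset or bogus: "no limit", caller decides
  }
  return (jlong)memlimit;
}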
> @jerboaa:
> There is also this OSContainer hybrid ;-)
> I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.
+1
> @tstuefe:
> Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.
> It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.
Okay, let's ship this.
/integrate
@jmtd Your change (at version af509e152353d40d40bd2107454bc87f77c2e775) is now ready to be sponsored by a Committer.
/sponsor
@jmtd Please file an RFE ticket for the proposed API cleanup. Thanks!