openj9 21.0.7 oom

Open lp2jx0309 opened this issue 6 months ago • 13 comments

After upgrading OpenJ9 from 21.0.3 to 21.0.7, the application unexpectedly runs into an OutOfMemoryError (OOM). While several threads are operating concurrently, the reported size of a ConcurrentHashMap suddenly jumps from 1 to the maximum int value (2147483647). The set is created with Guava's Sets.newConcurrentHashSet(), which is backed by a ConcurrentHashMap. The OOM itself is thrown in the HashMap underlying Sets.newHashSet(allocStacks): because the source set reports the maximum int value as its size, the new HashSet ends up with that size as well.

Why does the size of the ConcurrentHashMap suddenly jump from 1 to the maximum int value during concurrent thread operations? This only happens when the shared class cache is used; if the shared class cache is removed, it does not occur.

ConnectionAllocMonitor:

  private Set<ConnectionAllocStack> allocStacks = Sets.newConcurrentHashSet();

  private long lastOutTime = 0;

  public void addConnectionAllocStack(ConnectionAllocStack allocStack) {
    allocStacks.add(allocStack);
    logger.info("====" + this + ": " + allocStacks.size());
    // The original post shows a dangling '&' after poolMaxSize; the rest of the condition was cut off.
    if (allocStacks.size() >= poolMaxSize) {
      // Snapshot copy of the concurrent set; this is where the OOM is reported.
      Set<ConnectionAllocStack> tempallocStacks = Sets.newHashSet(allocStacks);
      logger.info(">>>>>>>>> connection pool is full ,output statck info:");
      tempallocStacks.stream().forEach(stack -> {
        // body empty in the original post
      });
      logger.info(">>>>>>>>> end connection output statck");
      lastOutTime = System.currentTimeMillis();
    }
  }
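The copy in the method above is built from a set whose reported size may be wrong, and a copy that presizes itself from the source's size() can then request a very large backing table. As a stopgap while the root cause is investigated, the snapshot could be taken by iteration with a cap. This is only a minimal sketch: the class name BoundedSnapshot, the helper snapshot and the maxElements parameter are hypothetical and not part of the original code, and ConcurrentHashMap.newKeySet() stands in for Guava's Sets.newConcurrentHashSet().

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BoundedSnapshot {
    // Copies by iteration, so the copy is never presized from source.size().
    // maxElements is a hypothetical safety cap, not something from the original code.
    static <T> Set<T> snapshot(Set<T> source, int maxElements) {
        Set<T> copy = new HashSet<>();
        for (T item : source) {
            if (copy.size() >= maxElements) {
                break; // stop once the cap is reached
            }
            copy.add(item);
        }
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for the allocStacks field; element type simplified to String.
        Set<String> allocStacks = ConcurrentHashMap.newKeySet();
        allocStacks.add("stack-a");
        allocStacks.add("stack-b");
        allocStacks.add("stack-c");
        System.out.println("snapshot size: " + snapshot(allocStacks, 2).size());
    }
}
```

This does not address why the size is reported incorrectly; it only keeps the diagnostic path from sizing a new collection based on that value.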

log:

2025-05-30 09:56:14 453 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][Thread-9] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@234b0671: 1
2025-05-30 09:56:14 462 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][Thread-9] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@556d58e7: 1
2025-05-30 09:56:14 652 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][Thread-9] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@556d58e7: 1
2025-05-30 09:56:21 946 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][app-executor-lv:1-idx:1] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@556d58e7: 2147483647
2025-05-30 09:56:21 946 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][app-executor-lv:1-idx:3] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@556d58e7: 2147483647
2025-05-30 09:56:21 947 INFO [com.tmp.db.jdbi.ConnectionAllocMonitor][app-executor-lv:1-idx:6] - ====com.tmp.db.jdbi.ConnectionAllocMonitor@556d58e7: 2147483647
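The value 2147483647 in the log is Integer.MAX_VALUE. ConcurrentHashMap.size() returns an int and clamps its internal long counter to that value, so the log only says the counter is at least that large, not what it really is. A possible way to narrow this down is to log the unclamped mappingCount() and an element count obtained by iteration next to size(). The sketch below assumes direct access to a JDK key set; the class SizeCheck and its helpers are hypothetical, and ConcurrentHashMap.newKeySet() again stands in for Guava's Sets.newConcurrentHashSet().

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SizeCheck {
    // Mirrors the clamping that ConcurrentHashMap.size() applies to its long counter.
    static int clampToInt(long count) {
        if (count < 0L) {
            return 0;
        }
        return count > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) count;
    }

    // Counts elements by walking the set instead of trusting the stored counter.
    static long countByIteration(Set<?> set) {
        long n = 0;
        for (Object ignored : set) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ConcurrentHashMap.KeySetView<String, Boolean> stacks = ConcurrentHashMap.newKeySet();
        stacks.add("stack-1");

        // Three views of the same quantity; on a healthy map they agree.
        System.out.println("size()         = " + stacks.size());
        System.out.println("mappingCount() = " + stacks.getMap().mappingCount());
        System.out.println("iterated       = " + countByIteration(stacks));

        // Any counter value above Integer.MAX_VALUE is reported as Integer.MAX_VALUE.
        System.out.println("clamped        = " + clampToInt(1L << 40));
    }
}
```

If the iterated count stays small while size() reports Integer.MAX_VALUE, that would point at the map's counter rather than at the number of stored elements.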

lp2jx0309 avatar May 30 '25 02:05 lp2jx0309

Does using the shared class cache with the option -Xnoaot resolve the problem?

pshipton avatar May 30 '25 13:05 pshipton

Also as a separate test, pls delete the existing shared class cache, and then run with -XX:-ShareOrphans along with a shared class cache to see if that resolves the problem.

pshipton avatar May 30 '25 13:05 pshipton

Does using the shared class cache with the option -Xnoaot resolve the problem?

Yes, running with -Xnoaot resolves the problem.

lp2jx0309 avatar Jun 03 '25 02:06 lp2jx0309

Also as a separate test, pls delete the existing shared class cache, and then run with -XX:-ShareOrphans along with a shared class cache to see if that resolves the problem.

Without the shared class cache there is no OOM, but as soon as the shared class cache is added, the OOM occurs. What is the cause of the OOM?

lp2jx0309 avatar Jun 03 '25 02:06 lp2jx0309

If using -Xnoaot solves the problem, then there is an issue with the AOT code in the shared cache.

Is it possible to get a reproducible test case?

@hzongaro fyi

pshipton avatar Jun 03 '25 03:06 pshipton

If using -Xnoaot solves the problem, then there is an issue with the AOT code in the shared cache.

Is it possible to get a reproducible test case?

@hzongaro fyi

I cannot provide a reproducible test case. In my setup, the class cache is created during the Docker image build and then used directly in the business container. If the class cache is regenerated inside the business container and reused there, the OOM does not occur.

lp2jx0309 avatar Jun 03 '25 06:06 lp2jx0309

By process of elimination, I found that the problem was introduced by this commit: https://github.com/eclipse-openj9/openj9/pull/20937/commits/83b0f899e5cf2fbd2bc26fdafc1b1df696f7a8ee @pshipton

lp2jx0309 avatar Jun 04 '25 03:06 lp2jx0309

This is https://github.com/eclipse-openj9/openj9/pull/20937

What platform are you running on?

@keithc-ca fyi

pshipton avatar Jun 04 '25 11:06 pshipton

What's the progress on this issue? I have also run into this problem.

DHbigfart avatar Jun 12 '25 02:06 DHbigfart

@DHbigfart pls confirm which platform and version you are running on.

pshipton avatar Jun 12 '25 12:06 pshipton

This is #20937

What platform are you running on?

@keithc-ca fyi

linux x86

lp2jx0309 avatar Jun 13 '25 01:06 lp2jx0309

@DHbigfart pls confirm which platform and version you are running on.

<attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
  <attribute name="maxHeapSize" value="0x180000000" />
  <attribute name="initialHeapSize" value="0x60000000" />
  <attribute name="compressedRefs" value="true" />
  <attribute name="compressedRefsDisplacement" value="0x0" />
  <attribute name="compressedRefsShift" value="0x3" />
  <attribute name="pageSize" value="0x1000" />
  <attribute name="pageType" value="not used" />
  <attribute name="requestedPageSize" value="0x1000" />
  <attribute name="requestedPageType" value="not used" />
  <attribute name="gcthreads" value="4" />
  <attribute name="gcthreads Concurrent Mark" value="1" />
  <attribute name="packetListSplit" value="1" />
  <attribute name="cacheListSplit" value="1" />
  <attribute name="splitFreeListSplitAmount" value="1" />
  <attribute name="numaNodes" value="0" />
  <system>
    <attribute name="physicalMemory" value="8589934592" />
    <attribute name="addressablePhysicalMemory" value="8589934592" />
    <attribute name="container memory limit set" value="true" />
    <attribute name="numCPUs" value="8" />
    <attribute name="numCPUs active" value="4" />
    <attribute name="architecture" value="amd64" />
    <attribute name="os" value="Linux" />
    <attribute name="osVersion" value="5.10.134-13.1.zncgsl6.x86_64" />
  </system>
  <vmargs>
    <vmarg name="-Xlockword:mode=default,noLockword=java/lang/String,noLockword=java/util/MapEntry,noLockword=java/util/HashMap$Entry,noLockword..." />
    <vmarg name="-XX:+EnsureHashed:java/lang/Class,java/lang/Thread" />
    <vmarg name="-Xjcl:jclse29" />
    <vmarg name="-Djava.class.path=." />
    <vmarg name="-Xms1536m" />
    <vmarg name="-Xmx6144m" />
    <vmarg name="-Xdump:heap:events=systhrow+user,filter=java/lang/OutOfMemoryError,request=exclusive+prepwalk+compact,label=/home/zenap/dump/du..." />
    <vmarg name="-Xdump:none" />
    <vmarg name="-Xdump:system:events=gpf+abort+traceassert,range=1..0,priority=999,request=serial,label=/home/zenap/dump/core-dump-2025-05-27-0..." />
    <vmarg name="-Xdump:heap:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1,priority=500,request=exclusive+compact+prepwalk,label=..." />
    <vmarg name="-Xdump:heap:events=user,priority=500,request=exclusive+compact+prepwalk,label=/home/zenap/dump/dump-dump-user-2025-05-27-01-47-..." />
    <vmarg name="-Xdump:java:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1,priority=400,request=exclusive+preempt,label=/home/zen..." />
    <vmarg name="-Xdump:java:events=gpf+abort+traceassert+user,priority=400,request=exclusive+preempt,label=/home/zenap/dump/javacore-dump-2025-..." />
    <vmarg name="-Xdump:snap:events=systhrow,filter=java/lang/OutOfMemoryError,range=1..1,priority=300,request=serial,label=/home/zenap/dump/sna..." />
    <vmarg name="-Xdump:snap:events=gpf+abort+traceassert,priority=300,request=serial,label=/home/zenap/dump/snap-dump-2025-05-27-01-47-29.%seq...." />
    <vmarg name="-Xverbosegclog:/home/zenap/gclog/gc-2025-05-27-01-47-29.log,1,10000" />
    <vmarg name="-Xquickstart" />
    <vmarg name="-Dfile.encoding=UTF-8" />
    <vmarg name="-Xlp:objectheap:pagesize=4K" />
    <vmarg name="-Xlp:codecache:pagesize=4K" />
    <vmarg name="-XX:+UseContainerSupport" />
    <vmarg name="-Xshareclasses:name=ZenapWarmCache,cacheDir=/sharedclasscache,readonly" />
    <vmarg name="-Dotel.metrics.exporter=none" />
    <vmarg name="-Xdump:tool:events=systhrow,opts=ASYNC,filter=java/lang/OutOfMemoryError,exec=sleep 120s &amp;&amp; kill %pid &amp;&amp; sleep ..." />
    <vmarg name="-Dsun.java.launcher=SUN_STANDARD" />
    <vmarg name="-Dsun.java.launcher.pid=1" />
  </vmargs>

DHbigfart avatar Jun 13 '25 06:06 DHbigfart

To proceed we need either a system core file produced at the time of the OOM, or a reproducible test case. Even a javacore file would be helpful.

pshipton avatar Jun 13 '25 12:06 pshipton

Moving this out for now until we have a core file or some other dump file to work with.

hzongaro avatar Aug 13 '25 19:08 hzongaro