[Bug] JobManager frequent GC causes Yarn container memory overflow
Search before asking
- [X] I had searched in the issues and found no similar issues.
Java Version
1.8.0_212
Scala Version
2.12.x
StreamPark Version
2.0.0
Flink Version
1.15.4
Deploy Mode
yarn-application
What happened
When I submit a Flink on YARN job (yarn-application mode) through StreamPark, the JobManager's memory parameters are as follows:

```
jobmanager.memory.heap.size            469762048b
jobmanager.memory.jvm-metaspace.size   268435456b
jobmanager.memory.jvm-overhead.max     201326592b
jobmanager.memory.jvm-overhead.min     201326592b
jobmanager.memory.off-heap.size        134217728b
jobmanager.memory.process.size         1024mb
```
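For reference, these values are consistent with Flink's JobManager memory model, in which heap, off-heap, metaspace, and JVM overhead add up to the configured process size:

```
448 MB (heap) + 128 MB (off-heap) + 256 MB (metaspace) + 192 MB (JVM overhead) = 1024 MB (process size)
```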
After the job runs for a period of time (roughly 3 to 20 days), the container running the JobManager is eventually killed by the ResourceManager. I then enabled GC logging for the JobManager and found that it performs a young-generation GC roughly every 2 minutes, for example:
```
2023-08-30T13:56:57.694+0800: [GC (Allocation Failure) [PSYoungGen: 149956K->1673K(150528K)] 315127K->166876K(456704K), 0.0138514 secs] [Times: user=0.54 sys=0.05, real=0.02 secs]
2023-08-30T13:59:17.558+0800: [GC (Allocation Failure) [PSYoungGen: 150141K->1636K(150528K)] 315344K->166871K(456704K), 0.0285263 secs] [Times: user=1.20 sys=0.11, real=0.03 secs]
...
2023-08-30T14:47:54.412+0800: [GC (Allocation Failure) [PSYoungGen: 148425K->1700K(150016K)] 314796K->168135K(456192K), 0.0258613 secs] [Times: user=0.96 sys=0.06, real=0.03 secs]
2023-08-30T14:50:12.434+0800: [GC (Allocation Failure) [PSYoungGen: 149138K->1156K(150016K)] 315573K->167607K(456192K), 0.0233593 secs] [Times: user=0.77 sys=0.07, real=0.03 secs]
```
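For reference, GC logging on the JobManager can be enabled with JDK 8 flags along these lines (a sketch; the log path is a placeholder, and the flags are passed via Flink's env.java.opts.jobmanager option):

```
# flink-conf.yaml - sketch of JDK 8 GC logging for the JobManager JVM
# (the log path below is a placeholder, not from the original report)
env.java.opts.jobmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/jobmanager-gc.log
```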
To understand the cause of the JobManager's frequent GC, I dumped the JobManager's Java heap to a local file and opened it in VisualVM for analysis. char[] instances occupy the largest share of heap memory, as shown in the attached VisualVM screenshot.
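The dump can be taken roughly as follows (a sketch, assuming JDK 8 tooling on the JobManager host; the PID and output path are placeholders):

```
# Dump live objects from the JobManager JVM with JDK 8 jmap;
# <jobmanager_pid> and the output path are placeholders.
jmap -dump:live,format=b,file=/tmp/jobmanager-heap.hprof <jobmanager_pid>
```

The resulting .hprof file can then be opened directly in VisualVM.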
What could cause this? When the same job is submitted from the command line with FLINK_HOME/bin/flink run -t yarn-per-job (with exactly the same memory parameters as above), the JobManager does not accumulate nearly as many char[] objects and only performs a young-generation GC about once every 40 minutes, which looks normal.
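For comparison, the command-line submission looked roughly like this (a sketch; the entry class and jar path are placeholders, not from the original report):

```
# Sketch of the per-job CLI submission used for comparison (Flink 1.15 on YARN).
# The entry class and jar path below are placeholders.
$FLINK_HOME/bin/flink run -t yarn-per-job \
  -Djobmanager.memory.process.size=1024mb \
  -c com.example.MyStreamingJob \
  /path/to/my-streaming-job.jar
```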
As for the frequent container kills themselves, we plan to set jobmanager.memory.enable-jvm-direct-memory-limit = true to avoid exceeding the memory limit. Is this parameter known to help against containers being killed for going over their memory limit?
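My understanding of how that setting would be applied (a minimal flink-conf.yaml sketch; the off-heap size shown matches the values above):

```
# Sketch (flink-conf.yaml): cap JVM direct memory on the JobManager.
# When enabled, Flink sets -XX:MaxDirectMemorySize to the value of
# jobmanager.memory.off-heap.size, so runaway direct-memory allocation fails
# with an OutOfMemoryError instead of silently pushing the container past
# its YARN limit.
jobmanager.memory.enable-jvm-direct-memory-limit: true
jobmanager.memory.off-heap.size: 128mb
```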
Error Exception
Failing this attempt. Diagnostics: [2023-08-22 08:49:10.443] Container [pid=77475,containerID=container_e08_1683881703260_1165_01_000001] is running 9510912B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used. Killing container.
Screenshots
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR! (Are you willing to contribute this PR?)
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thank you for the detailed feedback. StreamPark is only a platform for managing and submitting Flink jobs; you may need to look into the Flink job itself to investigate the root cause further.