ods-jenkins-shared-library
releasemanager (with EDP, JIRA): memory used/consumed during a run is not reclaimed / freed up after the run
We run Jenkins with EDP for document generation and can see that memory used during a run is not reclaimed / freed up afterwards.

The run started at 5:44 pm and was done at 11 pm CET - as you can see, there is no GC / reclaim.
@braisvq1996 @albertpuente
@clemensutschig we looked into this. What we detected is that the available memory is immediately consumed upon start, starting with 10 GB and quickly going up to 14 GB. We also see regular garbage collector activity and that Jenkins' underlying Hudson implementation keeps up to 5 M references to ArrayList$Itr objects, which are small though and only sum up to roughly 150 MB.
We do see regular suspension times induced by the GC, which is related to taking objects from the Eden Space into the Old Gen. I assume that our configuration has something to do with this, but it came as a recommendation from CloudBees, see https://github.com/opendevstack/ods-core/issues/412.
@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?
hey @metmajer this is interesting - do we have a view of what is stored in these extensionLists? I took a look at what the docs say about this ... maybe here is our culprit?!
@clemensutschig yes, might make a difference, but could depend on the GC. Let's not take for granted what CloudBees recommended some time ago. The following article might be interesting: https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector
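(For the GC side specifically, switching the collector is just a JVM flag. A minimal sketch, assuming it is fed through the JAVA_TOOL_OPTIONS that the container already picks up - see the startup log further down - and a JDK 11 master for the unified GC logging:)
# try G1 instead of what the JVM ergonomics picked, and log GC activity for comparison
export JAVA_TOOL_OPTIONS="$JAVA_TOOL_OPTIONS -XX:+UseG1GC -Xlog:gc*:file=/var/lib/jenkins/gc.log"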
@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?
Hi there. Yes, it should affect memory consumption because we will stop using some shared library functionality (which will be removed later), but we do not know yet if it will really be the improvement we are looking for.
We are requesting access to the monitoring tools you show in this issue. The more details we have, the more precise our answer can be.

4.x with a super simple test case (RM + Jira + one (failing) test component) - it's not a lot per run, but it's definitely accumulating. The start of the graph is a restart of Jenkins (to have a clean base), and then one run after another.
20 runs - one gig of RAM that is not reclaimed.
Here is an overview on the Jenkins configuration we are currently using: https://github.com/opendevstack/ods-core/issues/412#issuecomment-601589479
Even with keeping only 2 failed builds (via build config, and hence deleting the rest), there is no difference:

We see the same growth. If you zoom in, you see GC kicking in and reclaiming 10 or fewer megs after every run, but overall we see a (close to) linear increase. This leads me (for now) to believe that there is a memory leak.
Even over a whole night, no real GC reclaim happened ... I am now trying to drive this thing to an OOM.

hurray :)
Terminated at Mar 9, 2022 1:40:55 PM with exit code 137 (OOMKilled) (5103megs)
@clemensutschig you're the first person that's happy about an OOM :) Are you able to get better insights? Thinking the GC options might need a re-evaluation.
I am trying to get better insights, and am cleaning up all sorts of things in my branch now, to see if that impacts anything.
What's interesting: I have turned on cleanup of failed builds (through build config) - take a look below:
org.jenkinsci.plugins.workflow.job.WorkflowRun finish testbug-cd/testbug-cd-release-master #185 completed: FAILURE
--
| 2022-03-09 13:03:05 INFO io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onCompleted onCompleted job/testbug-cd/job/testbug-cd-release-master/185/
| 2022-03-09 13:03:05 INFO io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Deleting run: testbug-cd/testbug-cd-release-master #183
| 2022-03-09 13:03:05 INFO io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onFinalized onFinalized job/testbug-cd/job/testbug-cd-release-master/185/
| 2022-03-09 13:03:05 INFO io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onDeleted onDeleted job/testbug-cd/job/testbug-cd-release-master/183/
| 2022-03-09 13:03:05 INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls
| 2022-03-09 13:03:05 WARNING io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Unable to delete run testbug-cd/testbug-cd-release-master #183:Unable to delete '/var/lib/jenkins/jobs/testbug-cd/jobs/testbug-cd-release-master/builds/.183'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
| 2022-03-09 13:05:35 INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls
OK - more tries (only keeping 2 failed runs, which the below are). All data is container memory, based on the OCP console:
restart: 3246m mem used
17:27: start FIRST no-JIRA / no-DOCS RM run
17:32: run done - 3440m mem used (+155m)
17:43: start SECOND no-JIRA / no-DOCS RM run (agent pods were removed)
17:49: run done - 3515m mem used (+75m)
17:58: 3510m mem used (GC: -4m)
17:59: start THIRD no-JIRA / no-DOCS RM run (agent pods were removed)
18:06: run done - 3565m mem used (+50m)
18:15: 3561m mem used (GC: -4m)
18:15: start FOURTH no-JIRA / no-DOCS RM run (agent pods were removed)
18:21: run done - 3634m mem used (+70m)
I am trying to understand why this is steadily growing - and I really CANNOT ...
PS - in case someone is curious - I added
} finally {
logger.debug('-- SHUTTING DOWN RM (..) --')
logger.resetStopwatch()
project.clear()
ServiceRegistry.instance.clear()
UnirestConfig.shutdown()
project = null
git = null
logger = null
repos = null
steps = null
}
to the odsOrchestrationStage, at almost the very root ... :( and still nothing changes.
I am really completely and utterly confused: Xmx = 1g (top shows the Java process at 1 gig) and I am happily running component jobs.
However, my overall container memory is still growing ... ?! (container memory set to min 4g / max 5g) - 1 pod and ONE container only.


How come Java goes to 1.3g?
OK - now I used max 3g (limits 3Gi), with 0.5 heap percentage and -Xmx1g.
Top: 1.3g Java, and overall pod memory at 1600m and still growing ...
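(A way to compare what the container accounts for against what the JVM itself reports - the exact paths and the presence of these tools in the image are assumptions:)
# resident set size of the Java process, in KB
ps -o pid,rss,cmd -C java
# what the container's cgroup (v1) thinks is used, in bytes - this also includes page cache etc.
cat /sys/fs/cgroup/memory/memory.usage_in_bytes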
from the logs:
exec java -server -Xmx1g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Duser.home=/var/lib/jenkins -Djavamelody.application-name=JENKINS -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.tasks.MailSender.SEND_TO_UNKNOWN_USERS=true -Dhudson.tasks.MailSender.SEND_TO_USERS_WITHOUT_READ=true -jar /usr/lib/jenkins/jenkins.war
--
| Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true
| Running from: /usr/lib/jenkins/jenkins.war
| webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
VM flags (based on xms / xmx = 1g and metaspace 1g):
-XX:CICompilerCount=2 -XX:CompressedClassSpaceSize=1040187392 -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:MaxMetaspaceSize=1048576000 -XX:MaxNewSize=357892096 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=357892096 -XX:NonNMethodCodeHeapSize=5825164 -XX:NonProfiledCodeHeapSize=122916538 -XX:OldSize=715849728 -XX:ProfiledCodeHeapSize=122916538 -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseSerialGC
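(Note the -XX:+UseSerialGC in there. As an aside, and assuming shell access to the master pod, a flag dump like this can be produced against the running process with jcmd:)
# print the flags the running JVM actually ended up with (pid lookup is an assumption)
jcmd $(pgrep -f jenkins.war) VM.flags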
More diagnosis (I have one test-component ODS job that runs over and over again):

vs a couple minutes later

vs a couple mins later

@clemensutschig I'd like to do an analysis to identify memory leaks. How can I get the heap dump file? Did you generate a heap dump?
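(For reference, one way to take such a dump from inside the master pod - the pid lookup and target path are assumptions:)
# dumps live objects only (triggers a full GC first); copy the file out afterwards, e.g. with oc cp
jcmd $(pgrep -f jenkins.war) GC.heap_dump /var/lib/jenkins/heap.hprof
# alternative via jmap
jmap -dump:live,format=b,file=/var/lib/jenkins/heap.hprof $(pgrep -f jenkins.war)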
native memory tracking:
12:21 UTC
Total: reserved=2773758KB +72012KB, committed=1508758KB +131592KB
12:30 UTC
Total: reserved=2759677KB +57931KB, committed=1506845KB +129679KB
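(The "+..." deltas suggest a baseline/diff workflow; a sketch of the jcmd calls, assuming -XX:NativeMemoryTracking=summary is enabled on the master as in the final flag set further down:)
# record a baseline once ...
jcmd $(pgrep -f jenkins.war) VM.native_memory baseline
# ... then diff against it later; the "+..." numbers above are such deltas
jcmd $(pgrep -f jenkins.war) VM.native_memory summary.diff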
I am now forcing -XX:MaxRAMPercentage=90 ... let's see where that brings us. See below - Class is going up in both reserved and committed, although it's the exact same job running, from the exact same commit!
- Class (reserved=1287679KB +79246KB, committed=267303KB +87822KB)
(classes #31186 +4625)
( instance classes #29931 +4546, array classes #1255 +79)
(malloc=7679KB +1422KB #178934 +31441)
(mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
( Metadata: )
( reserved=231424KB +77824KB, committed=230184KB +78080KB)
( used=188192KB +46996KB)
( free=41992KB +31084KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=1048576KB, committed=29440KB +8320KB)
( used=20731KB +3677KB)
( free=8709KB +4643KB)
( waste=0KB =0.00%)
the same 10 mins later:
- Class (reserved=1287869KB +79436KB, committed=267493KB +88012KB)
(classes #32246 +5685)
( instance classes #30974 +5589, array classes #1272 +96)
(malloc=7869KB +1612KB #182833 +35340)
(mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
( Metadata: )
( reserved=231424KB +77824KB, committed=230184KB +78080KB)
( used=197988KB +56792KB)
( free=32196KB +21288KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=1048576KB, committed=29440KB +8320KB)
( used=21590KB +4536KB)
( free=7850KB +3784KB)
( waste=0KB =0.00%)
10 mins later:
Class (reserved=1296679KB +88246KB, committed=276303KB +96822KB)
(classes #35240 +8679)
( instance classes #33927 +8542, array classes #1313 +137)
(malloc=8487KB +2230KB #193209 +45716)
(mmap: reserved=1288192KB +86016KB, committed=267816KB +94592KB)
( Metadata: )
( reserved=239616KB +86016KB, committed=238376KB +86272KB)
( used=223608KB +82412KB)
( free=14768KB +3860KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=1048576KB, committed=29440KB +8320KB)
( used=24045KB +6991KB)
( free=5395KB +1329KB)
( waste=0KB =0.00%)
and now 20 mins later - something starts to collect
- Class (reserved=1299973KB +91541KB, committed=279981KB +100501KB)
(classes #30457 +3896)
( instance classes #29216 +3831, array classes #1241 +65)
(malloc=7685KB +1429KB #178768 +31275)
(mmap: reserved=1292288KB +90112KB, committed=272296KB +99072KB)
( Metadata: )
( reserved=243712KB +90112KB, committed=242216KB +90112KB)
( used=180605KB +39409KB)
( free=61611KB +50703KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=1048576KB, committed=30080KB +8960KB)
( used=20172KB +3118KB)
( free=9908KB +5842KB)
( waste=0KB =0.00%)
take a look at the class increase ... !!!!
( instance classes #29931 +4546, array classes #1255 +79) vs:
( instance classes #30974 +5589, array classes #1272 +96) vs:
( instance classes #33927 +8542, array classes #1313 +137) vs AFTER COLLECT:
( instance classes #29216 +3831, array classes #1241 +65)
what's interesting:
metadata reserved=243712KB +90112KB grows steadily ...
https://support.cloudbees.com/hc/en-us/articles/360029574172-Metaspace-Leak-Due-to-classes-not-being-cleaned-up
but unfortunately - we are using [Jenkins 2.289.1] :(
Class (reserved=818996KB +191857KB, committed=352564KB +213617KB)
(classes #42887 +21410)
( instance classes #41126 +20650, array classes #1761 +760)
(malloc=12084KB +7537KB #243483 +135258)
(mmap: reserved=806912KB +184320KB, committed=340480KB +206080KB)
( Metadata: )
( reserved=303104KB +184320KB, committed=301568KB +184064KB)
( used=253997KB +145475KB)
( free=47571KB +38589KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=503808KB, committed=38912KB +22016KB)
( used=28052KB +14529KB)
( free=10860KB +7487KB)
( waste=0KB =0.00%)
Running classloader stats (VM.classloader_stats):
Total = 6316 43487 317526016 293961896
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)
Now another run (with shared lib):
Total = 6749 45205 331855872 307339496
and now WITHOUT shared lib import (just echo ...):
Total = 6763 45212 331902976 307372080
and now with shared lib import (and echo .... )
Total = 6787 45219 331950080 307417400
Soooo, this is a classloader reference leak - something hangs on to classloaders - and it reproduces WITHOUT any shared lib; it just gets worse with shared libs ... :/
OK - some more testing with classloader stats.
after start (NO running jobs):
Total = 897 21312 135653376 123512712
after running a naked job (NO shared lib - just echo)
Total = 1010 22726 144394240 132021864
wait a few mins ....
Total = 1038 22744 144835584 132366424
after running a job (with shared lib import and echo)
Total = 1139 22870 145810432 133105960
running the same job again
Total = 1183 22907 146402304 133509704
this is really scary ... :(
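(To watch this over time without manual calls, a tiny watcher along these lines could log the Total line every few minutes - a sketch only, pid lookup and paths are assumptions:)
#!/bin/sh
# append a timestamped classloader-stats total every 5 minutes
PID=$(pgrep -f jenkins.war)
while true; do
  echo "$(date -u +%H:%M) $(jcmd "$PID" VM.classloader_stats | grep '^Total')" >> /var/lib/jenkins/clstats.log
  sleep 300
done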
I started to play with the heap dump and classloaders ... and all the big classloaders have a common ancestor:

If I look into the classloader, this is definitely ours ... (with the grapes being there ...)

But there is a second guy holding on to it (the first is the finalizer of the VM that tries to clean it up) - and that starts to spark my interest ... (found via incoming references)

this is an enum .. hmmmm
The annoying thing is that in a second thread dump hours later these guys are gone ;-( so it must be something else ...
https://issues.jenkins.io/plugins/servlet/mobile#issue/JENKINS-50223
OK - findings:
a) 4g min/max request memory on the container level does it - with the following settings (see the sketch after this list for one way to wire them in):
b) -server -XX:NativeMemoryTracking=summary -XX:MaxRAMPercentage=90 -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M and
c) -Xms1024m -Xmx1g -Dgroovy.use.classvalue=true
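(One way these could be wired into the master - a sketch; whether JAVA_TOOL_OPTIONS is the right hook for a given image is an assumption, it is simply what the startup log above shows being picked up:)
# the flag set from findings b) and c), fed in via JAVA_TOOL_OPTIONS
export JAVA_TOOL_OPTIONS="-server -XX:NativeMemoryTracking=summary -XX:MaxRAMPercentage=90 \
  -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled \
  -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M \
  -Xms1024m -Xmx1g -Dgroovy.use.classvalue=true"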
I am running the release manager over and over now, but so far I could not break it :)
The default setup we provide is definitely NOT good (5g container limit / 4g Xmx and 1g MaxMetaspace) - this is definitely not sized correctly. In my setup (as above) I see:
Top Java process: ~2.9g; Java native memory: Total: reserved=2895963KB +245454KB, committed=1673107KB +309194KB
and the container showing 4g (with the max being 4g).
Still, there is the leak, albeit smaller (we have 40+ of the below), which in fact could be one per run

per run ....

Working assumptions: a) the grape import (https://github.com/opendevstack/ods-core/blob/master/jenkins/master/configuration/init.groovy.d/ods-jenkins-shared-library.groovy#L87) that is done globally (and hence, we guess, in some weird root loader, doing even more weird stuff ... :( ) and b) enums that also hold on to the classloader.
@stitakis will try to refactor (b) while I am on (a) ...
to go here: https://github.com/opendevstack/ods-core/blob/master/jenkins/master/ods-run.sh#L96
sh-4.2$ cat loadgrapes.sh
#!/bin/sh
echo $HOME
ivyhome=/tmp/ivy
unzip -o $HOME/plugins/workflow-cps-global-lib.jpi -d $ivyhome
java -cp $HOME/war/WEB-INF/lib/groovy-all-2.4.12.jar:$ivyhome/WEB-INF/lib/ivy-2.4.0.jar -Dgrape.config=$HOME/.groovy/grapeConfig.xml -Duser.home=$HOME groovy.ui.GroovyMain loadgrapes.groovy
rm -rf $ivyhome
sh-4.2$ cat loadgrapes.groovy
println "INFO: Grabbing grapes to avoid race condition during parallel compilation ..."
groovy.grape.Grape.grab(group: 'com.vladsch.flexmark', module: 'flexmark-all', version: '0.60.2')
groovy.grape.Grape.grab(group: 'fr.opensagres.xdocreport', module: 'fr.opensagres.poi.xwpf.converter.core', version: '2.0.2')
groovy.grape.Grape.grab(group: 'fr.opensagres.xdocreport', module: 'fr.opensagres.poi.xwpf.converter.pdf', version: '2.0.2')
groovy.grape.Grape.grab(group: 'org.apache.pdfbox', module: 'pdfbox', version: '2.0.17')
groovy.grape.Grape.grab(group: 'org.apache.poi', module: 'poi', version: '4.0.1')
groovy.grape.Grape.grab(group: 'net.lingala.zip4j', module: 'zip4j', version: '2.1.1')
groovy.grape.Grape.grab(group: 'org.yaml', module: 'snakeyaml', version: '1.24')
groovy.grape.Grape.grab(group: 'com.konghq', module: 'unirest-java', version: '2.4.03', classifier: 'standalone')
println "Done"
sh-4.2$
OK - update ... I think we have found the leak ... (an assumption still to be proven).
We see circular references from classes loaded via the CleanGroovyClassLoader (=> the RUN classloader) into the AntClassLoader (from the CPS plugin) ...

and

e.g. one example:
OdsProjectMessagesUtil (c2746d78) -> CleanGroovyClassloader (ceeed928) -> AntClassloader (c1339b18)
----references------- --> CPS Function (c2746e30) -> AntClassloader (c1339b18)
----is referenced by <-- ConstantBlock (cc525d40) -> AntClassloader (c1339b18)
Hence, even if the CleanGroovyClassLoader is released, the CPS function will still hold on to it through the shared lib class.
and ...
https://github.com/jenkinsci/workflow-cps-plugin/blob/f522e783e917bc0aa0513a29684285b710163905/src/test/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecutionMemoryTest.java
and https://github.com/jenkinsci/jenkins-test-harness/blob/master/src/main/java/org/jvnet/hudson/test/MemoryAssert.java (which is really scary ... )
What we cannot explain yet is whether this is driven by the shared lib codebase or by a generic Jenkins "feature".