JAVA heap memory limits not respected after 9.2.6 upgrade
Dear all,
A 13% increase in memory usage on the pools has been observed with dCache 9.2.6. This is seen on different pools in the system. Here is an example.
The pool is configured with the following values, which are the same before and after the upgrade:
dcache.java.memory.heap=6144m
dcache.java.memory.direct=6144m
[${host.name}eightDomain]
[${host.name}eightDomain/pool]
pool.name=${host.name}_8
pool.tags=hostname=BCF
pool.path=/data/dcache_pool_8/pool
pool.wait-for-files=${pool.path}/data
When enabling JMX to monitor the JVM, this is observed:
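For reference, a minimal sketch of how remote JMX can be enabled for a domain; passing extra JVM flags via dcache.java.options.extra is an assumption to verify against your dCache defaults, the port number is arbitrary, and an unauthenticated port like this should only be opened on a trusted network:
dcache.java.options.extra=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9090 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false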
Other OS-level tools show that the actual memory usage is also higher than it was before 9.2.6 (i.e., on 8.2.x).
It is not clear to me why the limits are not respected. Please note that there are no relevant changes on the OS, and the Java version is the same (Java 11).
All the best, Carlos
Carlos,
Please check the domain log. On startup it prints the parameters passed to the JVM:
2023-05-02 19:49:03 Launching /usr/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m ....
Do you see your setting 6144m passed there?
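A quick way to check, assuming the default log location /var/log/dcache/<domain>.log seen elsewhere in this thread:
grep Launching /var/log/dcache/<domain>.log | tail -1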
Hello,
Indeed.
2024-02-05 10:28:48 Launching /usr/bin/java -server -Xmx6144m -XX:MaxDirectMemorySize=6144m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dc235eightDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addr ....
FYI: for security reasons, the RT ticket [www.dcache.org #10590] ("Re: JAVA heap memory limits not respected after 9.2.6 upgrade", GitHub issue #7507) is now being used, as the dCache developers have requested Java-related dumps.
Hi Carlos,
following up on this:
- When you say "JAVA heap memory limits not respected", do you mean (1) that the JVM uses more memory than specified by -Xmx6144m -XX:MaxDirectMemorySize=6144m?
- Or (2) that pool memory usage is up 10%?
Which is it, (1), (2), or both? If it is (1), then this must be a Java issue. If it is (2), it must be a consequence of creating transfer buffers on startup.
Yes, it is more than specified. See an example here for the running pools.
See the process:
root 271305 0.0 0.0 112808 976 pts/0 S+ 09:16 0:00 \_ grep --color=auto 52204
root 52204 14.9 2.8 49247004 11100572 ? Ssl Feb09 5496:11 /usr/bin/java -server -Xmx6144m -XX:MaxDirectMemorySize=6144m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dc241eightDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addresses=system --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.text=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED --add-opens=java.base/java.math=ALL-UNNAMED --add-opens java.base/sun.nio.fs=ALL-UNNAMED -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start dc241eightDomain
The issue might be addressed by commit 8f3b984f
Hello,
Thank you. 9.2.14 was deployed on the pools of the integration instance.
So far it looks promising.
Hi Carlos. Interestingly, the potential fix mentioned by Tigran is not yet in 9.2.14; it still needs to be merged.
Hi, thank you for clarifying.
For now, let us see how the memory evolves over time.
All the best, Carlos
I took one of the recently decommissioned pool hosts and upgraded it from 9.2.6 to 9.2.15.
It looks better.
Upgraded a production host with 15 pools to dCache 9.2.15.
@cfgamboa Are there any updates on this? Do you still have this issue, and did pools fail with out-of-memory exceptions?
Hello,
No, the pools did not crash; out-of-memory errors have not been observed so far.
Nothing has changed with respect to what was already reported.
All the best, Carlos
If pools don't crash, what is the issue?
The issue is the large increase in memory usage after deploying 9.2. There is a 13% increase in memory, as initially reported at the beginning of the ticket. Since then, in some cases we have observed less than 3% free memory available to the system.
While the pools are not crashing, the memory used by dCache increased substantially after the upgrade. This is an example of the memory used on the server and the memory allocated to the process: https://github.com/dCache/dcache/issues/7507#issuecomment-1980971444
All the best, Carlos
Hi Carlos.
Following up from the Tier 1: is it fair to say that you would be satisfied with a specific reason from our side why the required pool memory has increased? And that the reason for your concern is a matter of storage budget, meaning that you want us to acknowledge that we know of the change? Of course it would be preferable to all if the memory increase went away, but the situation is difficult to analyse and is unlikely to be resolved with a quick fix.
Kind regards Lea
Hi Lea,
If the cause of the increase in memory usage is not known yet and is under review by you, this is fair enough; at least I know you are aware of it. If this is expected, then of course we will need to take it into account for future requisitions.
Currently, the criticality is not high as the service is not disrupted by out-of-memory errors.
All the best, Carlos
With these parameters
dcache.java.memory.heap=6144m
dcache.java.memory.direct=6144m
The maximum memory usage for a Java process should be the sum of the two, I think:
6144 MB (heap) + 6144 MB (direct) = 12288 MB = 12.288 GB
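For completeness, this sum only covers heap plus direct buffers; a rough, hedged accounting of a JVM process's resident memory also includes other regions:
RSS ≈ heap (≤ Xmx) + direct buffers (≤ MaxDirectMemorySize) + metaspace + threads × stack size + code cache + GC/JVM overhead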
One of our data servers shows the following per pool process:
Domain Memory usage
dc268_1-Domain 13.047431945800781 GB
dc268_10-Domain 3.1094741821289062 GB
dc268_11-Domain 1.9457817077636719 GB
dc268_12-Domain 3.3018569946289062 GB
dc268_2-Domain 12.515079498291016 GB
dc268_3-Domain 13.274280548095703 GB
dc268_4-Domain 12.742572784423828 GB
dc268_5-Domain 12.708694458007812 GB
dc268_6-Domain 12.807502746582031 GB
dc268_7-Domain 11.783977508544922 GB
dc268_8-Domain 12.699535369873047 GB
dc268_9-Domain 12.95401382446289 GB
In some cases, the memory usage exceeds the combined heap and direct memory settings. This is what Carlos reported; it was not the case before the 9.x release series.
The memory usage is important for resource provisioning; more memory seems to be needed for the next purchase.
This is more complicated, see: https://stackoverflow.com/questions/15282178/java-using-up-far-more-memory-than-allocated-with-xmx
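One way to see where memory beyond the heap goes is JDK Native Memory Tracking, sketched here; the flag and the jcmd subcommand are standard JDK 11 features, but passing the flag via dcache.java.options.extra is an assumption, and NMT adds some overhead:
# add to the domain's JVM options and restart the domain:
-XX:NativeMemoryTracking=summary
# then query the running pool process:
jcmd <pid> VM.native_memory summary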
You seem to be concerned about "what if the sum of memory used by all pools reaches 100%", and about how much memory you need to provision in the future.
On the first concern: can you allocate memory to the pools so that the sum is exactly the physical node memory size, then run like that and monitor swap usage? Do you have swap activity?
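For the swap-activity check, a standard tool suffices (vmstat is part of procps):
vmstat 5
# sustained non-zero values in the si/so (swap-in/swap-out) columns indicate active swapping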
Second, for instance at Fermilab, the pool nodes we buy have more memory than we ever use:
[root@pubstor2301 ~]# free
total used free shared buff/cache available
Mem: 263150780 36593124 2243884 26120 228000980 226557656
Swap: 4194300 26880 4167420
And when the pool nodes are bought we get "whatever is similar to what we have, but current gen."
Just a real quick look:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1134414 root 20 0 39.3g 6.6g 13416 S 250.0 2.6 46586:59 java
1134551 root 20 0 26.0g 5.1g 15412 S 181.2 2.0 13654:32 java
1134287 root 20 0 39.0g 6.3g 17628 S 12.5 2.5 434:35.21 java
1134348 root 20 0 18.6g 2.1g 18384 S 6.2 0.8 202:57.73 java
Each pool has:
dcache.java.memory.heap=4096m
dcache.java.memory.direct=4096m
Are you speaking about virtual memory or resident? As we can see above, RSS exceeds 4 GB by 50% in some cases. Does it create any problems? It does not look like it.
Memory has never been a concern on our installation. From time to time we run out of java.memory.heap due to the number of files, and we typically double java.memory.heap when this happens.
This is more complicated, see: https://stackoverflow.com/questions/15282178/java-using-up-far-more-memory-than-allocated-with-xmx
Thanks
You seem to be concerned about "what if the sum of memory used by all pools reaches 100%", and about how much memory you need to provision in the future.
Yes, the increase in memory usage with the latest release triggered some discussion about the configuration and the specification of the pool servers.
On the first concern: can you allocate memory to the pools so that the sum is exactly the physical node memory size, then run like that and monitor swap usage? Do you have swap activity?
Second, for instance at Fermilab, the pool nodes we buy have more memory than we ever use:
[root@pubstor2301 ~]# free
total used free shared buff/cache available
Mem: 263150780 36593124 2243884 26120 228000980 226557656
Swap: 4194300 26880 4167420
For the mentioned host, I have:
[root@dc268 ~]# free
total used free shared buff/cache available
Mem: 263009792 245740016 10360080 3972192 6909696 10684272
Swap: 8388604 8019584 369020
And when the pool nodes are bought we get "whatever is similar to what we have, but current gen."
Just a real quick look:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1134414 root 20 0 39.3g 6.6g 13416 S 250.0 2.6 46586:59 java
1134551 root 20 0 26.0g 5.1g 15412 S 181.2 2.0 13654:32 java
1134287 root 20 0 39.0g 6.3g 17628 S 12.5 2.5 434:35.21 java
1134348 root 20 0 18.6g 2.1g 18384 S 6.2 0.8 202:57.73 java
Each pool has:
dcache.java.memory.heap=4096m
dcache.java.memory.direct=4096m
Are you speaking about virtual memory or resident? As we can see above, RSS exceeds 4 GB by 50% in some cases. Does it create any problems? It does not look like it.
Correct. The pools run without errors.
Memory has never been a concern on our installation. From time to time we run out of java.memory.heap due to the number of files, and we typically double java.memory.heap when this happens.
I don't think we have observed this.
Also!
The numbers I quoted were for our public system, which runs 8.2! So even on 8.2, resident memory > sum(heap + direct).
On CMS, which runs 9.2.11 (plus some patches; I do not always wait for official releases):
[root@cmsstor925 ~]# free
total used free shared buff/cache available
Mem: 195994608 44024360 1993640 5008 152738004 151970248
Swap: 4194300 0 4194300
DOMAIN STATUS PID USER LOG
w-cmsstor925-disk-disk1Domain running (for 4 weeks) 25630 root /var/log/dcache/w-cmsstor925-disk-disk1Domain.log
w-cmsstor925-disk-disk2Domain running (for 4 weeks) 25640 root /var/log/dcache/w-cmsstor925-disk-disk2Domain.log
w-cmsstor925-disk-disk3Domain running (for 4 weeks) 25651 root /var/log/dcache/w-cmsstor925-disk-disk3Domain.log
w-cmsstor925-disk-disk4Domain running (for 4 weeks) 25683 root /var/log/dcache/w-cmsstor925-disk-disk4Domain.log
[root@cmsstor925 ~]# ps -o pid,vsz,rss,size -p 25630,25640,25651,25683
PID VSZ RSS SIZE
25630 28873688 9177608 10473540
25640 28537760 8512420 9990132
25651 28563492 7310116 8837064
25683 28612632 8741532 10224020
dcache.java.memory.heap=5096m
dcache.java.memory.direct=5096m
All nodes (on public, in the previous message) and this one run Alma9. So, as you can see, 8.2 and 9.2 exhibit the same pattern. Is it a problem? I do not think so.
P.S.: What strikes me as curious: CMS shows 0 swap activity, as it should. Why public shows >0 swap, I have no idea. It should not be swapping at all.
@DmitryLitvintsev thank you for your feedback. What is the Java version running there?
@vingar
you quoted:
[root@dc268 ~]# free
total used free shared buff/cache available
Mem: 263009792 245740016 10360080 3972192 6909696 10684272
Swap: 8388604 8019584 369020
This actually does not look right to me. 234 GB used out of 256 (in our case, only 34 GB is used out of 250). The sum of the memory sizes of the pool processes is 122.89 GB. Where did the other 110 GB go?
In our case, 34 GB is roughly the sum of the Java processes' resident sizes.
@DmitryLitvintsev thank you for your feedback. What is the Java version running there?
[root@cmsstor925 ~]# java -version
openjdk version "11.0.23" 2024-04-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.23.0.9-2) (build 11.0.23+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.23.0.9-2) (build 11.0.23+9-LTS, mixed mode, sharing)
[root@cmsstor925 ~]#
and in Public:
[root@pubstor2301 ~]# java -version
openjdk version "11.0.22" 2024-01-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS, mixed mode, sharing)
[root@pubstor2301 ~]#
This image shows a server with dCache running and then stopped.
Our Java version is the same:
[root@dc264 ~]# java -version
openjdk version "11.0.22" 2024-01-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS, mixed mode, sharing)
It seems that after the upgrade the system is using the full dcache.java.memory.heap=6144m and dcache.java.memory.direct=6144m,
which is reported as an aggregate of 12.2 GB.
I looked into other, less loaded systems running 9.2; there, RES is 2.7g with dcache.java.memory.heap=2048m and dcache.java.memory.direct=2048m:
root 8782 1.2 0.5 16994924 2846512 ? Ssl Jun02 435:30 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=2048m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,50000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dcqos002nineDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addresses=system --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.text=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED --add-opens=java.base/java.math=ALL-UNNAMED --add-opens java.base/sun.nio.fs=ALL-UNNAMED -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start dcqos002nineDomain
[root@dcqos002 ~]# top -p 8782
top - 20:33:22 up 568 days, 5:53, 23 users, load average: 0.39, 0.33, 0.46
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.2 sy, 0.0 ni, 99.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52808425+total, 1738380 free, 76338176 used, 45000768+buff/cache
KiB Swap: 8388604 total, 7877304 free, 511300 used. 45033398+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8782 root 20 0 16.2g 2.7g 23628 S 0.7 0.5 435:31.31 java
Even though the memory allowed is 2 GB for heap and 2 GB for direct, the system does not seem to use it all.
In the example reported here, the system is using the full 12.2 GB; what is not clear is why, before the upgrade, it did not have to use it all.
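One way to distinguish committed-but-idle heap from heap actually in use, as a sketch (jcmd ships with the JDK); note that with JDK 11's default G1 settings, heap that has grown under load is rarely returned to the OS, so busy pools tend to plateau at -Xmx:
jcmd <pid> GC.heap_info   # used vs committed heap regions
jcmd <pid> VM.flags       # effective -Xmx and related flags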
@DmitryLitvintsev you got a good point there. Where does the rest of the memory go...
# ps -eo comm,rss | awk '{array[$1]+=$2} END {for (i in array) print i, array[i]/1024 " MB"}' | sort -k2 -nr
java 125745 MB
pmdaproc 461.539 MB
tuned 87.9258 MB
systemd-journal 73.4453 MB
pmdalinux 70.8945 MB
..
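One caveat on summing RSS per command name: shared pages are counted once per process, so the java total can overstate real usage. A PSS-based view avoids this, assuming smem is installed:
smem -t -P java   # the PSS column splits shared page costs proportionally between processes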
I'll follow up at BNL. Thanks for spotting this.