JAVA heap memory limits not respected after 9.2.6 upgrade
Dear all,
A 13% increase in memory usage on the pools has been observed with dCache 9.2.6. This is seen on different pools in the system. Here is an example.
The pool is configured with the following values, which are the same before and after the upgrade:
dcache.java.memory.heap=6144m
dcache.java.memory.direct=6144m
[${host.name}eightDomain]
[${host.name}eightDomain/pool]
pool.name=${host.name}_8
pool.tags=hostname=BCF
pool.path=/data/dcache_pool_8/pool
pool.wait-for-files=${pool.path}/data
When enabling JMX to monitor the JVM, this is observed:
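For reference, a minimal sketch of how remote JMX can be enabled for a domain; passing extra JVM flags via dcache.java.options.extra is an assumption to verify against your dCache defaults, the port number is arbitrary, and an unauthenticated port like this should only be opened on a trusted network:
dcache.java.options.extra=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9090 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false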
Other OS-level tools show that the actual memory usage is also higher than it was before 9.2.6 (i.e., on 8.2.x).
It is not clear to me why the limits are not respected. Please note that there are no relevant changes on the OS, and the Java version is the same (Java 11).
All the best, Carlos
Carlos,
Please check the domain log. On startup it prints the parameters passed to the JVM:
2023-05-02 19:49:03 Launching /usr/bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m ....
Do you see your setting 6144m passed there?
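A quick way to check, assuming the default log location /var/log/dcache/<domain>.log seen elsewhere in this thread:
grep Launching /var/log/dcache/<domain>.log | tail -1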
Hello,
Indeed.
2024-02-05 10:28:48 Launching /usr/bin/java -server -Xmx6144m -XX:MaxDirectMemorySize=6144m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dc235eightDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addr ....
FYI: for security reasons, the RT ticket [www.dcache.org #10590] ("Re: JAVA heap memory limits not respected after 9.2.6 upgrade", GitHub issue #7507) is now being used, as the dCache developers have requested Java-related dumps.
Hi Carlos,
following up on this:
- When you say "JAVA heap memory limits not respected", do you mean (1) that the JVM uses more memory than specified by -Xmx6144m -XX:MaxDirectMemorySize=6144m?
- Or (2) that pool memory usage is up 10%?
Which is it, (1), (2), or both? If it is (1), then this must be a Java issue. If it is (2), it must be a consequence of creating transfer buffers on startup.
Yes, it is more than specified. See an example here for the running pools.
See the process:
root 271305 0.0 0.0 112808 976 pts/0 S+ 09:16 0:00 \_ grep --color=auto 52204
root 52204 14.9 2.8 49247004 11100572 ? Ssl Feb09 5496:11 /usr/bin/java -server -Xmx6144m -XX:MaxDirectMemorySize=6144m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dc241eightDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addresses=system --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.text=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED --add-opens=java.base/java.math=ALL-UNNAMED --add-opens java.base/sun.nio.fs=ALL-UNNAMED -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start dc241eightDomain
The issue might be addressed by commit 8f3b984f
Hello,
Thank you. 9.2.14 was deployed on the pools of the integration instance.
So far it looks promising.
Hi Carlos. Interestingly, the potential fix mentioned by Tigran is not yet in 9.2.14; it still needs to be merged.
Hi, thank you for clarifying.
For now, let us see how the memory evolves over time.
All the best, Carlos
I took one of the recently decommissioned pool hosts and upgraded it from 9.2.6 to 9.2.15.
It looks better.
Upgraded a production host with 15 pools to dCache 9.2.15.
@cfgamboa Are there any updates on this? Do you still have this issue, and did pools fail with out-of-memory exceptions?
Hello,
No, the pools did not crash; out-of-memory errors have not been observed so far.
Nothing has changed with respect to what was already reported.
All the best, Carlos
If pools don't crash, what is the issue?
The issue is the large increase in memory usage after deploying 9.2. There is a 13% increase in memory, as initially reported at the beginning of the ticket. Since then, in some cases we have observed less than 3% free memory available to the system.
While the pools are not crashing, the memory used by dCache increased substantially after the upgrade. This is an example of the memory used on the server and the memory allocated to the process: https://github.com/dCache/dcache/issues/7507#issuecomment-1980971444
All the best, Carlos
Hi Carlos.
Following up from the Tier 1: is it fair to say that you would be satisfied with a specific reason from our side why the required pool memory has increased? And that the reason for your concern is a matter of storage budget, meaning that you want us to acknowledge that we know of the change? Of course it would be preferable to all if the memory increase went away, but the situation is difficult to analyse and is unlikely to be resolved with a quick fix.
Kind regards Lea
Hi Lea,
If the cause of the increase in memory usage is not known yet and is under review by you, this is fair enough; at least I know you are aware of it. If this is expected, then of course we will need to take it into account for future requisitions.
Currently, the criticality is not high as the service is not disrupted by out-of-memory errors.
All the best, Carlos
With these parameters
dcache.java.memory.heap=6144m
dcache.java.memory.direct=6144m
The maximum memory usage for a Java process should be the sum of the two, I think:
6144 MB (heap) + 6144 MB (direct) = 12288 MB = 12.288 GB
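For completeness, this sum only covers heap plus direct buffers; a rough, hedged accounting of a JVM process's resident memory also includes other regions:
RSS ≈ heap (≤ Xmx) + direct buffers (≤ MaxDirectMemorySize) + metaspace + threads × stack size + code cache + GC/JVM overhead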
One of our data servers shows the following per pool process:
Domain Memory usage
dc268_1-Domain 13.047431945800781 GB
dc268_10-Domain 3.1094741821289062 GB
dc268_11-Domain 1.9457817077636719 GB
dc268_12-Domain 3.3018569946289062 GB
dc268_2-Domain 12.515079498291016 GB
dc268_3-Domain 13.274280548095703 GB
dc268_4-Domain 12.742572784423828 GB
dc268_5-Domain 12.708694458007812 GB
dc268_6-Domain 12.807502746582031 GB
dc268_7-Domain 11.783977508544922 GB
dc268_8-Domain 12.699535369873047 GB
dc268_9-Domain 12.95401382446289 GB
In some cases, the memory usage exceeds the combined heap and direct memory settings. This is what Carlos reported; it was not the case before the 9.x release series.
The memory usage is important for resource provisioning; more memory seems to be needed for the next purchase.
This is more complicated, see: https://stackoverflow.com/questions/15282178/java-using-up-far-more-memory-than-allocated-with-xmx
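One way to see where memory beyond the heap goes is JDK Native Memory Tracking, sketched here; the flag and the jcmd subcommand are standard JDK 11 features, but passing the flag via dcache.java.options.extra is an assumption, and NMT adds some overhead:
# add to the domain's JVM options and restart the domain:
-XX:NativeMemoryTracking=summary
# then query the running pool process:
jcmd <pid> VM.native_memory summary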
You seem to be concerned about "what if the sum of memory used by all pools reaches 100%", and about how much memory you need to provision in the future.
On the first concern: can you allocate memory to the pools so that the sum is exactly the physical node memory size, then run like that and monitor swap usage? Do you have swap activity?
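For the swap-activity check, a standard tool suffices (vmstat is part of procps):
vmstat 5
# sustained non-zero values in the si/so (swap-in/swap-out) columns indicate active swapping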
Second, for instance at Fermilab, the pool nodes we buy have more memory than we ever use:
[root@pubstor2301 ~]# free
total used free shared buff/cache available
Mem: 263150780 36593124 2243884 26120 228000980 226557656
Swap: 4194300 26880 4167420
And when the pool nodes are bought we get "whatever is similar to what we have, but current gen."
Just a real quick look:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1134414 root 20 0 39.3g 6.6g 13416 S 250.0 2.6 46586:59 java
1134551 root 20 0 26.0g 5.1g 15412 S 181.2 2.0 13654:32 java
1134287 root 20 0 39.0g 6.3g 17628 S 12.5 2.5 434:35.21 java
1134348 root 20 0 18.6g 2.1g 18384 S 6.2 0.8 202:57.73 java
Each pool has:
dcache.java.memory.heap=4096m
dcache.java.memory.direct=4096m
Are you speaking about virtual memory or resident? As we can see above, RSS exceeds 4 GB by 50% in some cases. Does it create any problems? It does not look like it.
Memory has never been a concern on our installation. From time to time we run out of java.memory.heap due to the number of files, and we typically double java.memory.heap when this happens.
This is more complicated, see: https://stackoverflow.com/questions/15282178/java-using-up-far-more-memory-than-allocated-with-xmx
Thanks
You seem to be concerned about "what if the sum of memory used by all pools reaches 100%", and about how much memory you need to provision in the future.
Yes, the increase in memory usage with the latest release triggered some discussion about the configuration and the specification of the pool servers.
On the first concern: can you allocate memory to the pools so that the sum is exactly the physical node memory size, then run like that and monitor swap usage? Do you have swap activity?
Second, for instance at Fermilab, the pool nodes we buy have more memory than we ever use:
[root@pubstor2301 ~]# free
total used free shared buff/cache available
Mem: 263150780 36593124 2243884 26120 228000980 226557656
Swap: 4194300 26880 4167420
For the mentioned host, I have:
[root@dc268 ~]# free
total used free shared buff/cache available
Mem: 263009792 245740016 10360080 3972192 6909696 10684272
Swap: 8388604 8019584 369020
And when the pool nodes are bought we get "whatever is similar to what we have, but current gen."
Just a real quick look:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1134414 root 20 0 39.3g 6.6g 13416 S 250.0 2.6 46586:59 java
1134551 root 20 0 26.0g 5.1g 15412 S 181.2 2.0 13654:32 java
1134287 root 20 0 39.0g 6.3g 17628 S 12.5 2.5 434:35.21 java
1134348 root 20 0 18.6g 2.1g 18384 S 6.2 0.8 202:57.73 java
Each pool has:
dcache.java.memory.heap=4096m
dcache.java.memory.direct=4096m
Are you speaking about virtual memory or resident? As we can see above, RSS exceeds 4 GB by 50% in some cases. Does it create any problems? It does not look like it.
Correct. The pools run without errors.
Memory has never been a concern on our installation. From time to time we run out of java.memory.heap due to the number of files, and we typically double java.memory.heap when this happens.
I don't think we have observed this.
Also!
The numbers I quoted were for our public system, which runs 8.2! So even on 8.2, resident memory > sum(heap + direct).
On CMS, which runs 9.2.11 (plus some patches; I do not always wait for official releases):
[root@cmsstor925 ~]# free
total used free shared buff/cache available
Mem: 195994608 44024360 1993640 5008 152738004 151970248
Swap: 4194300 0 4194300
DOMAIN STATUS PID USER LOG
w-cmsstor925-disk-disk1Domain running (for 4 weeks) 25630 root /var/log/dcache/w-cmsstor925-disk-disk1Domain.log
w-cmsstor925-disk-disk2Domain running (for 4 weeks) 25640 root /var/log/dcache/w-cmsstor925-disk-disk2Domain.log
w-cmsstor925-disk-disk3Domain running (for 4 weeks) 25651 root /var/log/dcache/w-cmsstor925-disk-disk3Domain.log
w-cmsstor925-disk-disk4Domain running (for 4 weeks) 25683 root /var/log/dcache/w-cmsstor925-disk-disk4Domain.log
[root@cmsstor925 ~]# ps -o pid,vsz,rss,size -p 25630,25640,25651,25683
PID VSZ RSS SIZE
25630 28873688 9177608 10473540
25640 28537760 8512420 9990132
25651 28563492 7310116 8837064
25683 28612632 8741532 10224020
dcache.java.memory.heap=5096m
dcache.java.memory.direct=5096m
All nodes (on public, in the previous message) and this one run Alma9. So, as you can see, 8.2 and 9.2 exhibit the same pattern. Is it a problem? I do not think so.
P.S.: What strikes me as curious: CMS shows 0 swap activity, as it should. Why public shows >0 swap, I have no idea. It should not be swapping at all.
@DmitryLitvintsev thank you for your feedback. What is the Java version running there?
@vingar
you quoted:
[root@dc268 ~]# free
total used free shared buff/cache available
Mem: 263009792 245740016 10360080 3972192 6909696 10684272
Swap: 8388604 8019584 369020
This actually does not look right to me. 234 GB used out of 256 (in our case, only 34 GB is used out of 250). The sum of the memory sizes of the pool processes is 122.89 GB. Where did the other 110 GB go?
In our case, 34 GB is roughly the sum of the Java processes' resident sizes.
@DmitryLitvintsev thank you for your feedback. What is the Java version running there?
[root@cmsstor925 ~]# java -version
openjdk version "11.0.23" 2024-04-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.23.0.9-2) (build 11.0.23+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.23.0.9-2) (build 11.0.23+9-LTS, mixed mode, sharing)
[root@cmsstor925 ~]#
and in Public:
[root@pubstor2301 ~]# java -version
openjdk version "11.0.22" 2024-01-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS, mixed mode, sharing)
[root@pubstor2301 ~]#
This image shows a server with dCache running and then stopped.
Our Java version is the same:
[root@dc264 ~]# java -version
openjdk version "11.0.22" 2024-01-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS, mixed mode, sharing)
It seems that after the upgrade the system is using the full dcache.java.memory.heap=6144m and dcache.java.memory.direct=6144m,
which is reported as an aggregate of 12.2 GB.
I looked into other, less loaded systems running 9.2; there, RES is 2.7g with dcache.java.memory.heap=2048m and dcache.java.memory.direct=2048m:
root 8782 1.2 0.5 16994924 2846512 ? Ssl Jun02 435:30 /usr/bin/java -server -Xmx2048m -XX:MaxDirectMemorySize=2048m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,50000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33315 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/dcqos002nineDomain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.9.1.jar -Djava.net.preferIPv6Addresses=system --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.text=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED --add-opens=java.base/java.math=ALL-UNNAMED --add-opens java.base/sun.nio.fs=ALL-UNNAMED -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start dcqos002nineDomain
[root@dcqos002 ~]# top -p 8782
top - 20:33:22 up 568 days, 5:53, 23 users, load average: 0.39, 0.33, 0.46
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.2 sy, 0.0 ni, 99.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52808425+total, 1738380 free, 76338176 used, 45000768+buff/cache
KiB Swap: 8388604 total, 7877304 free, 511300 used. 45033398+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8782 root 20 0 16.2g 2.7g 23628 S 0.7 0.5 435:31.31 java
Even though the memory allowed is 2 GB for heap and 2 GB for direct, the system does not seem to use it all.
In the example reported here, the system is using the full 12.2 GB; what is not clear is why, before the upgrade, it did not have to use it all.
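One way to distinguish committed-but-idle heap from heap actually in use, as a sketch (jcmd ships with the JDK); note that with JDK 11's default G1 settings, heap that has grown under load is rarely returned to the OS, so busy pools tend to plateau at -Xmx:
jcmd <pid> GC.heap_info   # used vs committed heap regions
jcmd <pid> VM.flags       # effective -Xmx and related flags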
@DmitryLitvintsev you got a good point there. Where does the rest of the memory go...
# ps -eo comm,rss | awk '{array[$1]+=$2} END {for (i in array) print i, array[i]/1024 " MB"}' | sort -k2 -nr
java 125745 MB
pmdaproc 461.539 MB
tuned 87.9258 MB
systemd-journal 73.4453 MB
pmdalinux 70.8945 MB
..
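One caveat on summing RSS per command name: shared pages are counted once per process, so the java total can overstate real usage. A PSS-based view avoids this, assuming smem is installed:
smem -t -P java   # the PSS column splits shared page costs proportionally between processes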
I'll follow up at BNL. Thanks for spotting this.