
[Bug] - Degraded Performance on AL2023 compared to AL2

Open · m4djack opened this issue 4 months ago · 7 comments

Describe the bug
CPU increase on AL2023.

We are in the process of migrating to AL2023.

InstanceType: c8g.48xlarge. In these graphs you can see the increase in CPU and latency; the other servers are running AL2.

Kernel 6.1:

(CPU and latency graphs, kernel 6.1)

Kernel 6.12:

(CPU and latency graphs, kernel 6.12)

All servers are set up with Salt, and the only change we have made to the Salt code is what was needed to provision AL2023.

Our application runs on Java. The Java version has not been changed.

What more information do you need from us?

m4djack avatar Sep 16 '25 15:09 m4djack

The performance regression is driven by NUMA auto-balancing changes in AL2023 kernels (6.1 & 6.12). AL2 → JVMs mostly local, low LLC misses, better perf. AL2023 → aggressive NUMA balancing spreads workload → remote memory, cache miss storms, slower perf.

We will try adding -XX:+UseNUMA to our Java processes. We run four similarly sized Java processes and one smaller one: 75 GB, 70 GB, 80 GB, 65 GB and 6.5 GB.
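For reference, a quick way to confirm the topology and what the balancer is actually doing on a given host (a minimal sketch using standard kernel interfaces; some counters only exist if the kernel was built with NUMA balancing support):

  numactl --hardware                    # node count and per-node memory
  cat /proc/sys/kernel/numa_balancing   # 0 = automatic NUMA balancing off, 1 = on
  grep -E 'numa_hint_faults|pgmigrate' /proc/vmstat   # hinting faults and pages migrated so far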

UlfBlom avatar Sep 17 '25 07:09 UlfBlom

I have tried to:

  1. Disable NUMA balancing system-wide:
     echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
     sudo sysctl --system
     Verify: cat /proc/sys/kernel/numa_balancing should return 0.

  2. Restart the whole EC2 instance with all five JVMs started with -XX:+UseNUMA -XX:+UseTransparentHugePages.

  3. Verify the JVM NUMA policy, i.e. check whether each heap is now biased towards a single node: numastat -p

But I only see an effect of UseNUMA in the smallest JVM process; all the others are split ~50/50 between NUMA node 0 and node 1.
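Since -XX:+UseNUMA only seems to bias the smallest heap, one alternative experiment would be to pin each JVM to a single node explicitly with numactl. This is a rough sketch only, assuming the two nodes visible in numastat below; the node assignment is just an illustration of how the five heaps could be partitioned, and the java arguments are abbreviated:

  # example split: the 75 GB and 80 GB heaps on node 0 (~155 GB), the 70 GB, 65 GB and 6.5 GB heaps on node 1 (~142 GB)
  numactl --cpunodebind=0 --membind=0 /opt/openjdk/corretto-22.0.2.9.1/bin/java <flags for the 75 GB process> &
  numactl --cpunodebind=0 --membind=0 /opt/openjdk/corretto-22.0.2.9.1/bin/java <flags for the 80 GB process> &
  numactl --cpunodebind=1 --membind=1 /opt/openjdk/corretto-22.0.2.9.1/bin/java <flags for the 70 GB process> &
  numactl --cpunodebind=1 --membind=1 /opt/openjdk/corretto-22.0.2.9.1/bin/java <flags for the 65 GB process> &
  numactl --cpunodebind=1 --membind=1 /opt/openjdk/corretto-22.0.2.9.1/bin/java <flags for the 6.5 GB process> &

The trade-off is that each process would then be limited to half the vCPUs, so this would be a diagnostic experiment rather than a final configuration.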

We are using /opt/openjdk/corretto-22.0.2.9.1 and we pre-allocate most memory:

]# ps -ef | grep java

oAFD3 13985 1 99 13:04 ? 05:10:27 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx75161927680 -Xms75161927680 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eu3-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/oAFD3/prod -nodetach -useTdml -noConsoleLog

o57A2 13986 1 99 13:04 ? 05:48:36 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx85899345920 -Xms85899345920 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eu4-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o57A2/prod -nodetach -useTdml -noConsoleLog

oDF7C 13987 1 99 13:04 ? 05:13:53 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx69793218560 -Xms69793218560 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-seu,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/oDF7C/prod -nodetach -useTdml -noConsoleLog

o2F54 13988 1 99 13:04 ? 01:01:04 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx6979321856 -Xms6979321856 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-weekday-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+HeapDumpOnOutOfMemoryError -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o2F54/prod -nodetach -useTdml -noConsoleLog

o5776 13989 1 99 13:04 ? 05:59:20 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx80530636800 -Xms80530636800 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eeu,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o5776/prod -nodetach -useTdml -noConsoleLog
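Since the processes run with -XX:+UseTransparentHugePages and -XX:+AlwaysPreTouch, it may also be worth confirming that transparent huge pages behave the same on both AMIs (a small sketch; PID 13985 is taken from the listing above):

  cat /sys/kernel/mm/transparent_hugepage/enabled   # [always] / madvise / never
  grep AnonHugePages /proc/13985/smaps_rollup       # how much of this JVM's memory is THP-backed
  grep AnonHugePages /proc/meminfo                  # system-wide THP usage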

cat */11_numastat.txt

numastat -p 13985

Per-node process memory usage (in MBs) for PID 13985 (java)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                 0.16            0.00            0.16
Stack                0.03            0.00            0.03
Private          38220.34        37535.79        75756.14
Total            38220.53        37535.79        75756.32

numastat -p 13986

Per-node process memory usage (in MBs) for PID 13986 (java)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                 0.16            0.00            0.16
Stack                0.03            0.00            0.03
Private          44061.82        42618.19        86680.01
Total            44062.01        42618.19        86680.20

numastat -p 13987

Per-node process memory usage (in MBs) for PID 13987 (java)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                 0.16            0.00            0.16
Stack                0.03            0.00            0.03
Private          35751.52        35319.43        71070.94
Total            35751.70        35319.43        71071.13

numastat -p 13988

Per-node process memory usage (in MBs) for PID 13988 (java)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                 0.00            0.16            0.16
Stack                0.00            0.03            0.03
Private           4758.18         4641.52         9399.71
Total             4758.18         4641.71         9399.89

numastat -p 13989

Per-node process memory usage (in MBs) for PID 13989 (java)
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
Huge                 0.00            0.00            0.00
Heap                 0.00            0.16            0.16
Stack                0.00            0.03            0.03
Private          40298.90        41070.80        81369.70
Total            40298.90        41070.98        81369.88
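So each of the large processes ends up split almost exactly in half, e.g. PID 13985 has 38220.53 / 75756.32 ≈ 50.5% of its memory on node 0. A one-liner to print that share for all five PIDs (a sketch that assumes a two-node system and the PIDs from the listing above):

  for pid in 13985 13986 13987 13988 13989; do
    numastat -p "$pid" | awk -v p="$pid" '/^Total/ {printf "%s: %.1f%% on node 0\n", p, 100*$2/$4}'
  done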

cat */12_perf_mem.txt

Performance counter stats for process id '13985':

   741,805,209,598      cycles
   684,431,750,205      instructions              #    0.92  insn per cycle
     7,958,849,218      LLC-loads
     7,340,093,294      LLC-load-misses           #   92.23% of all LL-cache accesses

  10.021656850 seconds time elapsed

Performance counter stats for process id '13986':

   468,274,400,420      cycles
   488,702,924,430      instructions              #    1.04  insn per cycle
     5,655,949,345      LLC-loads
     5,146,972,934      LLC-load-misses           #   91.00% of all LL-cache accesses

  10.023732665 seconds time elapsed

Performance counter stats for process id '13987':

   650,123,353,775      cycles
   669,313,457,515      instructions              #    1.03  insn per cycle
     6,634,733,132      LLC-loads
     6,086,817,574      LLC-load-misses           #   91.74% of all LL-cache accesses

  10.022433595 seconds time elapsed

Performance counter stats for process id '13988':

    71,974,834,667      cycles
   114,500,911,595      instructions              #    1.59  insn per cycle
       574,498,514      LLC-loads
       491,135,505      LLC-load-misses           #   85.49% of all LL-cache accesses

  10.017295038 seconds time elapsed

Performance counter stats for process id '13989':

   782,817,025,255      cycles
   742,174,296,220      instructions              #    0.95  insn per cycle
     7,080,411,442      LLC-loads
     6,356,678,320      LLC-load-misses           #   89.78% of all LL-cache accesses

  10.023307032 seconds time elapsed
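The big-heap processes all show close to 90% LLC-load-misses at around one instruction per cycle, so they look memory-bound. To see whether automatic NUMA balancing is actively migrating pages during such a 10-second window, the kernel counters can be sampled alongside perf (a sketch using the standard /proc/vmstat counters):

  grep -E 'numa_|pgmigrate' /proc/vmstat > vmstat_before.txt
  sleep 10
  grep -E 'numa_|pgmigrate' /proc/vmstat > vmstat_after.txt
  diff vmstat_before.txt vmstat_after.txt   # non-zero deltas in numa_pages_migrated / pgmigrate_success mean the balancer is moving memory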

Please advise how to configure the OS and JVM so that they behave at least as well as on AL2, preferably better.

Should we try smaller instances with fewer vCPUs than 192? Or should we try m8g.48xlarge or even r8g.48xlarge, to give the Java processes more memory to spread into?

This is a blocker for us, and I would appreciate it if you could come back with an answer. We are currently preparing for the increased load during Black Month and the winter season. We also want to move away from a single root account to several accounts, as in the AWS Well-Architected Framework, but we don't want to do that until we have a proper solution.

Br, Ulf

UlfBlom avatar Sep 18 '25 14:09 UlfBlom

Thanks for reporting. I'll try to reproduce this on our side. Can you share a bit about what your java application is doing so that I can come up with a local reproducer?

bjoernd avatar Sep 19 '25 05:09 bjoernd

Sorry for the late reply, we have been occupied.

It is a Java application serving as a back-end for eCommerce shops. Queries and imports are sent using HTTP requests. Our API has endpoints for:

  • storefront requests: Queries (Auto-complete, Search, Navigation, Landing Page/Product listings, Product Page, Cart Page etc.) and Notifications (Click, Add-to-cart etc.)
  • admin requests: Imports (Catalog, Navigation, Pages etc.) and Exports

The API connects to a cluster with multiple nodes distributed over multiple hosts in multiple availability zones. The application we have issues with is the "query processor". It is developed in-house, uses our own algorithms with data structures held in memory, and must respond to all queries within milliseconds. We have customers with heaps up to 380 GB. On a host serving such Java applications (the query processors) for multiple clusters, we have no issues running on Amazon Linux 2, but when we change to Amazon Linux 2023 the CPU usage is doubled. Example, using a c8g.48xlarge EC2 host:

  • "Cluster1", 75 GB heap, ca 500 QPS
  • "Cluster2", 80 GB heap, ca 300 QPS
  • "Cluster3", 70 GB heap, ca 200 QPS
  • "Cluster4", 65 GB heap, ca 800 QPS
  • "Cluster5", 7 GB heap, ca 130 QPS

In parallel with these queries, imports run every 5 minutes. In this scenario the AL2023 (kernel 6.1) host uses ~60% CPU when no updates are running and ~80% during updates. Another host in the same cluster, receiving the same traffic and the same updates, uses ~30% CPU when no updates are running and ~40% during updates.

m4djack avatar Sep 30 '25 10:09 m4djack

Understood. Is the concern right now only the increased CPU usage, or are you already seeing latency issues as well?

I've so far managed to simulate a Java application with memory allocations similar to your setup, running random memory accesses across the heap area for a while. I see NUMA balancing being turned off in kernel 6.1, but apart from that I'm not seeing differences in CPU utilization, so I'm probably missing something important.

I see you ran an experiment with kernel.numa_balancing=0 above, but that would already be the default setting for 6.1. Can you by chance try with kernel.numa_balancing=1?
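For a quick test it can be toggled at runtime without touching sysctl.d (sketch; the setting reverts on reboot):

  sudo sysctl kernel.numa_balancing=1
  cat /proc/sys/kernel/numa_balancing   # should now print 1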

bjoernd avatar Sep 30 '25 19:09 bjoernd

@UlfBlom is currently on vacation; he has been fiddling the most with the kernel settings. But I think numa_balancing=1 is the default, and since Ulf tried to set it to 0, I suspect we had it from the start.

m4djack avatar Oct 01 '25 08:10 m4djack

Yes, it's the default on AL2 kernels up to 5.10. The default changed to 0 in kernel 5.15 and later.

bjoernd avatar Oct 01 '25 09:10 bjoernd