[Bug] - Degraded performance on AL2023 compared to AL2
Describe the bug: CPU usage increases on AL2023.
We are in the process of migrating to AL2023.
InstanceType: c8g.48xlarge. In these graphs you can see the increase in CPU and latency; the other servers are running AL2.
Kernel 6.1:
Kernel 6.12:
All servers are set up with Salt, and the only change we have made to the Salt code is to enable provisioning of AL2023.
Our application runs on Java. The Java version has not been changed.
What more information do you need from us?
The performance regression is driven by NUMA auto-balancing changes in the AL2023 kernels (6.1 and 6.12). On AL2 the JVMs stay mostly node-local, with low LLC misses and better performance. On AL2023, aggressive NUMA balancing spreads the workload across nodes, leading to remote memory accesses, cache-miss storms and slower performance.
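For context, this is how we check the host's NUMA layout and the current auto-balancing setting (standard tools, nothing custom on our side):

```
# Show NUMA nodes with their CPUs and memory sizes
numactl --hardware

# 1 = kernel NUMA auto-balancing enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing

# Host-wide per-node allocation and miss counters
numastat
```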
We will try adding -XX:+UseNUMA to our Java processes. We run four similarly sized Java processes and one smaller one: 75 GB, 70 GB, 80 GB, 65 GB and 6.5 GB.
I have tried to:

- Disable NUMA balancing system-wide:

  echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
  sudo sysctl --system

  Verify with cat /proc/sys/kernel/numa_balancing; it should return 0.

- Restart the whole EC2 instance with all five JVMs running with -XX:+UseNUMA -XX:+UseTransparentHugePages.

- Verify the JVM NUMA policy, i.e. check whether the heap is now biased towards a single node: numastat -p
But I only see an effect of UseNUMA on the smallest JVM process; all the others are split ~50/50 between NUMA node 0 and node 1.
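If I understand the JVM's NUMA support correctly (we run the default G1 collector), -XX:+UseNUMA splits the heap evenly across the nodes and only keeps allocating threads node-local, so the ~50/50 split may even be expected. As an alternative we are considering binding each large JVM to a single node with numactl; a rough sketch, where the node numbers are placeholders and the java command lines are shortened (the full ones are listed below):

```
# Bind one large JVM's CPUs and memory to node 0, another to node 1
# (hypothetical layout; "..." stands for the full argument lists below)
numactl --cpunodebind=0 --membind=0 \
  /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xms75161927680 -Xmx75161927680 ... &

numactl --cpunodebind=1 --membind=1 \
  /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xms85899345920 -Xmx85899345920 ... &

# numastat -p <pid> should then show nearly all Private memory on the bound node
```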
We are using /opt/openjdk/corretto-22.0.2.9.1 and we pre-allocate most memory:

# ps -ef | grep java
oAFD3 13985 1 99 13:04 ? 05:10:27 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx75161927680 -Xms75161927680 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eu3-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/oAFD3/prod -nodetach -useTdml -noConsoleLog
o57A2 13986 1 99 13:04 ? 05:48:36 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx85899345920 -Xms85899345920 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eu4-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o57A2/prod -nodetach -useTdml -noConsoleLog
oDF7C 13987 1 99 13:04 ? 05:13:53 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx69793218560 -Xms69793218560 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-seu,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/oDF7C/prod -nodetach -useTdml -noConsoleLog
o2F54 13988 1 99 13:04 ? 01:01:04 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx6979321856 -Xms6979321856 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-weekday-e4,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+HeapDumpOnOutOfMemoryError -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o2F54/prod -nodetach -useTdml -noConsoleLog
o5776 13989 1 99 13:04 ? 05:59:20 /opt/openjdk/corretto-22.0.2.9.1/bin/java -Xmx80530636800 -Xms80530636800 -Dlogback.configurationFile=conf/logback.xml -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Xlog:gc*:log/gc.log::filecount=10,filesize=5M -XX:+UseTransparentHugePages -javaagent:/opt/apptus/otel/otel-javaagent.jar -Dotel.resource.attributes=service.name=quattro,application.name=elevate,tenant.purpose=prod,tenant.owner.type=customer,tenant.org=hmgroup-hm-eeu,tenant.env=prod,realm=main,version=null,team=rp -Dotel.exporter.otlp.endpoint=http://localhost:4317 -Dotel.exporter.otlp.protocol=grpc -XX:+AlwaysPreTouch -Djdk.xml.entityExpansionLimit=64000 -Djdk.xml.totalEntitySizeLimit=50000000 -Djdk.xml.maxGeneralEntitySizeLimit=0 -Djdk.xml.maxParameterEntitySizeLimit=1000000 -Djdk.xml.entityReplacementLimit=3000000 -Djdk.xml.maxElementDepth=0 -Djdk.xml.elementAttributeLimit=10000 -Djavax.net.ssl.keyStore=/opt/apptus/certificate/graylog-client.jks -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.keyStoreType=JKS -XX:+UseNUMA -cp /opt/apptus/esales-server/esales-server-4.1.2708-jar-with-dependencies.jar com.apptus.esales.query_processor.Start -d /opt/apptus/organizations/o5776/prod -nodetach -useTdml -noConsoleLog
cat */11_numastat.txt
numastat -p 13985
Per-node process memory usage (in MBs) for PID 13985 (java)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.16            0.00            0.16
Stack               0.03            0.00            0.03
Private         38220.34        37535.79        75756.14
Total           38220.53        37535.79        75756.32
numastat -p 13986
Per-node process memory usage (in MBs) for PID 13986 (java)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.16            0.00            0.16
Stack               0.03            0.00            0.03
Private         44061.82        42618.19        86680.01
Total           44062.01        42618.19        86680.20
numastat -p 13987
Per-node process memory usage (in MBs) for PID 13987 (java)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.16            0.00            0.16
Stack               0.03            0.00            0.03
Private         35751.52        35319.43        71070.94
Total           35751.70        35319.43        71071.13
numastat -p 13988
Per-node process memory usage (in MBs) for PID 13988 (java)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00            0.16            0.16
Stack               0.00            0.03            0.03
Private          4758.18         4641.52         9399.71
Total            4758.18         4641.71         9399.89
numastat -p 13989
Per-node process memory usage (in MBs) for PID 13989 (java)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00            0.16            0.16
Stack               0.00            0.03            0.03
Private         40298.90        41070.80        81369.70
Total           40298.90        41070.98        81369.88
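(For reference, output like the above can be collected for all JVMs at once with a small loop; this is just an illustrative sketch, not the actual script that produced the 11_numastat.txt files:)

```
# Dump per-node memory usage for every running JVM on the host
for pid in $(pgrep -f 'bin/java'); do
  numastat -p "$pid"
done
```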
cat */12_perf_mem.txt
Performance counter stats for process id '13985':
   741,805,209,598      cycles
   684,431,750,205      instructions              #    0.92  insn per cycle
     7,958,849,218      LLC-loads
     7,340,093,294      LLC-load-misses           #   92.23% of all LL-cache accesses
10.021656850 seconds time elapsed
Performance counter stats for process id '13986':
   468,274,400,420      cycles
   488,702,924,430      instructions              #    1.04  insn per cycle
     5,655,949,345      LLC-loads
     5,146,972,934      LLC-load-misses           #   91.00% of all LL-cache accesses
10.023732665 seconds time elapsed
Performance counter stats for process id '13987':
   650,123,353,775      cycles
   669,313,457,515      instructions              #    1.03  insn per cycle
     6,634,733,132      LLC-loads
     6,086,817,574      LLC-load-misses           #   91.74% of all LL-cache accesses
10.022433595 seconds time elapsed
Performance counter stats for process id '13988':
    71,974,834,667      cycles
   114,500,911,595      instructions              #    1.59  insn per cycle
       574,498,514      LLC-loads
       491,135,505      LLC-load-misses           #   85.49% of all LL-cache accesses
10.017295038 seconds time elapsed
Performance counter stats for process id '13989':
   782,817,025,255      cycles
   742,174,296,220      instructions              #    0.95  insn per cycle
     7,080,411,442      LLC-loads
     6,356,678,320      LLC-load-misses           #   89.78% of all LL-cache accesses
10.023307032 seconds time elapsed
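(The counters above were collected with perf stat per PID over a ~10 second window; the invocation below is a reconstruction of roughly what we ran, not copied verbatim from our scripts:)

```
# Sample CPU and last-level cache counters for one JVM for ~10 seconds
# (<PID> is a placeholder for the java process id)
perf stat -e cycles,instructions,LLC-loads,LLC-load-misses -p <PID> -- sleep 10
```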
Please advise how to configure the OS and JVM so that they behave at least as well as on AL2, preferably better.
Should we try using smaller instances with fewer than 192 vCPUs? Or should we try running m8g.48xlarge or even r8g.48xlarge to give the Java processes more memory to spread into?
This is a blocker for us; I would appreciate it if you could come back with an answer. We are currently preparing for the increased load during the Black Friday month and the winter season. We also want to move away from a single root account to several accounts, as recommended in the AWS Well-Architected Framework, but we won't do that until we have a proper solution.
Br, Ulf
Thanks for reporting. I'll try to reproduce this on our side. Can you share a bit about what your java application is doing so that I can come up with a local reproducer?
Sorry for the late reply; we have been occupied.
It is a Java application serving as a back-end for e-commerce shops. Queries and imports are sent using HTTP requests. Our API has endpoints for:
- storefront requests: Queries (Auto-complete, Search, Navigation, Landing Page/Product listings, Product Page, Cart Page etc.) and Notifications (Click, Add-to-cart etc.)
- admin requests: Imports (Catalog, Navigation, Pages etc.) and Exports

The API connects to a cluster with multiple nodes distributed over multiple hosts in multiple availability zones. The application we have issues with is the "query processor". It is developed in-house, uses our own algorithms with data structures held in memory, and must respond to all queries within milliseconds. We have customers with heaps up to 380 GB.

On a host serving such Java applications (the query processors) for multiple clusters, we have no issues running on Amazon Linux 2, but when we change to Amazon Linux 2023 the CPU usage is doubled. Example, using a c8g.48xlarge EC2 host:

- "Cluster1", 75 GB heap, ca 500 QPS
- "Cluster2", 80 GB heap, ca 300 QPS
- "Cluster3", 70 GB heap, ca 200 QPS
- "Cluster4", 65 GB heap, ca 800 QPS
- "Cluster5", 7 GB heap, ca 130 QPS

In parallel with the above queries, imports run every five minutes. In this scenario the AL2023 host (kernel 6.1) uses ~60% CPU when no updates are running and ~80% during updates. Another host in the same cluster, receiving the same traffic and the same updates, uses ~30% CPU when no updates are running and ~40% during updates.
Understood. Is the concern right now only the increased CPU usage, or are you already seeing latency issues as well?
I've so far managed to simulate a Java application with memory allocations similar to your setup. I'm running random memory accesses across the heap area for a while. I'm seeing NUMA balancing being turned off in kernel 6.1, but apart from that not seeing differences in CPU utilization, so I'm probably lacking something important.
I see you ran an experiment with kernel.numa_balancing=0 above, but that would already be the default setting for 6.1. Can you by chance try with kernel.numa_balancing=1?
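Something along these lines should do it, reusing the same sysctl.d file from your earlier experiment (just a sketch):

```
# Enable kernel NUMA auto-balancing immediately
sudo sysctl -w kernel.numa_balancing=1

# Make it persistent across reboots
echo "kernel.numa_balancing=1" | sudo tee /etc/sysctl.d/99-numa.conf
sudo sysctl --system

# Verify; should now print 1
cat /proc/sys/kernel/numa_balancing
```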
@UlfBlom is currently on vacation; he has been fiddling the most with the kernel settings. But I think numa_balancing=1 is the default, and since Ulf tried to set it to 0, I suspect we had it from the start.
Yes, it's the default on AL2 kernels until 5.10. The default has changed to 0 in kernel 5.15 and later.