
Performance issues with message logging


Version: 7.0.1
Environment: Kubernetes (AWS EKS)
External DB: AWS RDS Aurora PostgreSQL

Recently we started noticing serious performance issues with the message logging functionality, where the xroad-proxy process continuously crashes with an OOM error:

2022-07-07T00:35:44.889Z INFO [xroad-proxy] [Proxy-akka.actor.default-dispatcher-9] e.r.xroad.proxy.messagelog.TaskQueue - Start time-stamping 1 message records
...
2022-07-07T00:35:50.285Z INFO [xroad-proxy] [Proxy-akka.actor.default-dispatcher-9] e.r.x.p.m.AbstractTimestampRequest - tsresponse org.bouncycastle.tsp.TimeStampResponse@1d72e085
2022-07-07T00:35:52.963Z ERROR [xroad-proxy] [Proxy-akka.actor.default-dispatcher-26] akka.actor.ActorSystemImpl - Uncaught error from thread [Proxy-akka.actor.default-dispatcher-9]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Proxy]
  java.lang.OutOfMemoryError: Java heap space
  	at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
  	at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
  	at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:686)
2022-07-07T00:35:52.979Z INFO [xroad-proxy] [Thread-3] ee.ria.xroad.proxy.ProxyMain - Proxy shutting down...

We increased the xroad-proxy heap up to 2G to no avail. After further investigation, we noticed that the message timestamping and related DB queries to the logrecord table were taking up considerable time:

[screenshot of query timings omitted]

We're using the default values for retention and message body logging (30 days and true). The approximate row count in the logrecord table was ~127k. Increasing the DB cluster size or the xroad-proxy heap didn't really help. What did eventually help was truncating the message log, which resulted in:

  • Timestamping times dropped from ~20 secs to < 100ms
  • Proxy heap usage dropped considerably
  • No more OOMs
  • Database connectivity, DB load and general performance hugely improved

Prior to truncating the table, the system was virtually inoperable.
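
To be concrete, "truncating the message log" meant clearing the logrecord table in the messagelog database. A simplified sketch of what that amounts to (the database and role names below are the defaults and may differ per setup, and any message log archiving obligations should be considered before dropping records):

# Simplified sketch of the cleanup; default database and role names assumed.
# Consider archiving requirements before removing records.
psql -h <aurora-endpoint> -U messagelog -d messagelog \
  -c "TRUNCATE TABLE logrecord;"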

Any idea why the timestamping queries took so long? Is there an issue with the indices on the table? How do we turn off message logging completely? We already reduced retention to 1 day and disabled body logging, which hugely improved performance.
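
For reference, the retention and body logging overrides we applied are roughly along these lines. This is a sketch assuming the standard /etc/xroad/conf.d/local.ini override file and the [message-log] parameter names from the system parameters guide; verify them for your version:

# Sketch: reduce message log retention and disable body logging.
# Parameter names assumed from the X-Road system parameters guide.
cat >> /etc/xroad/conf.d/local.ini <<'EOF'
[message-log]
keep-records-for = 1
message-body-logging = false
EOF
# restart xroad-proxy afterwards for the overrides to take effect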

Relates to #1292.

autero1 avatar Jul 07 '22 07:07 autero1

Hello @autero1,

I investigated this a bit. I think there are two separate issues:

  1. The LOGRECORD table does not have an index suitable for the query composed here https://github.com/nordic-institute/X-Road/blob/6c04d5258b836bffc3175217c9cf1295540cff93/src/addons/messagelog/messagelog-addon/src/main/java/ee/ria/xroad/proxy/messagelog/LogRecordManager.java#L172-L174 . Currently only the logrecordpk primary key constraint is used. This may affect performance, although with only ~127,000 rows it shouldn't matter much (see the EXPLAIN sketch below).
  2. The messagelog-addon timestamping is implemented synchronously: Scala's Await.result is used, which blocks a thread until the timestamp provider (xroad-signer) responds. https://github.com/nordic-institute/X-Road/blob/28d16c08d44cc39dcf8b72334a9dcdba035c11d4/src/addons/messagelog/messagelog-addon/src/main/java/ee/ria/xroad/proxy/messagelog/LogManager.java#L206
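
If you want to verify point 1 on your side, one option is to EXPLAIN a comparable query against the messagelog database. The sketch below is illustrative only: the predicate is my assumption of what the linked criteria query roughly translates to (message records not yet timestamped), and the database, role and column names should be checked against your installation before drawing conclusions:

# Illustrative check only - not the exact SQL that the proxy generates.
# Assumes the default 'messagelog' database/role and guessed column names.
psql -h <aurora-endpoint> -U messagelog -d messagelog <<'SQL'
\d logrecord
EXPLAIN ANALYZE
SELECT id
FROM logrecord
WHERE discriminator = 'm'
  AND timestamprecordid IS NULL;
SQL

If the plan shows a sequential scan over the whole table, an index covering those columns could help, although with ~127k rows a scan alone would hardly explain 20-second timestamping times.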

Regarding point 2, how much traffic is going through the particular Security Server?

The DB update execution should not cause the heap to grow. That said, the stack trace you provided does look odd. I'd rather suspect that waiting for the response from the signer blocks so many threads that it impairs the proxy's processing.

vellotis avatar Jul 07 '22 11:07 vellotis

Regarding point 2, how much traffic is going through the particular Security Server?

The ~127k rows in the logrecord table cover 30 days, so that's fairly indicative of the traffic. Not a very high volume - just a few thousand requests per day.

On a side note, the servers' memory usage and DB connections all stabilized after truncating the table. Memory usage for the k8s Pod dropped by at least 50%. Go figure 🤷

autero1 avatar Jul 08 '22 04:07 autero1

One way to disable message logging completely is to use the slim variant of the Security Server Sidecar Docker image. In your case, the correct variant would be 7.1.0-slim-fi, or 7.1.0-slim-primary-fi / 7.1.0-slim-secondary-fi if you're running a Security Server cluster. The differences between the variants are described on Docker Hub in the available images section. Disabling message logging should have a quite significant effect on performance. When message logging is disabled, timestamping is disabled as well.

The downside of the slim image variant is that it doesn't include environmental or operational monitoring either, which makes collecting service usage statistics pretty challenging. The 7.1.0-fi variant of the Sidecar includes environmental monitoring, operational monitoring and the message log. Another alternative would be to use the 7.1.0-fi variant and disable the message log add-on manually. This can be done by renaming the /usr/share/xroad/jlib/addon/proxy/messagelog.conf file and restarting the xroad-proxy service inside the container. However, the downside is that the change has to be repeated every time a new container is created.

mv /usr/share/xroad/jlib/addon/proxy/messagelog.conf /usr/share/xroad/jlib/addon/proxy/messagelog.conf.disabled
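
If you want to avoid repeating that step manually, one option is to bake the rename into a small derived image and reference that from your Pod spec instead of the stock image. A sketch, assuming the standard Sidecar repository on Docker Hub (adjust the tag and your registry name as needed):

# Sketch: build a derived Sidecar image with the messagelog add-on disabled.
# Repository name and tag are assumptions - adjust to your registry/version.
cat > Dockerfile <<'EOF'
FROM niis/xroad-security-server-sidecar:7.1.0-fi
RUN mv /usr/share/xroad/jlib/addon/proxy/messagelog.conf \
       /usr/share/xroad/jlib/addon/proxy/messagelog.conf.disabled
EOF
docker build -t <your-registry>/xroad-sidecar:7.1.0-fi-no-messagelog .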

By the way, in the -fi image variants message body logging is set to false by default. If it was true in your configuration by default, you're probably using the vanilla variant of the image. For Finnish Palveluväylä users it's strongly recommended to use the -fi variants, since they contain some Palveluväylä-specific overrides to the default configuration.

petkivim avatar Jul 14 '22 06:07 petkivim