Performance issues with message logging
- Version: 7.0.1
- Environment: Kubernetes (AWS EKS)
- External DB: AWS RDS Aurora PostgreSQL
Recently we started noticing serious performance issues with the message logging functionality, where the `xroad-proxy` process continuously crashes with an OOM error:
```
2022-07-07T00:35:44.889Z INFO [xroad-proxy] [Proxy-akka.actor.default-dispatcher-9] e.r.xroad.proxy.messagelog.TaskQueue - Start time-stamping 1 message records
...
2022-07-07T00:35:50.285Z INFO [xroad-proxy] [Proxy-akka.actor.default-dispatcher-9] e.r.x.p.m.AbstractTimestampRequest - tsresponse org.bouncycastle.tsp.TimeStampResponse@1d72e085
2022-07-07T00:35:52.963Z ERROR [xroad-proxy] [Proxy-akka.actor.default-dispatcher-26] akka.actor.ActorSystemImpl - Uncaught error from thread [Proxy-akka.actor.default-dispatcher-9]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Proxy]
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:686)
2022-07-07T00:35:52.979Z INFO [xroad-proxy] [Thread-3] ee.ria.xroad.proxy.ProxyMain - Proxy shutting down...
```
We increased the `xroad-proxy` heap up to 2G to no avail. After further investigation, we noticed that the message timestamping and related DB queries to the `logrecord` table were taking up considerable time.
We're using the default values for retention and message body logging (`30` and `true`). The approximate row count in the `logrecord` table was ~127k. Increasing the DB cluster size or the `xroad-proxy` heap didn't really help. What did eventually help was truncating the message log - that resulted in:
- Timestamping times dropped from ~20 secs to < 100ms
- Proxy heap usage dropped considerably
- No more OOMs
- Database connectivity, DB load and general performance hugely improved
Prior to truncating the table, the system was virtually inoperable.
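For completeness, the truncation itself was plain SQL against the messagelog database (the default database and table names are assumed here). Obviously this throws away all logged messages and timestamps, so it's only an option if you don't need them as evidence:

```sql
-- Removes ALL message log records. Irreversible - only run this if the logged
-- messages and timestamps are not needed (e.g. for evidentiary purposes).
-- Assumes the default messagelog schema where records live in "logrecord".
TRUNCATE TABLE logrecord;
```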
Any idea why the timestamping queries took so long? Is there an issue with indices on the table? How do we turn off message logging completely? We already reduced retention to 1 day and disabled body logging, which hugely improved performance.
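For reference, the override we ended up with looks roughly like this (in `/etc/xroad/conf.d/local.ini`; the parameter names are what I recall from the X-Road system parameters guide, so please double-check them against the documentation for your version):

```ini
; Message log overrides - parameter names assumed from the X-Road system
; parameters documentation, verify for your version before applying.
[message-log]
; keep records for 1 day instead of the default 30
keep-records-for = 1
; do not log message bodies
message-body-logging = false
```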
Relates to #1292.
Hello @autero1,
I investigated it a bit. I think there are 2 separate issues:
1. The `LOGRECORD` table does not have an index suitable for the query composed here: https://github.com/nordic-institute/X-Road/blob/6c04d5258b836bffc3175217c9cf1295540cff93/src/addons/messagelog/messagelog-addon/src/main/java/ee/ria/xroad/proxy/messagelog/LogRecordManager.java#L172-L174. Currently only the `logrecordpk` primary key constraint is used. It may affect the performance, but as the table only has ~127,000 rows, it shouldn't (see the index sketch after this list).
2. `messagelog-addon` is implemented synchronously: Scala's `Await.result` is used, which blocks a thread until the timestamp provider (xroad-signer) responds. https://github.com/nordic-institute/X-Road/blob/28d16c08d44cc39dcf8b72334a9dcdba035c11d4/src/addons/messagelog/messagelog-addon/src/main/java/ee/ria/xroad/proxy/messagelog/LogManager.java#L206
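To illustrate the first point, something along these lines could cover that lookup, but treat it strictly as a sketch: the column name and filter are assumptions based on the messagelog schema, so check the actual query with `EXPLAIN ANALYZE` before creating anything.

```sql
-- Hypothetical partial index for "message records that have not been
-- timestamped yet". The timestamprecord column name is an assumption based on
-- the messagelog schema - verify against the real query plan first.
CREATE INDEX CONCURRENTLY ix_logrecord_not_timestamped
    ON logrecord (id)
    WHERE timestamprecord IS NULL;
```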
Regarding #2, how much traffic is going through the particular Security Server?
The DB update execution should not cause the heap increase, but the reference you provided looks odd. I'd rather think that waiting for the response from the signer blocks so many threads that it affects the proxy's processing.
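If you want to verify the blocked-threads theory on your side, a JVM thread dump of the `xroad-proxy` process should show the Akka dispatcher threads parked in `Await.result`. A rough sketch (how you find the proxy PID depends on your container setup):

```bash
# Dump all threads of the xroad-proxy JVM (replace <proxy-pid> with the real PID).
jcmd <proxy-pid> Thread.print > proxy-threads.txt

# Count threads currently waiting inside scala.concurrent.Await.result.
grep -c 'scala.concurrent.Await' proxy-threads.txt
```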
> Regarding #2, how much traffic is going through the particular Security Server?
The ~127k rows in the `logrecord` table over 30 days are fairly indicative of the traffic, so not a very high traffic volume - just a few thousand requests per day.
On a side note, the servers' memory usage and DB connections all stabilized after truncating the table. Memory usage for the k8s Pod dropped by at least 50%. Go figure 🤷
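In case it helps anyone hitting the same thing, this is roughly how we keep an eye on the table now (plain PostgreSQL against the messagelog database; the default database and table names are assumed):

```sql
-- Row count and on-disk size of the message log table.
SELECT count(*) AS record_count FROM logrecord;
SELECT pg_size_pretty(pg_total_relation_size('logrecord')) AS total_size;
```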
One way to disable message logging completely is to use the `slim` variant of the Security Server Sidecar Docker image. In your case, the correct variant would be `7.1.0-slim-fi`, or `7.1.0-slim-primary-fi` / `7.1.0-slim-secondary-fi` if you're running a Security Server cluster. The differences between the variants are described in the available images section on Docker Hub. Disabling message logging should have quite a significant effect on performance. When message logging is disabled, timestamping is disabled as well.
The downside of the `slim` image variant is that it doesn't include environmental and operational monitoring either, which means that collecting service usage statistics is pretty challenging. The `7.1.0-fi` variant of the Sidecar includes environmental monitoring, operational monitoring and the message log. Another alternative would be to use the `7.1.0-fi` variant and disable the message log add-on manually. This can be done by renaming the `/usr/share/xroad/jlib/addon/proxy/messagelog.conf` file and restarting the `xroad-proxy` service inside the container. However, the downside is that the change has to be done every time a new container is created.
```bash
mv /usr/share/xroad/jlib/addon/proxy/messagelog.conf /usr/share/xroad/jlib/addon/proxy/messagelog.conf.disabled
```
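One way to avoid repeating that step for every new container would be to bake the rename into a derived image. A minimal sketch, assuming the `niis/xroad-security-server-sidecar` repository on Docker Hub (adjust the tag to the variant you actually use):

```dockerfile
# Derived Sidecar image with the message log add-on disabled.
# Base image repository/tag are assumptions - point this at the variant you use.
FROM niis/xroad-security-server-sidecar:7.1.0-fi

# Renaming the add-on descriptor stops xroad-proxy from loading the messagelog add-on.
RUN mv /usr/share/xroad/jlib/addon/proxy/messagelog.conf \
       /usr/share/xroad/jlib/addon/proxy/messagelog.conf.disabled
```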
By the way, in the `-fi` image variants message body logging is `false` by default. If it was `true` in your configuration by default, you're probably using the vanilla variant of the image. For Finnish Palveluväylä users it's strongly recommended to use the `-fi` variants since they contain some Palveluväylä-specific overrides to the default configuration.