
Java heap space issues (memory leak?) when migrating from RESTEasy classic to RESTEasy reactive

Open DeMol-EE opened this issue 6 months ago • 7 comments

Describe the bug

We decided to finally migrate from RESTEasy classic to RESTEasy reactive, but when our build landed in production, we quickly started running into Java heap space issues. Reverting to RESTEasy classic (and changing only that) makes the problem disappear.

Migrated extensions:

  • quarkus-resteasy-jackson
  • quarkus-rest-client-jackson
  • (quarkus-resteasy-multipart) - a bit special because we use this extension with classic, whereas the equivalent support is bundled inside the reactive extension
  • quarkus-oidc-client-filter
  • quarkus-keycloak-admin-client

Expected behavior

Similar memory pressure for RESTEasy classic and RESTEasy reactive

Actual behavior

Higher memory pressure (leading to Java heap space issues) with RESTEasy reactive than with RESTEasy classic, which does not run into these issues.

How to Reproduce?

I don’t have a reproducer but I’m also not sure how to write one for this issue. We didn’t notice the problem until we deployed in production, where the load is unique. I could try to spend some time with a load testing suite to see if I can further pinpoint the issue to a particular coding paradigm, but for now all I can provide are screenshots and heap dumps.

For reference, it’s a pretty big codebase (not the average microservice) and we have a relatively slow release cycle (a little under 3 months), so there were a lot of changes to examine.

One particular thing we employ in our code base is subresources, which I have since understood to be a rarely used feature and with which we have already identified multiple issues.

As it could be related, I want to mention the points that came up when switching between classic and reactive, as they highlight areas that are affected and thus features we use:

  • we use compression, which is quite different between the 2 flavors (@Compressed vs @GZIP); a brief sketch follows this list
  • we have one endpoint that uses multipart (but it is not called regularly and I don’t think this is the culprit)
  • we have a handful of rest clients for which we use OidcClientRequestFilter
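To illustrate the compression point above, here is a rough sketch of the two styles. This is not our real code, and the reactive @Compressed annotation and its io.quarkus.vertx.http package are written from memory, so treat the exact names as an assumption; the two resources below would of course live in separate source files (one per flavor of the application).

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

// RESTEasy classic: gzip is opted into per method with @GZIP
// (combined with quarkus.resteasy.gzip.enabled=true).
@Path("/classic-report")
public class ClassicReportResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @org.jboss.resteasy.annotations.GZIP
    public String bigReport() {
        return "{ \"rows\": [] }"; // placeholder payload
    }
}

// RESTEasy reactive (separate file): compression is driven by
// quarkus.http.enable-compression and quarkus.http.compress-media-types,
// and can be forced per endpoint with @Compressed.
@Path("/reactive-report")
public class ReactiveReportResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @io.quarkus.vertx.http.Compressed
    public String bigReport() {
        return "{ \"rows\": [] }"; // placeholder payload
    }
}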

Output of uname -a or ver

eclipse-temurin 17-jdk-ubi9-minimal (Docker image)

Output of java -version

JDK 17

Quarkus version or git rev

3.6.7

Build tool (ie. output of mvnw --version or gradlew --version)

mvn 3.9.5

Additional information

We’re running in JVM mode on an OpenShift 4 cluster. This service runs on 4 pods and handles about 50-100 requests per second. Here’s a screenshot of our Grafana dashboard that monitors memory usage across the last 24 hours:

[screenshot: Grafana dashboard of memory usage over the last 24 hours]

The vertical bars yesterday highlight pods being killed and rescheduled continuously. Here is a close-up of the afternoon, which shows that the average lifetime of a pod was under 10 minutes:

[screenshot: close-up of the afternoon, average pod lifetime under 10 minutes]

We deployed the build in which we reverted from RESTEasy reactive to RESTEasy classic around 17:10.

[screenshot: memory usage around the 17:10 rollback deployment]

This is also the end of the business day, so it is normal that we see a significant drop in requests/traffic at this point. We skip the night and fast forward to this morning, where we see an uptick in pressure, but no more heap space issues and thus no more pods being killed:

[screenshot: memory usage this morning, traffic picking up but no more heap space issues]

If desirable, I have a handful of heap dumps from a pod with RESTEasy reactive and a pod with RESTEasy classic. Though I have no reproducer, perhaps examining the dumps could reveal a possible cause.

DeMol-EE avatar Feb 13 '24 10:02 DeMol-EE

/cc @FroMage (resteasy-reactive), @geoand (openshift,resteasy-reactive), @iocanel (openshift), @stuartwdouglas (resteasy-reactive)

quarkus-bot[bot] avatar Feb 13 '24 10:02 quarkus-bot[bot]

Very interesting.

Obviously a reproducer project would be the absolute best way to try and nail down what the problem is, but if that is not available, we might be able to find the root cause by examining heap dumps if such are available.

geoand avatar Feb 13 '24 11:02 geoand

Heap dumps would be useful but do not make them public, especially if they can contain sensitive information.

If you are allowed to share them with us, we will make available a secured space for you to upload them.

If not, inspect them yourself and it might provide some very valuable insights.

https://www.baeldung.com/java-heap-dump-capture might help if you are not familiar with heap dumps.
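If it helps, a dump can also be triggered from inside the application via the HotSpot diagnostic MXBean. A generic sketch (the output path is just an example):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Writes a heap dump in HPROF format; "true" restricts the dump to live
    // objects, which keeps the file smaller and is usually what you want when
    // hunting for a leak.
    public static void dump(String filePath) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, true);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/app-heap.hprof"); // example path
    }
}

The resulting .hprof file can then be opened in VisualVM or Eclipse MAT.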

gsmet avatar Feb 13 '24 11:02 gsmet

Heap dumps would be useful but do not make them public, especially if they can contain sensitive information.

Definitely. That's probably something we should document somewhere

geoand avatar Feb 13 '24 11:02 geoand

We were using VisualVM to look at the dumps (taken with jmap) but couldn’t find anything blatantly obvious. We’re able and happy to send the dumps to a safe space @gsmet.

I did have a new thought: I recall seeing a lot of ZIP-related things on the heap, and given that compression works differently between the two flavors of RESTEasy, I was wondering if that could be related. Config properties for classic:

quarkus.resteasy.gzip.enabled=true
quarkus.resteasy.gzip.max-input=10M

For reactive:

quarkus.http.enable-compression=true
quarkus.http.compress-media-types=text/html,text/plain,text/xml,text/css,text/javascript,application/javascript,application/graphql+json,application/json

We have only 7 endpoints that use compression, but some of them are queried rather heavily...

DeMol-EE avatar Feb 13 '24 12:02 DeMol-EE

Can you send us an email at gsmet at redhat dot com and gandrian at redhat dot com from your corporate email address?

gsmet avatar Feb 13 '24 12:02 gsmet

@DeMol-EE I strongly suggest extracting GC logs or related JFR events too, to understand how the heap occupancy has evolved over time - I suspect the reported memory usage is only part of the story, given that it is RSS. Also beware of issues like https://github.com/quarkusio/quarkus/discussions/36691#discussioncomment-7751732, e.g. a zip Inflater that is not closed and eventually leads to OOM - which, if hit, requires something NOT running on OpenShift to troubleshoot properly (e.g. profiling with async-profiler -e mprotect).
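To show what I mean by the Inflater issue, a minimal generic sketch (not taken from your code): java.util.zip streams hold native memory that is only released by close()/end(), so a missing close can exhaust the process even while the Java heap itself looks healthy.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipHandling {

    // Leaky pattern: the GZIPInputStream (and its native Inflater) is never
    // closed, so the native buffers hang around until finalization.
    static byte[] readLeaky(InputStream compressed) throws IOException {
        GZIPInputStream in = new GZIPInputStream(compressed);
        return in.readAllBytes(); // no close() -> native Inflater memory retained
    }

    // Safe pattern: try-with-resources closes the stream, which ends the
    // Inflater and releases its native memory deterministically.
    static byte[] readSafely(InputStream compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            return in.readAllBytes();
        }
    }
}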

franz1981 avatar Feb 14 '24 14:02 franz1981

After some digging, we found the cause of the problem. The good news is that there is no memory leak.

What was happening is that proxying a chunked response does not actually work in streaming fashion. We have 2 services, S and C, where S has a JAXRS endpoint that returns a large file in chunks, using StreamingOutput. C also has a JAXRS endpoint, which, when called, proxies the call to S. Our idea was to use a microprofile rest client for S with return type Response, and to return the response directly from C's endpoint. The assumption was that passing the naked Response would avoid consuming the data stream (as well as passing through headers and status code etc). This worked with RESTEasy classic, but it appears that this was just a happy coincidence. It is not portable, and does not work in RESTEasy reactive, which seems to load the entire response from S before returning it from C.

A better way to achieve our goal is to replace Response with InputStream (and manually handle error codes and headers). By doing this, the data is streamed as expected.
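To make the pattern concrete, here is an approximate sketch of the change (class and method names are made up, not our real code):

import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import java.io.InputStream;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;
import org.eclipse.microprofile.rest.client.inject.RestClient;

// Hypothetical MicroProfile REST client for service S.
@RegisterRestClient(configKey = "service-s")
@Path("/files")
interface ServiceSClient {

    // Original signature: Response downloadResponse(@PathParam("id") String id);
    // Forwarding that Response worked on classic only by coincidence; reactive
    // appeared to load the whole entity into memory before C returned it.

    // Replacement: returning InputStream keeps the body a stream, at the cost
    // of handling upstream status codes and headers ourselves.
    @GET
    @Path("/{id}")
    InputStream downloadStream(@PathParam("id") String id);
}

// Proxying endpoint on service C.
@Path("/proxy/files")
public class FileProxyResource {

    @Inject
    @RestClient
    ServiceSClient serviceS;

    // Before: `return serviceS.downloadResponse(id);` with return type Response.
    // After: pass the InputStream through and let JAXRS write it as the body.
    @GET
    @Path("/{id}")
    public InputStream forward(@PathParam("id") String id) {
        return serviceS.downloadStream(id);
    }
}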

DeMol-EE avatar Feb 29 '24 20:02 DeMol-EE

Thanks a lot for finding the root cause of this!

geoand avatar Mar 01 '24 06:03 geoand

@geoand @gsmet I thought it might interest you to know that our approach of "forwarding" a Response from a MicroProfile REST client to a JAXRS endpoint is also problematic with JAXRS classic. We discovered today that it appears to block a Vert.x event loop thread(!). So, even staying with classic for now, we’ve refactored our forwarding calls to use InputStream (on the side of the rest client) and StreamingOutput (on the side of JAXRS) instead.
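For completeness, the refactored forwarding now looks roughly like this (same made-up names as in my earlier sketch): the REST client hands back an InputStream and the JAXRS endpoint copies it into the response via a StreamingOutput.

import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.core.StreamingOutput;
import java.io.InputStream;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;
import org.eclipse.microprofile.rest.client.inject.RestClient;

// Hypothetical REST client for the upstream service S.
@RegisterRestClient(configKey = "service-s")
@Path("/files")
interface ServiceSClient {

    @GET
    @Path("/{id}")
    InputStream downloadStream(@PathParam("id") String id);
}

// Forwarding endpoint on C: the copy into the response happens inside
// StreamingOutput.write, chunk by chunk.
@Path("/proxy/files")
public class StreamingFileProxyResource {

    @Inject
    @RestClient
    ServiceSClient serviceS;

    @GET
    @Path("/{id}")
    public StreamingOutput forward(@PathParam("id") String id) {
        return output -> {
            // try-with-resources closes the upstream stream (and its connection)
            // once everything has been written to the response.
            try (InputStream in = serviceS.downloadStream(id)) {
                in.transferTo(output);
            }
        };
    }
}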

DeMol-EE avatar Jun 20 '24 12:06 DeMol-EE