spring-integration icon indicating copy to clipboard operation
spring-integration copied to clipboard

Webflux outbound handler randomly not releasing API response and triggering gateway timeout.

Open Christopher-Flanagan opened this issue 3 years ago • 15 comments

In what version(s) of Spring Integration are you seeing this issue?

spring-integration-webflux - 5.5.12 spring-boot-starter-integration - 2.7.0

Describe the bug

Bug - Webflux outbound handler randomly not releasing API response and triggering gateway timeout.

In our application, we have developed a reusable flow that makes HTTP requests to our internal APIs, via the web flux outbound handler. This reusable flow is triggered by a message gateway with a default timeout of 30 seconds.

What we have noticed during our end-to-end testing is that randomly the web flux handler will receive a response from our API, but will not release the response until after our message gateway has timed out. Once the gateway has timeout out, our reusable flow continues its processing of the message and then attempts to return the payload but fails due to the gateway has already timed out.

2022-10-14 07:32:03.279 WARN 1 --- [oundedElastic-5] cMessagingTemplate$TemporaryReplyChannel: Reply message received but the receiving thread has exited due to a timeout:

To Reproduce

Unfortunately, we have been unable to determine a way to reproduce this issue.

Expected behavior

We expect the outbound gateway to pass the response to the next step in our flow and not trigger the timeout on our message gateway.

Sample

A basic example of the reusable flow along with the log file with logging set to TRACE https://github.com/Christopher-Flanagan/webflux-issue

Christopher-Flanagan avatar Oct 14 '22 22:10 Christopher-Flanagan

Are you sure that there is no way to make that sample project much simpler? It is really hard to understand what is going on with so much custom code.

Another question: are you sure that 30 seconds is enough for your requests? You confirm that you got that Reply message received but the receiving thread has exited due to a timeout, which means that you got a reply, although it comes to you much later, than 30 seconds.

Give us a simple clue, please, what is that "not releasing API response". Really not sure how to help you since we are not on the same page...

And one more thought: did you try your solution with the Http.outboundGateway() instead of WebFlux? Why do you need WebFlux at all since you do blocking for requests anyway?

artembilan avatar Oct 17 '22 15:10 artembilan

Hey @artembilan,

Thanks for replying back, but of course, I can try to make it simpler I wasn't sure how detailed it needed to be there, so I thought what was there was fine.

Regarding the timeout, yes I can confirm that 30 seconds is more than enough time to process our request, it's a simple process and it completes in under 1-2 secs, we also confirm this by viewing the changes made to the DB.

What we mean by "not releasing API the response" is that we can see the response status of our request Response status: 200 OK in the logs along with the state change of [response_completed], this all happens within a one second of the request being made, but the next chain in our flow which simply prints out the payload from the outboundGateway does not trigger for another 30 seconds.

It appears to us that the response has been completed but the outbound gateway handler is not producing a reply. I am not sure if you have viewed the README.MD in the but it points to key lines in the log.txt file which is the TRACE file of the process if that is helpful to you.

We did consider moving over to the Http.outboundGateway, but thought that this might be a better option to pursue first before changing the code and having to rest a majority of the application.

Christopher-Flanagan avatar Oct 17 '22 16:10 Christopher-Flanagan

Yeah... Would be great to be able to reproduce with a simple project and even without Azure and Spring Cloud at all.

Right now there are too much stuff to digest.

Is that OK to see WebClientResponseException$InternalServerError: 500 Internal Server Error from DELETE https://test-two-api/transfers/36505144 in your logs?

artembilan avatar Oct 17 '22 16:10 artembilan

I completely agree, I do find it difficult to articulate the issue as there is a lot of moving parts to it. In regards to the 500 error that is expected from the API.

We will keep trying to create a project that can reproduce the issue, but unfortunately, it is quite random and rare for it to happen.

Christopher-Flanagan avatar Oct 17 '22 16:10 Christopher-Flanagan

My apologies I accidentally hit the close ticket.

Christopher-Flanagan avatar Oct 17 '22 16:10 Christopher-Flanagan

It look like this one is an answer for a first request:

2022-10-14 09:20:08.278 DEBUG 1 --- [or-http-epoll-4] c.b.p.config.client.WebClientConfig      : Response status: 204 NO_CONTENT

so, no content - therefore nothing to process. What do I miss?

artembilan avatar Oct 17 '22 16:10 artembilan

.route(Message.class, "SOME CHANNEL name")

???

This one calls the API like:

	/**
	 * Populate the {@link MethodInvokingRouter} for the method
	 * of the provided service and its method with default options.
	 * @param service the service to use.
	 * @param methodName the method to invoke.
	 * @return the current {@link BaseIntegrationFlowDefinition}.
	 * @see MethodInvokingRouter
	 */
	public B route(Object service, String methodName) {

Which is fully wrong from the language perspective: the method name cannot be SOME CHANNEL name and there is indeed no such a method in the Message class. I'm very confused...

artembilan avatar Oct 17 '22 17:10 artembilan

I also see in your logs something like this:

2022-10-14 09:20:38.248 DEBUG 1 --- [oundedElastic-5] o.s.integration.handler.LoggingHandler   : Outbound gateway payload : <204 NO_CONTENT No Content,[set-cookie:"dtCookie=v_4_srv_8_sn_ECED1D560535A769AF1AB4E85F63DD08_perc_100000_ol_0_mul_1_app-3Aea7c4b59f27d43eb_1; Path=/; Domain=.aroapp.io", "324a21f111618e91d5637f6f837d8ae1=d5ecb4890130744940317cc2a9e7aaeb; path=/; HttpOnly; Secure; SameSite=None", x-oneagent-js-injection:"true", vary:"Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers", date:"Fri, 14 Oct 2022 16:20:08 GMT", cache-control:"private"]>

Which is apparently a response for that first message. Do you really have a proper handling for this data?

artembilan avatar Oct 17 '22 17:10 artembilan

Yes I understand, when you suggested that it was difficult to follow I removed some of the complexity as it seemed like the TRACE was not being used to match up with the code, which is the reason for the complexity

I reverted it back so that you can match up the code to the log file

Christopher-Flanagan avatar Oct 17 '22 17:10 Christopher-Flanagan

For that data, we don't have a handler as we simply require the HTTP status, we store the original payload/message in the message header and extract/transform it again after we receive that response status code.

Christopher-Flanagan avatar Oct 17 '22 17:10 Christopher-Flanagan

So, your logic looks like this:

  1. Send a request.
  2. If 204 response send a create request via the same HTTP gateway.
  3. Then you go to some REVERT_TO_ORIGINAL_PAYLOAD_CHANNEL which does not have any subscriber in your code at the moment.

And that all indeed happens in the same thread which is waiting for reply in the beginning.

What I see according to your logs that there is indeed some 30 seconds delay: This is the answer for original request.

2022-10-14 09:20:08.278 DEBUG 1 --- [or-http-epoll-4] c.b.p.config.client.WebClientConfig      : Response status: 204 NO_CONTENT

and this is sub subsequent, nested request for the "NO_CONTENT" sub-flow:

2022-10-14 09:20:38.253  INFO 1 --- [oundedElastic-5] o.s.integration.handler.LoggingHandler   : No shipment record has been found.

See the difference in seconds.

It is not a WebFlux problem, but more your solution logic. I'm not sure where to look, but probably you are blocked somehow on some DB calls or whatever.

artembilan avatar Oct 17 '22 17:10 artembilan

Correct, that basically the flow for this part, I could include theREVERT_TO_ORIGINAL_PAYLOAD_CHANNEL section but it would add complexity to it.

To be honest, I wish it was some logic error on our side as I don't like to waste people's time but we simply don't see it, as this code works 99% of the time with no issue. I would like to point out one thing if it is your opinion that it is still a logic issue I will close the ticket.

As we can see the response/answer for the original request here :

2022-10-14 09:20:08.278 DEBUG 1 --- [or-http-epoll-4] c.b.p.config.client.WebClientConfig : Response status: 204 NO_CONTENT

If you look into the flow that makes this HTTP request you will notice that there is a line that prints out the payload directly after the web flux message handler -

.log(LoggingHandler.Level.DEBUG, m -> String.format("Outbound gateway payload : %s", m.getPayload()))

which it does here, but with a difference of 30 seconds -

2022-10-14 09:20:38.248 DEBUG 1 --- [oundedElastic-5] o.s.integration.handler.LoggingHandler : Outbound gateway payload : <204 NO_CONTENT No Content,[set-cookie:"dtCookie=v_4_srv_8_sn_ECED1D560535A769AF1AB4E85F63DD08_perc_100000_ol_0_mul_1_app-3Aea7c4b59f27d43eb_1; Path=/; Domain=.aroapp.io", "324a21f111618e91d5637f6f837d8ae1=d5ecb4890130744940317cc2a9e7aaeb; path=/; HttpOnly; Secure; SameSite=None", x-oneagent-js-injection:"true", vary:"Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers", date:"Fri, 14 Oct 2022 16:20:08 GMT", cache-control:"private"]>

There is no custom logic here from us apart from our Internal API, and the endpoint we call simply returns a 200 or 204 status code with nothing else, and we can see that response status at 2022-10-14 09:20:08.278

So how could there be a logic error here, when all we do is simply wait for the reply from the handler? but the handler doesn't return it for another 30 seconds, which coincidentally lines up with the message gateway timeout.

Christopher-Flanagan avatar Oct 17 '22 18:10 Christopher-Flanagan

Yeah... I see your point.

So, in logs there is present this line:

2022-10-14 09:20:08.343 DEBUG 1 --- [or-http-epoll-4] o.s.i.h.support.DefaultHttpHeaderMapper  : setting headerName=[cache-control], value=private

Which is the last log for that first 204 status message. In the code this happens at this WebFlux handler line:

.map(this::getReply);

which means that reply Mono is indeed subscribed.

Then it looks like the WebClient got busy with something else without really returning from a thread we just got a response.

How about to consider to use an internal WebClient instance instead of shared one? I mean configure your httpOutboundGateway for WebFlux.outboundGateway() with a WebClient arg.

artembilan avatar Oct 17 '22 19:10 artembilan

You are correct, it is related to the web client. I was able to create a unit test that reproduces the issue and have pushed it to the repo., it appears to happen when two messages use the web client at the same time.

So the reason we share the WebClient bean is due to our OAuth server, the token is valid for each of the API requests and we thought it would be possible to share the web client and just update the URLs.

Christopher-Flanagan avatar Oct 17 '22 21:10 Christopher-Flanagan

Glad to hear that we have localized a culprit! 😄

Can you, please, confirm what is the workaround? Can we reproduce the problem without Spring Integration?

artembilan avatar Oct 17 '22 21:10 artembilan

Hi there!

Any news? Do we still have a problem from Spring Integration perspective? Or we simply can close with a Works as Designed label?

artembilan avatar Oct 24 '22 16:10 artembilan

Hi @artembilan

Apologies for the delay, we were still trying to determine the root causes of this issue, unfortunately, our test cases led to a number of false positives.

Instead of spending more time on this issue, we decided last Friday to simply change over to using rest templates, this is currently being tested now, and we have not run into any of the timeout issues we faced with web client.

I will close this ticket we have no further updates to post. Cheers for all the help with this.

Christopher-Flanagan avatar Oct 24 '22 22:10 Christopher-Flanagan