Socket hang up with long running request to service

Open codedge opened this issue 1 year ago • 5 comments

Hey!

I'm seeing odd behaviour with long-running requests to a service connected to the gateway. I run a PHP-based service behind the gateway that sometimes needs up to 45s to return a response. In 90% of those cases the gateway does not return the response and I get a Socket hang up instead.

I already enabled the Limits plugin with this configuration:

{
  "name": "limits",
  "config": {
    "max-response-time": "120s",
    "max-request-bytes": 1000000
  }
}

This removes the initial reached timeout error message, but the connection between the gateway and the service still somehow gets lost.

I am 100% sure that the response is correct and is returned properly by the (backend) service. When calling the GraphQL endpoint of the backend service directly, there is no issue at all.

Does that sound familiar to you, or do you have any hint where to look?

Thanks!

codedge avatar Nov 13 '23 08:11 codedge

Hi @codedge, can you post a copy of the response you're getting? The string Socket hang up doesn't seem to show up when I search the Go stdlib, and you mention later that it's a reached timeout.

Does it log the request when it fails?

pkqk avatar Nov 13 '23 20:11 pkqk

Sorry for the confusion.

1. Resolving the reached timeout

At first I got a reached timeout error. This error was directly visible in the Bramble logs. I figured out that using the limits plugin with the configuration mentioned above gets around this error.

This is solved ✔️

2. The Socket hang up problem

This error is reported by curl, Postman, or any other GraphQL client. There is no other response.

Error in curl

[screenshot: 2023-11-13_224641]

Error in Postman

[screenshot: 2023-11-13_225222]

I suspect this is some keep-alive/idle timeout problem.

I also found this link, which talks about the WriteTimeout field of net/http's http.Server.

It logs the request towards the backend service, but it does not log the response coming back.

codedge avatar Nov 13 '23 21:11 codedge

... and I can confirm that after changing the WriteTimeout to, for example, 60 seconds

func runHandler(ctx context.Context, wg *sync.WaitGroup, name, addr string, handler http.Handler) {
	srv := &http.Server{
		Addr:         addr,
		Handler:      handler,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 60 * time.Second,
		IdleTimeout:  120 * time.Second,
	}
	// ...
}

everything works flawlessly.

Do you think you can make this configurable via the limits plugin?

Update

I can see that there are three server instances running: public, private, and metrics. I guess in my case only the public one is relevant.

Ideally the user is able to configure this for each of these three.

I would create a PR (if you don't find time).

codedge avatar Nov 13 '23 22:11 codedge

Thanks for doing the debugging @codedge, that makes sense: if the write timeout is set to 10s by default, the server will close the socket before your service has responded.

It would be useful to have bramble craft a timeout response in that situation but we can make the socket settings tuneable as well.

The public and private muxes are there so that plugins can apply different middleware to a published endpoint and an internal endpoint. For example, we have auth on the public mux, which is exposed via ingress to our webapp, while the private mux serves backend services inside our VPC.

pkqk avatar Nov 15 '23 00:11 pkqk

Is there a release planned to include the new configuration?

codedge avatar Nov 28 '23 10:11 codedge