
HTTP client pool stops processing requests with "Exceeded configured max-open-requests value"

Open vilunov opened this issue 4 years ago • 57 comments

We use sttp v2 with single-request API (no streams). Some configs:

akka.http.host-connection-pool {
  max-connections = 64
  max-open-requests = 128 // Must be a power of 2
}

akka.http.client {
  idle-timeout = 5 seconds
  connecting-timeout = 1 second  // this is ignored when used with sttp, but we introduced a workaround
}

At some point in time, this happens:

(Screenshot: Screenshot_20201125_111803)

Here the requests to the typing host stopped working at 17:16, shortly after the server started at 17:11. Such a freeze can happen at any point in the instance's life though, even days after the start.

Here are some hypotheses that we considered:

  • The result stream is not consumed somehow and clogs the request pool -- stream consumption is handled by the sttp adapter.
  • The freeze happens because the pool is overloaded -- we introduced timeouts, and even with them the pools still freeze until the instance is restarted.
  • Threads in the thread pool are being blocked -- in the heap dump, most of the threads are either RUNNING or WAITING for new work.

Error log:

akka.stream.BufferOverflowException: Exceeded configured max-open-requests value of [128]. This means that the request queue of this pool (HostConnectionPoolSetup(typing,80,ConnectionPoolSetup(ConnectionPoolSettings(64,0,2,128,1,1 minute,100 milliseconds,15 seconds,30 seconds,ClientConnectionSettings(Some(User-Agent: gateway),3 seconds,300 milliseconds,512,None,WebSocketSettings(<function0>,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$236/564086167@db159b5),List(),ParserSettings(2048,16,64,64,8192,64,Some(33554432),8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$217/203048480@5a78e22),100 milliseconds,None,Http2ClientSettingsImpl(256,65536,10000000,512000,1024,false,None),TCPTransport),1 second,List()),akka.http.scaladsl.HttpConnectionContext$@20049e4b,akka.event.MarkerLoggingAdapter@116df1d))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
	at akka.http.impl.engine.client.PoolInterface$Logic.<init>(PoolInterface.scala:106)
	at akka.http.impl.engine.client.PoolInterface$PoolInterfaceStage.createLogicAndMaterializedValue(PoolInterface.scala:94)
	at akka.stream.impl.GraphStageIsland.materializeAtomic(PhasedFusingActorMaterializer.scala:700)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:499)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:449)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:441)
	at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:703)
	at akka.http.impl.engine.client.PoolInterface$.apply(PoolInterface.scala:80)
	at akka.http.impl.engine.client.PoolMasterActor.akka$http$impl$engine$client$PoolMasterActor$$startPoolInterface(PoolMasterActor.scala:131)
	at akka.http.impl.engine.client.PoolMasterActor$$anonfun$receive$1.applyOrElse(PoolMasterActor.scala:162)
	at akka.actor.Actor.aroundReceive(Actor.scala:534)
	at akka.actor.Actor.aroundReceive$(Actor.scala:532)
	at akka.http.impl.engine.client.PoolMasterActor.aroundReceive(PoolMasterActor.scala:111)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
	at akka.actor.ActorCell.invoke(ActorCell.scala:543)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
	at akka.dispatch.Mailbox.run(Mailbox.scala:230)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

vilunov avatar Nov 25 '20 08:11 vilunov

I don't have much experience with sttp. A minimal project that shows the problem in isolation and can be used to further diagnose the problem might be a good start?

raboof avatar Nov 25 '20 09:11 raboof

We cannot provide a minimal project at all. The same project runs in our staging environment, and this only happens in production under heavy load. sttp should not be a problem here, as these errors come from akka.

vilunov avatar Nov 25 '20 10:11 vilunov

The freeze happens because the pool is overloaded -- we introduced timeouts, and even with them the pools still freeze until the instance is restarted.

I think this is the most likely case. Since timeouts currently don't purge requests from the pool, you can end up in a situation where all requests will be handled too late and eventually time out (especially since the queueing time is part of the timeout).
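
To illustrate in isolation: a minimal sketch (it uses Http().singleRequest directly rather than sttp, and the host and the 2-second timeout are placeholders) of a caller-side timeout. It fails the future handed to the caller, but the request itself stays queued in the pool and keeps occupying capacity.

import java.util.concurrent.TimeoutException

import scala.concurrent.Future
import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{ HttpRequest, HttpResponse }
import akka.pattern.after

object CallerSideTimeout extends App {
  implicit val system: ActorSystem = ActorSystem()
  import system.dispatcher

  // Fails the caller's future after `timeout`, but does NOT remove the request
  // from the host connection pool -- it still counts against max-open-requests.
  def withTimeout(request: HttpRequest, timeout: FiniteDuration): Future[HttpResponse] =
    Future.firstCompletedOf(Seq(
      Http().singleRequest(request),
      after(timeout, system.scheduler)(
        Future.failed(new TimeoutException(s"no response within $timeout")))
    ))

  withTimeout(HttpRequest(uri = "http://example.com/"), 2.seconds)
    .onComplete(println) // remember to consume or discard the entity of successful responses
}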

jrudolph avatar Nov 25 '20 10:11 jrudolph

Dealing with server overload is a hard problem and there's no good solution to it currently. After all, the goal must be to shave off as much work from the server as possible when you detect the problem (however, where should those requests then go?).

Here are some hints at what you could try to do:

  • provide more capacity at the server
  • keep queues as small as possible (max-open-requests even smaller than 128) and then put a buffer before the pool that you manage actively. Add a circuit breaker in front of it that rejects requests directly when overload is detected.
  • To avoid having to restart the app completely when you detect these kinds of situations, you can shut down the pool manually. You can get a handle to the pool (HostConnectionPool) by materializing Http().cachedHostConnectionPoolHttps for the host in question and calling HostConnectionPool.shutdown(). There's also Http().shutdownAllConnectionPools() (though I think that might leave some requests hanging). A sketch of this follows below.
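
A minimal sketch of the manual-shutdown approach -- the host name, the placeholder request and the "when to trigger" logic are assumptions; in a real service you would keep the HostConnectionPool handle around and call shutdown() once your own metrics detect a stuck pool:

import scala.concurrent.Future
import scala.util.Try

import akka.Done
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.Http.HostConnectionPool
import akka.http.scaladsl.model.{ HttpRequest, HttpResponse }
import akka.stream.scaladsl.{ Flow, Keep, Sink, Source }

object ManualPoolShutdown extends App {
  implicit val system: ActorSystem = ActorSystem()
  import system.dispatcher

  // Materializing the cached pool flow gives us a handle to the shared pool.
  val poolFlow: Flow[(HttpRequest, Int), (Try[HttpResponse], Int), HostConnectionPool] =
    Http().cachedHostConnectionPoolHttps[Int]("example.com")

  val (pool, done) =
    Source.single(HttpRequest(uri = "/") -> 1)
      .viaMat(poolFlow)(Keep.right)
      .toMat(Sink.foreach { case (result, _) => println(result) })(Keep.both)
      .run()

  // When a frozen pool is detected, shut it down so the next request
  // materializes a fresh pool instead of queueing behind stuck slots.
  done.flatMap((_: Done) => pool.shutdown())

  // Bigger hammer: shut down every pool managed by this Http() extension.
  // Http().shutdownAllConnectionPools()
}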

If you can provide debug logging for the pool (filter for class akka.http.impl.engine.client.PoolId), we can look into what other causes could lead to the behavior. Which version of akka-http do you use?

jrudolph avatar Nov 25 '20 10:11 jrudolph

I'm afraid we don't have PoolId yet -- we used 10.1.2 at the time of creating the memory dump. Right now we use 10.2.0 (downgraded from 10.2.1 due to a bug with S3) and observe the same problem.

However, I see objects of the class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot, and most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.

vilunov avatar Nov 25 '20 13:11 vilunov

However, I see objects of the class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot, and most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.

Yes

jrudolph avatar Nov 25 '20 13:11 jrudolph

We followed your advice to keep queues as small as possible and used the following configs:

akka.http.host-connection-pool {
  max-connections = 256
  max-open-requests = 256 // Must be a power of 2
}

Still, shortly after startup we encountered the same problem with one of our instances: all requests to a single hostname failed, without a single successful request completion, for 30 minutes. We will retrieve and analyze the heap dump a bit later; what should we look for besides PoolId objects?

vilunov avatar Nov 26 '20 12:11 vilunov

DEBUG log information might be more interesting than looking through the dumps, because it will also give a timeline of what's going on.

jrudolph avatar Nov 26 '20 13:11 jrudolph

Can you specify which exact loggers we should enable debug-level logging for? We cannot afford to store all debug logs, sadly.

vilunov avatar Nov 26 '20 13:11 vilunov

akka.http.impl.engine.client.PoolId might be enough, but otherwise everything under akka.http would help.
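
In case it helps with the volume: with the SLF4J backend (akka-slf4j on the classpath plus a logback.xml), akka can consult the backend's levels before publishing events, so only the PoolId logger has to run at DEBUG while the root level stays at INFO (e.g. <logger name="akka.http.impl.engine.client.PoolId" level="DEBUG"/> in logback.xml). A sketch of the application.conf side, assuming that setup:

akka {
  # let DEBUG events reach the logging backend at all
  loglevel = "DEBUG"
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  # consult the SLF4J/logback levels before publishing,
  # so unwanted DEBUG events are dropped at the source
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
}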

jrudolph avatar Nov 26 '20 13:11 jrudolph

We cannot afford to enable debug logs for these loggers, as it increases our log volume tenfold; we simply do not have the resources to store that many logs at our scale.

Following the advice, we disabled the queue by setting both max-connections and max-open-requests to 256. This did not help; we still have stuck client pools.

All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?

vilunov avatar Nov 30 '20 14:11 vilunov

All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?

That means that all requests are waiting for responses from the server. The disconnectAt timeout is only applied between requests. Since we don't enforce any response timeouts currently, the connections will only be reclaimed when the idle timeout kicks in. Are you saying that not even the idle-timeout kicks in?

jrudolph avatar Dec 03 '20 11:12 jrudolph

Nope, we have idle-timeout set to 5 seconds and it kicks in for some requests judging by the logs, but disconnectAt has a value of the current time minus ~30 minutes, i.e. these requests have been waiting for responses for around half an hour. Metrics indicate the same – no requests are completed for a long time.

vilunov avatar Dec 03 '20 15:12 vilunov

Hard to say what could be the reason for the idle-timeout not being triggered. I think that's the first time I've heard of the idle-timeout not working. This is just a wild guess, but if you use Kamon, you could try running without it. It's hard to diagnose without more info.

jrudolph avatar Dec 03 '20 15:12 jrudolph

No, we don't use Kamon. By metrics I mean our own non-invasive metrics around request invocations.

vilunov avatar Dec 03 '20 15:12 vilunov

Looks like we get to join the dance (Akka HTTP 10.1.x). I don't have any more information to bring to the table other than that @vilunov isn't the only one suffering from it.

In our case we use rather big pools:

  client{
    connecting-timeout = 15s
    idle-timeout = 50s
  }
  
  
  host-connection-pool{
    idle-timeout = 500s
    max-connections = 32767
    min-connections = 0
    max-open-requests = 131072
  }

They don't time out and just get stuck for hours. I've seen it auto-resolve in some cases, but the current occurrence has been stuck for about 24h now.

spangaer avatar Dec 14 '21 12:12 spangaer

Another thing I can add is that I have code doing Akka HTTP to Akka HTTP calls that don't seem to suffer, while these calls have an ALB sitting in between.

spangaer avatar Dec 14 '21 12:12 spangaer

Could this somehow be related to this failure? https://github.com/akka/akka-http/issues/3916

SkyTrix avatar Dec 14 '21 14:12 SkyTrix

If this happens with high CPU load in the pool, then it might be related to the issue fixed in 67ac11c280c74a0ee9904e3aeff4fe719d38836f. I'd recommend updating to 10.2.x for big pools (> 500-1000 connections).
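
For reference, the upgrade itself is a dependency bump (the versions below are examples, not a pinned recommendation; akka-http 10.2.x requires Akka 2.6):

// build.sbt -- example versions only
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-http"   % "10.2.7",
  "com.typesafe.akka" %% "akka-stream" % "2.6.17"
)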

jrudolph avatar Dec 14 '21 15:12 jrudolph

Does that commit fix some slot-release escapes somehow?

(I've looked at the commit, but it isn't apparent to me from there.)

Yes, high CPU load would have been involved here.

spangaer avatar Dec 14 '21 15:12 spangaer

Does that commit fix some slot-release escapes somehow?

(I've looked at the commit, but it isn't apparent to me from there.)

Yes, high CPU load would have been involved here.

That commit fixes an issue where the pool management logic itself spins and contributes to CPU load, up to a degree where little useful work can be done by the pool. If you are saying it isn't making any progress at all, it might still be something different, but even in that case, it would be good to try 10.2 and see if it shows the same issue.

jrudolph avatar Dec 14 '21 15:12 jrudolph

Well, the state is 32k slots in WaitingForResponse, with seemingly all SlotConnections claiming to be connected, while nothing has been moving anymore (since last evening) and the server-side and ALB timeouts have long expired.

From what you're saying, it seems like I should expect a single CPU cooking all this time? That is not the case.

It seems to involve some kind of connectivity issue whose errors escape the slot-release process.

spangaer avatar Dec 14 '21 15:12 spangaer

While studying this code (v10.1.x, but still valid): https://github.com/akka/akka-http/blob/v10.2.7/akka-http-core/src/main/scala/akka/http/impl/engine/client/pool/NewHostConnectionPool.scala#L289

Isn't that a race condition?

If lines 271 and 288 are reached at the same time, two invocations of updateState() will be running side by side and the outcome will be rather unpredictable.

currentTimeoutId should be volatile, because it is read from a different thread than the one clearing/changing it, but even that won't cut it, I fear. It seems like it either needs checks and sets on both sides, in the right order (or some locking), or it needs to happen inside the actor "event loop".

Or am I missing something?
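
(For context, a standalone toy sketch of the pattern being discussed -- this is not akka-http's actual pool code: timer callbacks fired on a scheduler thread are routed back into the stage via getAsyncCallback, so that all state changes happen on the stage's single-threaded "event loop".)

import scala.concurrent.duration._

import akka.stream.{ Attributes, Outlet, SourceShape }
import akka.stream.stage.{ GraphStage, GraphStageLogic, OutHandler }

// Toy source that fails if it is not pulled within `timeout`.
class TimeoutAwareSource(timeout: FiniteDuration) extends GraphStage[SourceShape[String]] {
  val out: Outlet[String] = Outlet("TimeoutAwareSource.out")
  override val shape: SourceShape[String] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler {
      // Only ever touched from within the stage's own callbacks.
      private var currentTimeoutId: Long = -1L

      override def preStart(): Unit = {
        val id = 1L
        currentTimeoutId = id
        // The async callback runs on the stage's thread, so comparing and
        // clearing currentTimeoutId here is race-free.
        val timedOut = getAsyncCallback[Long] { firedId =>
          if (firedId == currentTimeoutId)
            failStage(new RuntimeException(s"timed out after $timeout"))
        }
        // The scheduled task runs on a scheduler thread: it must not touch
        // stage state directly, it only hands the event over to the callback.
        materializer.scheduleOnce(timeout, new Runnable {
          def run(): Unit = timedOut.invoke(id)
        })
      }

      override def onPull(): Unit = {
        currentTimeoutId = -1L // completed in time; a late timer fire is ignored
        push(out, "response")
        completeStage()
      }

      setHandler(out, this)
    }
}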

spangaer avatar Dec 14 '21 19:12 spangaer

Further study: if a materialized connectionFlow: Flow[HttpRequest, HttpResponse, Future[Http.OutgoingConnection]] of HostConnectionPoolStage, once connected, never produces an error, the slot will get stuck in WaitingForResponse.

At that point, no timeout code is active in the connection pool.

spangaer avatar Dec 14 '21 20:12 spangaer

This fix seems to acknowledge such a possibility. https://github.com/akka/akka-http/blame/v10.1.14/akka-http-core/src/main/resources/reference.conf#L342

Any chance that a heavy CPU load could make the 100 ms fall short? And if so, what is the risk of raising that value to, say, 20 seconds?

spangaer avatar Dec 14 '21 21:12 spangaer

By now I have something of a smoking gun. I'm able to reproduce it by creating load on a local system and getting some hanging slots.

(I was trying to validate whether those settings did something for me.)

With debug logging enabled, they spit out this:

[435 (WaitingForResponse)] Connection cancelled 
[534 (WaitingForResponse)] Connection cancelled 
[449 (WaitingForResponse)] Connection cancelled

I found these by first looking at one hanging slot and finding that message, then searching for the message, which occurs exactly as many times as there are hanging slots (81 times).

Looking at the code, it seems that an attempt is made to send an HttpRequest to an HTTP stream, but that stream can no longer process it and responds with a cancellation instead, which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen. That makes me think the cancellation signal should be processed as some kind of failure instead of just being logged.

spangaer avatar Dec 15 '21 10:12 spangaer

By now I've figured out that this is sort of the end of the line, but more is going on before it ever gets there. I'm working on making that visible.

spangaer avatar Dec 15 '21 15:12 spangaer

Ok, I went through 7 hells to exert the right amount of stress to reproduce the problem and at the same time not lose the debug logging, but I think I'm there.

Note: for things that have the same timestamp, you cannot trust the order, as it gets sorted retroactively and there's no extra counter to ensure ordering.

Time	message	thread_name
Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Dispatching request [GET /id/current/prpl Empty]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Before event [onNewRequest] In state [Unconnected] for [105770 ms]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299	[698 (Connecting)] After event [onNewRequest] State change [Unconnected] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299	[698 (Unconnected)] Establishing connection	listofnumbers-akka.actor.default-dispatcher-33
Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2201 ms]	listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Connection attempt succeeded	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [7402 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Slot connection was established	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:43:03.183	[698 (WaitingForResponse)] Connection cancelled	listofnumbers-akka.actor.default-dispatcher-49
Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-47
Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [69286 ms]	listofnumbers-akka.actor.default-dispatcher-47

I'm still in the process of analyzing what that sequence of events means, but for blocking slots it fairly consistently looks like this, so I thought I could share it already.

[449 (WaitingForResponse)] Connection cancelled remains the tell for a blocked slot.

spangaer avatar Dec 16 '21 08:12 spangaer

Thanks for this analysis, @spangaer. I'm going to have a quick look to see if anything stands out. Have you already tried with Akka HTTP 10.2.7? In any case, a fix will only appear in 10.2.7, as 10.1.x is now in deep maintenance mode.

jrudolph avatar Dec 16 '21 09:12 jrudolph

Which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen.

Yes, that is the assumption, and it is made in many places. Even if we treated cancellation as an error in the pool, the connection would still be active, which could be a kind of memory leak.

This fix seems to acknowledge such a possibility. v10.1.14/akka-http-core/src/main/resources/reference.conf#L342 (blame)

Any chance that a heavy CPU load could make the 100 ms fall short? And if so, what is the risk of raising that value to, say, 20 seconds?

You can certainly try; it might keep some things alive for a bit longer, but that shouldn't be a problem. I'm not sure it would make a difference, because it might still not explain why the completion/failure is missing/hanging.

Would it be possible to make a tcpdump and send it to me? You could use tcpdump -i <interface> -s 100 -w <file>; this would keep at most 100 bytes per packet. A memory dump might also help. (No promises on when I might get back to this, though; I will be off for most of the weeks to come.)

jrudolph avatar Dec 16 '21 09:12 jrudolph