akka-http
HTTP client pool stops processing requests with "Exceeded configured max-open-requests value"
We use sttp v2 with the single-request API (no streams). Some configs:
akka.http.host-connection-pool {
max-connections = 64
max-open-requests = 128 // Must be a power of 2
}
akka.http.client {
idle-timeout = 5 seconds
connecting-timeout = 1 second // this is ignored when used with sttp, but we introduced a workaround
}
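For reference, this is roughly what the call path looks like on our side. This is only an illustrative sketch: the host and endpoint names are made up, and it assumes sttp 2.x with the Akka HTTP backend.

import scala.concurrent.Future
import sttp.client._
import sttp.client.akkahttp.AkkaHttpBackend

// One shared backend for the whole application; it owns the underlying Akka HTTP pools.
implicit val backend: SttpBackend[Future, Nothing, NothingT] = AkkaHttpBackend()

// Single-request API: the response body is read into memory, no streaming.
val response: Future[Response[Either[String, String]]] =
  basicRequest
    .get(uri"http://typing/some/endpoint")
    .send()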
At some point in time this happens:
Here the requests to typing stopped working at 17:16, shortly after the server started at 17:11. Such a freeze can happen at any point of the instance's life though, even days after the start.
Here are some hypotheses that we considered:
- The result stream is somehow not consumed and clogs the request pool -- stream consumption is handled by the sttp adapter (see the sketch after this list).
- The freeze happens due to the pool's being overloaded -- we introduced timeouts and even with them the pools still freeze until the instance is restarted.
- Threads in the threadpool are being blocked -- in the heapdump most of the threads are either RUNNING or WAITING for new work.
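For context on the first hypothesis: with plain Akka HTTP every response entity must be read or explicitly discarded, otherwise the connection slot stays occupied; sttp's adapter is supposed to take care of this for us. A minimal sketch of that requirement, assuming Akka 2.6 and an illustrative host:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.Materializer

implicit val system: ActorSystem = ActorSystem("client")
implicit val mat: Materializer = Materializer(system)
import system.dispatcher

Http().singleRequest(HttpRequest(uri = "http://typing/health")).map { response =>
  // Either read response.entity.dataBytes fully, or discard it explicitly;
  // an untouched entity keeps the pooled connection busy.
  response.entity.discardBytes()
  response.status
}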
Error log:
akka.stream.BufferOverflowException: Exceeded configured max-open-requests value of [128]. This means that the request queue of this pool (HostConnectionPoolSetup(typing,80,ConnectionPoolSetup(ConnectionPoolSettings(64,0,2,128,1,1 minute,100 milliseconds,15 seconds,30 seconds,ClientConnectionSettings(Some(User-Agent: gateway),3 seconds,300 milliseconds,512,None,WebSocketSettings(<function0>,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$236/564086167@db159b5),List(),ParserSettings(2048,16,64,64,8192,64,Some(33554432),8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$217/203048480@5a78e22),100 milliseconds,None,Http2ClientSettingsImpl(256,65536,10000000,512000,1024,false,None),TCPTransport),1 second,List()),akka.http.scaladsl.HttpConnectionContext$@20049e4b,akka.event.MarkerLoggingAdapter@116df1d))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
at akka.http.impl.engine.client.PoolInterface$Logic.<init>(PoolInterface.scala:106)
at akka.http.impl.engine.client.PoolInterface$PoolInterfaceStage.createLogicAndMaterializedValue(PoolInterface.scala:94)
at akka.stream.impl.GraphStageIsland.materializeAtomic(PhasedFusingActorMaterializer.scala:700)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:499)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:449)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:441)
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:703)
at akka.http.impl.engine.client.PoolInterface$.apply(PoolInterface.scala:80)
at akka.http.impl.engine.client.PoolMasterActor.akka$http$impl$engine$client$PoolMasterActor$$startPoolInterface(PoolMasterActor.scala:131)
at akka.http.impl.engine.client.PoolMasterActor$$anonfun$receive$1.applyOrElse(PoolMasterActor.scala:162)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at akka.http.impl.engine.client.PoolMasterActor.aroundReceive(PoolMasterActor.scala:111)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
at akka.actor.ActorCell.invoke(ActorCell.scala:543)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
at akka.dispatch.Mailbox.run(Mailbox.scala:230)
at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
I don't have much experience with sttp. A minimal project that shows the problem in isolation and can be used to further diagnose the problem might be a good start?
We cannot provide a minimal project at all. The same project runs in our staging environment without this issue; it only happens in production under heavy load. sttp should not be a problem here, as these errors come from akka.
The freeze happens due to the pool's being overloaded -- we introduced timeouts and even with them the pools still freeze until the instance is restarted.
I think this is the most likely case. Since timeouts currently don't purge requests from the pool, you can end up in a situation where all requests will be handled too late and eventually time out (especially since the queueing time is part of the timeout).
Dealing with server overload is a hard problem and there's no good solution to it currently. After all, the goal must be to shave off as much work from the server as possible when you detect the problem (however, where should those requests then go?).
Here are some hints at what you could try to do:
- provide more capacity at the server
- keep queues as small as possible (max-open-requests even smaller than 128) and then put a buffer before the pool that you manage actively. Add a circuit breaker in front of it that rejects requests directly when overload is detected.
- To avoid having to restart the app completely, when you detect these kinds of situations, you can shut down the pool manually. You can get a handle to the pool (HostConnectionPool) by materializing Http().cachedHostConnectionPoolHttps for the host in question and calling HostConnectionPool.shutdown. There's also Http().shutdownAllConnectionPools() (though I think that might leave some requests hanging). A sketch of the shutdown approach follows below.
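A minimal sketch of the shutdown approach, assuming Akka 2.6 and using an illustrative host name:

import akka.Done
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("client")

// Materializing the cached pool flow once gives you a handle to the shared pool.
val poolFlow = Http().cachedHostConnectionPoolHttps[Int]("typing")
val (pool, done) =
  Source.single(HttpRequest(uri = "/health") -> 1)
    .viaMat(poolFlow)(Keep.right)
    .toMat(Sink.ignore)(Keep.both)
    .run()

// When you detect a stuck pool, shut it down instead of restarting the whole app;
// a later materialization for the same host starts a fresh pool.
val shutdown: Future[Done] = pool.shutdown()

// Or, more drastically (this might leave some in-flight requests hanging):
// Http().shutdownAllConnectionPools()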
If you can provide debug logging for the pool (filter for class akka.http.impl.engine.client.PoolId), we can look into what other causes could lead to this behavior. Which version of akka-http do you use?
I'm afraid we don't have PoolId yet -- we used 10.1.2 at the time of creating the memory dump. Right now we use 10.2.0 (downgraded from 10.2.1 due to a bug with S3) and observe the same problem.
However, I see objects of class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot and see that most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.
However, I see objects of class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot and see that most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.
Yes
We followed your advice on keeping queues as small as possible and used the following configs:
akka.http.host-connection-pool {
max-connections = 256
max-open-requests = 256 // Must be a power of 2
}
Still, shortly after startup we encountered the same problem on one of our instances: all requests to a single hostname failed, without a single successful request completion during 30 minutes. We will retrieve and analyze the heap dump a bit later; what should we look for besides PoolId objects?
DEBUG log information might be more interesting than looking through the dumps, because it will also give a timeline of what's going on.
Can you specify which exact loggers we should enable debug-level logging for? We cannot afford to store all debug logs, sadly.
akka.http.impl.engine.client.PoolId might be enough, but otherwise everything under akka.http would help.
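For example, if you route Akka logging through SLF4J/Logback (just a sketch of one possible setup; adjust to whatever logging backend you actually use), you can raise only that one logger to DEBUG:

# application.conf: let DEBUG events reach the SLF4J backend at all
akka {
  loglevel = "DEBUG"
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
}

<!-- logback.xml: keep everything else at INFO, only the pool at DEBUG -->
<root level="INFO"> ... </root>
<logger name="akka.http.impl.engine.client.PoolId" level="DEBUG"/>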
We cannot afford to enable debug logs for these loggers, as it increases our log volume tenfold; we simply do not have the resources to store that many logs at our scale.
Following the advice, we disabled the queue by setting both max-connections and max-open-requests to 256. This did not help us; we still have stuck client pools.
All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (the moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?
All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (the moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?
That means that all requests are waiting for responses from the server. The disconnectAt timeout is only applied between requests. Since we don't enforce any response timeouts currently, the connections will only be reclaimed when the idle timeout kicks in. Are you saying that not even the idle-timeout kicks in?
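As a stopgap, you can put an application-level deadline around the request future yourself. A minimal sketch (names are illustrative; note that, as discussed above, this only fails the caller, the slot itself is not freed until the idle-timeout reclaims the connection):

import akka.actor.ActorSystem
import akka.pattern.after
import scala.concurrent.{Future, TimeoutException}
import scala.concurrent.duration._

def withDeadline[T](deadline: FiniteDuration)(request: => Future[T])(implicit system: ActorSystem): Future[T] = {
  import system.dispatcher
  // Fail the caller if no response arrives within the deadline.
  Future.firstCompletedOf(Seq(
    request,
    after(deadline, system.scheduler)(
      Future.failed(new TimeoutException(s"no response within $deadline")))))
}

// usage (sendTypingRequest is a placeholder for your own client call):
// withDeadline(10.seconds)(sendTypingRequest(req))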
Nope, we have idle-timeout set to 5 seconds and it kicks in for some requests judging by the logs, but disconnectAt has a value of the current time minus ~30 minutes, i.e. these requests have been waiting for responses for around half an hour. Metrics indicate the same: no requests are completed for a long time.
Hard to say what could be the reason for the idle-timeout not being triggered. I think that's the first time I've heard of the idle-timeout not working. This is just a wild guess, but if you use Kamon, you could try running without it. Hard to diagnose without more info.
No, we don't use Kamon. By metrics I mean our own non-invasive metrics around request invocations.
Looks like we get to join the dance (Akka HTTP 10.1.x). I don't have any more information to bring to the table other than that @vilunov isn't the only one suffering from it.
In our case we use rather big pools.
client{
connecting-timeout = 15s
idle-timeout = 50s
}
host-connection-pool{
idle-timeout = 500s
max-connections = 32767
min-connections = 0
max-open-requests = 131072
}
They don't time out and just get stuck for hours. I've seen it auto-resolve in some cases, but the current occurrence has been stuck for about 24h now.
Another thing I can add is that I have code doing Akka HTTP to Akka HTTP calls that doesn't seem to suffer, while the affected calls have an ALB sitting in between.
Could this somehow be related to this failure? https://github.com/akka/akka-http/issues/3916
If this happens with high CPU load in the pool, then it might be related to the issue fixed in 67ac11c280c74a0ee9904e3aeff4fe719d38836f. I'd recommend updating to 10.2.x for big pools (> 500-1000 connections).
Does that commit fix some slot release escapes somehow?
(I've looked at the commit, but it doesn't seem to be apparent from there to me)
Yes, high CPU load would have been involved here.
Does that commit fix some slot release escapes somehow?
(I've looked at the commit, but it doesn't seem to be apparent from there to me)
Yes, high CPU load would have been involved here.
That commit fixes that the pool management logic itself spins and contributes to CPU load up to a degree where little useful work can be done by the pool. If you are saying it isn't making any progress at all, it might still be something different, but even in that case, it would be good to try on 10.2 and see if it shows the same issue.
Well, the state is 32k slots in WaitingForResponse, with seemingly all SlotConnections claiming to be connected, while nothing has been moving anymore (since last evening) and server-side and ALB timeouts have long expired.
From what you're saying it seems like I should expect a single CPU cooking all this time, which is not the case.
It seems to involve some kind of connectivity issue whose errors escape the slot-release process.
While studying this code (v10.1.x, but still valid): https://github.com/akka/akka-http/blob/v10.2.7/akka-http-core/src/main/scala/akka/http/impl/engine/client/pool/NewHostConnectionPool.scala#L289
Isn't that a race condition?
If lines 271 and 288 are reached at the same time, two invocations of updateState() will be running side by side and the outcome will be rather unpredictable.
currentTimeoutId should be volatile, because it is read from a different thread than the one clearing/changing it, but even that won't cut it, I fear. It seems like it either needs checks and sets on both sides, in the right order (or some locking), or ensuring that it all happens inside the actor "event loop".
Or am I missing something?
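For what it's worth, the usual way to avoid this class of race in Akka Streams is to keep timer handling inside the stage's own single-threaded event loop (timers on TimerGraphStageLogic, or getAsyncCallback for external events). A toy sketch of that pattern, not the actual pool code:

import akka.stream._
import akka.stream.stage._
import scala.concurrent.duration._

class TimeoutEcho extends GraphStage[FlowShape[String, String]] {
  val in = Inlet[String]("TimeoutEcho.in")
  val out = Outlet[String]("TimeoutEcho.out")
  val shape = FlowShape(in, out)

  def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new TimerGraphStageLogic(shape) with InHandler with OutHandler {
      // scheduleOnce/onTimer run on the stage's own event loop, so stage-local
      // state such as a timeout id needs no volatile fields or locking.
      override def preStart(): Unit = scheduleOnce("timeout", 5.seconds)
      override protected def onTimer(timerKey: Any): Unit =
        failStage(new RuntimeException("timed out"))
      override def onPush(): Unit = push(out, grab(in))
      override def onPull(): Unit = pull(in)
      setHandlers(in, out, this)
    }
}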
Further study: if a materialized connectionFlow: Flow[HttpRequest, HttpResponse, Future[Http.OutgoingConnection]] of the HostConnectionPoolStage, once connected, never produces an error, the slot will get stuck in WaitingForResponse.
At that point, no timeout code is active in the connection pool.
This fix seems to acknowledge such a possibility: https://github.com/akka/akka-http/blame/v10.1.14/akka-http-core/src/main/resources/reference.conf#L342
Any chance that a heavy CPU load could make the 100 ms fall short? And if so, what is the risk of raising that value to, say, 20 seconds?
By now I have something of a smoking gun. I'm able to reproduce the problem by creating load on a local system and getting some hanging slots.
(I was trying to validate whether those settings did something for me.)
With debug logging enabled, they spit out this:
[435 (WaitingForResponse)] Connection cancelled
[534 (WaitingForResponse)] Connection cancelled
[449 (WaitingForResponse)] Connection cancelled
I found these by first looking at one hanging slot and finding that message, then searching for that message, which occurs exactly as many times as there are hanging slots (81 times).
When looking at the code, it seems to tell me that an attempt is made to send an HttpRequest to an HTTP stream, but that stream can no longer process it and responds with a cancellation instead, which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen. Which makes me think that the cancellation signal should be processed as some kind of failure instead of just being logged.
By now I've figured out that this is sort of the end of the line, but more is going on before it ever gets there. I'm working on making that visible.
Ok, I went through seven hells to exert the right amount of stress to reproduce the problem and at the same time not lose the debug logging, but I think I'm there.
Note: for entries that share the same timestamp you cannot trust the order, as they get sorted retroactively and there's no extra counter to ensure ordering.
Time message thread_name
Dec 15, 2021 @ 21:42:29.298 [698 (Unconnected)] Dispatching request [GET /id/current/prpl Empty] listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.298 [698 (Unconnected)] Before event [onNewRequest] In state [Unconnected] for [105770 ms] listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299 [698 (Connecting)] After event [onNewRequest] State change [Unconnected] -> [Connecting] listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299 [698 (Unconnected)] Establishing connection listofnumbers-akka.actor.default-dispatcher-33
Dec 15, 2021 @ 21:42:31.501 [698 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting] listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:31.501 [698 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2201 ms] listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:38.903 [698 (Connecting)] Connection attempt succeeded listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903 [698 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [7402 ms] listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903 [698 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse] listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903 [698 (Connecting)] Slot connection was established listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903 [698 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection] listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903 [698 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms] listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:43:03.183 [698 (WaitingForResponse)] Connection cancelled listofnumbers-akka.actor.default-dispatcher-49
Dec 15, 2021 @ 21:43:48.193 [698 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse] listofnumbers-akka.actor.default-dispatcher-47
Dec 15, 2021 @ 21:43:48.193 [698 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [69286 ms] listofnumbers-akka.actor.default-dispatcher-47
I'm still in the process of analyzing what that sequence of events means, but for blocking slots it fairly consistently looks like this, so I thought I could share it already.
The [449 (WaitingForResponse)] Connection cancelled message remains the telltale sign of a blocking slot.
Thanks for this analysis, @spangaer. I'm going to have a quick look to see if anything stands out. Have you already tried with Akka HTTP 10.2.7? In any case, a fix will only appear in 10.2.7 as 10.1.x is now in deep maintenance mode.
Which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen.
Yes, that is the assumption, and it is made in many places. Even if we treated cancellation as an error in the pool, the connection would still be active, which could be a kind of memory leak.
This fix seems to acknowledge such a possibility: https://github.com/akka/akka-http/blame/v10.1.14/akka-http-core/src/main/resources/reference.conf#L342
Any chance that a heavy CPU load could make the 100 ms fall short? And if so, what is the risk of raising that value to, say, 20 seconds?
You can certainly try; it might keep some things alive for a bit longer, but that shouldn't be a problem. Not sure that it would make a difference, because it might still not explain why the completion/failure is missing/hanging.
Would it be possible to make a tcpdump and send it to me? You could use tcpdump -i <interface> -s 100 -w <file>. This would keep at most 100 bytes per packet. A memory dump might also help. (No promises on when I might get back to this, though; I will be off for most of the weeks to come.)