reactor-core
Severe performance degradation with instrumentation-type DECORATE_QUEUES
We had an issue with the spring-sleuth context not being propagated to every thread, so we switched the instrumentation type to DECORATE_QUEUES, which solved the issue in our tests. In production, however, we faced severe performance degradation, long response times, and failed requests (aborted SocketChannels etc.). The Kubernetes pods requested 2 to 3 times more CPU than the previous version with the default instrumentation type.
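For reference, the instrumentation type in question is the Spring Cloud Sleuth property spring.sleuth.reactor.instrumentation-type, normally set in application.yml or application.properties. Below is a minimal sketch of pinning it programmatically instead; GatewayApplication is a hypothetical main class standing in for the actual application:

```java
import java.util.Properties;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class GatewayApplication {
    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(GatewayApplication.class);
        // Same effect as spring.sleuth.reactor.instrumentation-type=decorate_queues
        // in application.properties; the documented values are decorate_on_each,
        // decorate_queues, and manual.
        Properties defaults = new Properties();
        defaults.setProperty("spring.sleuth.reactor.instrumentation-type", "decorate_queues");
        app.setDefaultProperties(defaults);
        app.run(args);
    }
}
```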
We see the following exceptions thrown:
Exception: io.netty.channel.StacklessClosedChannelException
Stacktrace:
io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:619)
io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352)
io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606)
io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.close(CombinedChannelDuplexHandler.java:505)
io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:77)
io.netty.channel.CombinedChannelDuplexHandler.close(CombinedChannelDuplexHandler.java:316)
io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606)
io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:472)
io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:957)
io.netty.channel.AbstractChannel.close(AbstractChannel.java:244)
reactor.netty.http.client.HttpClientOperations.onInboundCancel(HttpClientOperations.java:275)
reactor.netty.channel.FluxReceive.unsubscribeReceiver(FluxReceive.java:476)
reactor.netty.channel.FluxReceive.lambda$
reactor.netty.channel.FluxReceive$$Lambda$.dispose
reactor.netty.channel.FluxReceive.cancelReceiver(FluxReceive.java:194)
reactor.netty.channel.FluxReceive.cancel(FluxReceive.java:97)
reactor.netty.channel.ChannelOperations.discard(ChannelOperations.java:366)
reactor.netty.channel.ChannelOperations.dispose(ChannelOperations.java:201)
reactor.core.publisher.MonoCreate$DefaultMonoSink.onCancel(MonoCreate.java:231)
reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onNext(HttpClientConnect.java:299)
reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onNext(HttpClientConnect.java:284)
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.lambda$
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber$$Lambda$.run
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.withActiveSubpath(TracingReactorSubscriber.java:99)
com.dynatrace.agent.introspection.reactor3.TracingReactorSubscriber.onNext(TracingReactorSubscriber.java:67)
reactor.core.publisher.MonoCreate$DefaultMonoSink.success(MonoCreate.java:165)
reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onStateChange(DefaultPooledConnectionProvider.java:182)
reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnection.onStateChange(DefaultPooledConnectionProvider.java:440)
reactor.netty.channel.ChannelOperationsHandler.channelActive(ChannelOperationsHandler.java:62)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209)
io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelActive(CombinedChannelDuplexHandler.java:412)
io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:69)
io.netty.channel.CombinedChannelDuplexHandler.channelActive(CombinedChannelDuplexHandler.java:211)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216)
io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1398)
Exception: reactor.netty.channel.AbortedException
Message: Connection has been closed BEFORE send operation
Stacktrace:
reactor.netty.http.client.HttpClientOperations.send(HttpClientOperations.java:408)
org.springframework.cloud.gateway.filter.NettyRoutingFilter.lambda$
org.springframework.cloud.gateway.filter.NettyRoutingFilter$$Lambda$.apply
reactor.netty.http.client.HttpClientConnect$HttpClientHandler.requestWithBody(HttpClientConnect.java:583)
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver.lambda$
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver$$Lambda$.get
reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:44)
reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
reactor.netty.http.client.HttpClientConnect$HttpIOHandlerObserver.onStateChange(HttpClientConnect.java:424)
reactor.netty.ReactorNetty$CompositeConnectionObserver.onStateChange(ReactorNetty.java:677)
reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.run(DefaultPooledConnectionProvider.java:270)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:829)
Expected Behavior
As mentioned in the documentation, I expected the same functional behaviour with perhaps a slight performance impact (the documentation mentions a LOW performance impact).
Actual Behavior
The CPU consumption of the application increases drastically with this instrumentation type, and we see many exceptions about closed and aborted SocketChannels.
Steps to Reproduce
Switch to instrumentation-type DECORATE_QUEUES and fire around 200 requests per second at one instance of the application.
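A minimal load-generator sketch for that step, assuming the application listens on http://localhost:8080 (host, path, and duration are illustrative):

```java
import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.netty.http.client.HttpClient;

public class LoadSketch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.create();
        Flux.interval(Duration.ofMillis(5))   // one tick every 5 ms ~ 200 requests/s
            .take(Duration.ofSeconds(60))     // run for one minute
            .flatMap(tick -> client.get()
                    .uri("http://localhost:8080/some-route") // hypothetical route
                    .response()
                    .onErrorResume(e -> Mono.empty()))       // observe errors on the app side
            .blockLast();
    }
}
```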
Possible Solution
We took a short look at the implementation and noticed that this feature uses a LinkedHashMap (a hash map backed by a doubly-linked entry list). Maybe a LinkedHashMap is not the right choice here?
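To make that hypothesis concrete, here is a purely illustrative sketch (not the actual reactor-core/Sleuth code) of why per-element context bookkeeping through a LinkedHashMap on the queue hot path could get expensive: every offer/poll pays for locking, hashing, and linked-entry maintenance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: models the suspected cost pattern, not the real implementation.
class ContextTrackingQueue<T> {

    private final Queue<T> delegate = new ConcurrentLinkedQueue<>();
    // LinkedHashMap maintains a doubly-linked list over its entries; every
    // put/remove updates that list in addition to the hash table, and the map
    // is not thread-safe, so callers must synchronize around it.
    private final Map<T, Object> contexts = new LinkedHashMap<>();

    synchronized boolean offer(T element, Object context) {
        contexts.put(element, context);  // first map operation per element
        return delegate.offer(element);
    }

    synchronized T poll() {
        T element = delegate.poll();
        if (element != null) {
            contexts.remove(element);    // second map operation per element
        }
        return element;
    }
}
```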
Your Environment
The application routes requests to other services depending on some parameters (normally very short request/response times). Some parameters are cached so we don't need to check them every time. When we do need to check them (cache evicted), we explicitly start a new thread because the check consists of a synchronous API call:
    .onCacheMissResume(() ->
        Mono.fromCallable(() -> kubernetesRepository.callTokenReviewEndpoint(authorizationHeader))
            .subscribeOn(Schedulers.boundedElastic()))
    ...
In these cases the Sleuth context was missing with the default instrumentation-type. With decorate_queues the context is present.
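One way to double-check what is described here (a hedged sketch; kubernetesRepository and authorizationHeader are from the snippet above, and "traceId" is assumed to be the MDC key Sleuth populates):

```java
import org.slf4j.MDC;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// Prints the trace id as seen on the boundedElastic worker: with working queue
// decoration it matches the id logged before the thread jump; if the context is
// lost, MDC.get("traceId") is null here.
Mono.fromCallable(() -> {
        System.out.println(Thread.currentThread().getName()
                + " traceId=" + MDC.get("traceId"));
        return kubernetesRepository.callTokenReviewEndpoint(authorizationHeader);
    })
    .subscribeOn(Schedulers.boundedElastic());
```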
- Reactor version(s) used: 3.4.15
- Other relevant libraries versions (e.g. netty, ...): Spring Boot 2.6.4, Spring Cloud 2021.0.1
- JVM version (java -version): 11.0.16
- OS and version (e.g. uname -a): Docker image ubi8/openjdk-11
@jnt0r can you please provide a small isolated application that reflects your environment and demonstrates the mentioned problem?
Hi @OlegDokuka, currently I don't have time to create a sample application. It's basically a Spring Cloud Gateway application with two explicit thread jumps (as described in my initial description) where we switched the instrumentation-type from the default to decorate_queues, which led to a reproducible, huge performance impact (CPU usage, response time, ...). We think the implementation using a LinkedHashMap is causing the problem, as it requires many CPU-intensive operations.
We'll keep the issue in the backlog. If you find the time to prepare a reproducer, we'd be happy to take a look and profile it. Without an isolated reproducer it will be difficult to address the issue.
We figured out that it was more of a documentation mismatch between reactor-core and Spring Cloud Gateway. The defaults for the instrumentation-type differ because Spring Cloud Gateway overrides the default to MANUAL. That makes it clear why we faced such a huge performance impact when switching to decorate_queues: we thought we were switching from decorate_on_each to decorate_queues, but we actually switched from manual to decorate_queues.
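A small sanity check that would have surfaced this mismatch early (a sketch; it simply logs the resolved property, assuming the override is visible through the Spring Environment):

```java
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

@Component
class InstrumentationTypeLogger {
    InstrumentationTypeLogger(Environment env) {
        // Logs the effective value; null means a library-level default applies,
        // which in Spring Cloud Gateway is MANUAL rather than reactor-core's default.
        System.out.println("spring.sleuth.reactor.instrumentation-type = "
                + env.getProperty("spring.sleuth.reactor.instrumentation-type"));
    }
}
```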
Thank you for following up. That makes sense. I think we can consider closing the issue then?
Yes, we can close the issue. I would suggest pointing out the differences more clearly in the Spring Cloud Gateway documentation.
@jnt0r please go ahead and file a report in https://github.com/spring-cloud/spring-cloud-gateway/issues then, thanks. I see @OlegDokuka already closed this issue.