
Micrometer creating too many SDK threads despite using NIO cloudWatchAsyncClient

samarprakash opened this issue 3 years ago • 3 comments

Describe the bug

We create a CloudWatchAsyncClient, which Micrometer's CloudWatchMeterRegistry class expects us to pass in for pushing metrics to AWS CloudWatch. We don't specify any custom asyncConfiguration when creating the client. When we then push metrics to CloudWatch through the Micrometer library, we notice that a lot of threads are created with the name "sdk-async-response".

What we understand about how the AWS SDK async library works is as follows: it uses Netty underneath for the I/O calls, and to complete the futures returned from the HTTP client request it creates, by default, an ExecutorService to which any whenComplete/thenApply/etc. work is delegated. This abstraction reduces the risk of an application breaking the async machinery if someone continues on the NIO threads and starts doing blocking work there. The behavior is well explained here - https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/asynchronous.html
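
To make the delegation concrete, here is a minimal, self-contained sketch of the mechanism (no AWS classes involved; the executors and thread names are stand-ins invented for illustration). The caller-visible future is completed from a dedicated executor, so its callbacks run on that executor rather than on the I/O thread, which mirrors what the SDK's default future-completion executor does:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FutureCompletionDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the Netty event-loop thread that finishes the I/O call.
        ExecutorService ioThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "fake-nio-thread"));
        // Stand-in for the SDK's default "sdk-async-response" future-completion executor.
        ExecutorService completionExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "fake-sdk-async-response"));

        CompletableFuture<String> ioFuture = new CompletableFuture<>();
        CompletableFuture<String> callerFuture = new CompletableFuture<>();
        // Complete the caller-visible future from the completion executor, so callbacks
        // registered on it never run on the I/O thread.
        ioFuture.whenCompleteAsync((r, t) -> callerFuture.complete(r), completionExecutor);

        callerFuture.whenComplete((r, t) ->
                System.out.println("whenComplete runs on: " + Thread.currentThread().getName()));

        ioThread.submit(() -> ioFuture.complete("response")); // the "I/O" finishes here
        Thread.sleep(200);
        ioThread.shutdown();
        completionExecutor.shutdown();
    }
}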

The issue we are trying to highlight is that when Micrometer completes the I/O call, the post-completion work is unnecessarily switched to a different thread, "sdk-async-response" in this case. If you look at the Micrometer implementation here (full block of code) - https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-cloudwatch2/src/main/java/io/micrometer/cloudwatch2/CloudWatchMeterRegistry.java#L113-L131 - the whenComplete callback runs on an "sdk-async-response" thread, while the main thread waits until the latch is counted down; once that happens, the remaining work continues on the main thread anyway. So Micrometer switches from "aws-java-sdk-NettyEventLoop" to an "sdk-async-response" thread only for the whenComplete block.

Think of a high-volume analytics system that pushes millions of messages to CloudWatch: the context switches can have severe overhead, especially in this particular case where we aren't gaining anything at all. We expect Micrometer to continue on the same Netty "aws-java-sdk-NettyEventLoop" thread, because I don't see any blocking work done by Micrometer. Looking at the CloudWatchMeterRegistry code at line 128, Micrometer waits on the main thread anyway, and once the latch is released it continues the remaining operation on that same main thread. So what is the point of switching to the "sdk-async-response" thread, especially when the whenComplete callback of the CompletableFuture does no significant work - all it does is check the response/error and log?

I've put together reproducible code with System.err output (see "To Reproduce" below), which I hope makes clear the issue we are trying to highlight.

The other way to fix this is for us, as consumers of the Micrometer library, to specify our own async configuration, something like the snippet below, to force Micrometer to continue on the same Netty "aws-java-sdk-NettyEventLoop" thread. But if in the future Micrometer does any blocking work after broadcasting the metrics to CloudWatch, that could slow down or severely impact our system's performance. I don't see any reason for Micrometer to do such blocking work after the I/O call, but we still wanted confirmation from the Micrometer team: what is the alternative here, or is the configuration below the only way to proceed?

.asyncConfiguration(ClientAsyncConfiguration.builder()
                    .advancedOption(SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR, Runnable::run)
                    .build())
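
For completeness, here is a hedged sketch of what a full client built with that workaround might look like (region and credentials resolution are assumed to come from the environment):

import software.amazon.awssdk.core.client.config.ClientAsyncConfiguration;
import software.amazon.awssdk.core.client.config.SdkAdvancedAsyncClientOption;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

static final CloudWatchAsyncClient CLIENT = CloudWatchAsyncClient.builder()
        .asyncConfiguration(ClientAsyncConfiguration.builder()
                // Runnable::run is a valid Executor: it completes the future directly
                // on the calling (Netty event-loop) thread instead of "sdk-async-response".
                .advancedOption(SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR, Runnable::run)
                .build())
        .build();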

Environment

  • Micrometer version: 1.8.4
  • Micrometer registry: cloudwatch
  • OS: any OS where the Netty library works
  • Java version: 11

To Reproduce

cloudWatchAsyncClient.putMetricData(putMetricDataRequest).whenComplete((response, t) -> {
    // Runs on an "sdk-async-response" thread (the SDK's default future-completion executor).
    System.err.println("inside whenComplete of cloudwatch by thread: " + Thread.currentThread().getName());
    if (t != null) {
        if (t instanceof AbortedException) {
            logger.warn("sending metric data was aborted: {}", t.getMessage());
        } else {
            logger.error("error sending metric data.", t);
        }
    } else {
        logger.debug("published metric with namespace:{}", putMetricDataRequest.namespace());
    }
    latch.countDown();
});
try {
    @SuppressWarnings("deprecation")
    long readTimeoutMillis = config.readTimeout().toMillis();
    // The caller blocks here on the main thread until the latch is counted down (or the wait times out).
    System.err.println("waiting for cloudwatch to complete by thread: " + Thread.currentThread().getName());
    latch.await(readTimeoutMillis, TimeUnit.MILLISECONDS);
    // The remaining work continues on the main thread, not on the "sdk-async-response" thread.
    System.err.println("continuing remaining work by thread: " + Thread.currentThread().getName());
} catch (InterruptedException e) {
    logger.warn("metrics push to cloudwatch took longer than expected");
    throw e;
}

Expected behaviour

I expect Micrometer to continue on the same Netty "aws-java-sdk-NettyEventLoop" thread rather than context switching to "sdk-async-response".

Additional context

  1. The code which can explain the issue: https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-cloudwatch2/src/main/java/io/micrometer/cloudwatch2/CloudWatchMeterRegistry.java#L113-L131

  2. The way AWS SDK Async library works - https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/asynchronous.html

  3. The way AWS SDK switches to internally defined executor - https://github.com/aws/aws-sdk-java-v2/blob/918aba21bc05e1c22ce8018fde081224a1f7bee3/core/sdk-core/src/main/java/software/amazon/awssdk/core/client/builder/SdkDefaultClientBuilder.java#L269-L277

  4. The way we create CloudwatchAsyncClient is

static final CloudWatchAsyncClient CLIENT = CloudWatchAsyncClient.builder()
        // the retry policy must be applied to the override-configuration builder,
        // otherwise it is built but never used
        .overrideConfiguration(configuration -> configuration.retryPolicy(RetryPolicy.builder()
                .retryCondition(OrRetryCondition.create(
                        RetryCondition.defaultRetryCondition(),
                        RateLimitRetryCondition.getInstance()))
                .backoffStrategy(BackoffStrategy.defaultStrategy())
                // if retry due to throttling - RetryOnThrottlingCondition
                .throttlingBackoffStrategy(BackoffStrategy.defaultThrottlingStrategy())
                .numRetries(MAX_RETRY)
                .build()))
        .build();

final CloudWatchConfig config = CommonCloudwatchConfig.builder()
        .appName("some-app-name")
        .prefix("some-prefix")
        .build();
return new CloudWatchMeterRegistry(config, Clock.SYSTEM, CloudWatchClientFactory.asyncClient());

samarprakash · Apr 14 '22 06:04

@shakuzen, any updates on this issue?

samarprakash · Apr 21 '22 06:04

Closing this since there is no update - we will fix it on our end.

samarprakash · May 02 '22 13:05

Hey @samarprakash,

Thank you for opening this issue, and sorry for the late response. I think I understand your issue, and we should consider fixing this. But before we go there, could you please help us understand the "seriousness" of this issue? (I'm OK with fixing it if we can, even if it is not so serious; I'm just curious about the results in case you have measurements.)

Here's what I think about the "seriousness" of this issue. You wrote:

Think of a high volume analytics system that pushes Millions of messages to Cloudwatch, the Context switches can have severe overhead and especially in this particular case where we aren't gaining anything at all.

I think it's worth calling out that this only happens when you publish metrics, and shipping the data is batched (you can change the batch size), so it should not happen very frequently.
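
For reference, both the publishing interval and the batch size are controlled by the CloudWatchConfig the registry is created with; a minimal sketch with illustrative values:

CloudWatchConfig config = new CloudWatchConfig() {
    @Override
    public String get(String key) {
        return null; // fall back to defaults for anything not overridden below
    }

    @Override
    public String namespace() {
        return "some-namespace"; // illustrative value
    }

    @Override
    public Duration step() {
        return Duration.ofMinutes(1); // how often a publish (and thus a putMetricData call) happens
    }

    @Override
    public int batchSize() {
        return 20; // metrics per PutMetricData request; illustrative value
    }
};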

We expect Micrometer to continue operation on the same Netty "aws-java-sdk-NettyEventLoop"

AmazonCloudWatchAsync hides the underlying implementation: you don't see anything Netty-related in its public API, you don't know that Netty is used under the hood, and you don't have access to it (also, I guess Amazon can change this at any time, since it is an internal implementation detail). With this in mind, I think using the default behavior is not a bad idea, but of course we can improve it if needed. Also, please note the Javadoc's opinion about using SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR:

You should typically only want to customize the future-completion Executor for a few possible reasons: [...] You know, for certain, that all of your CompletableFuture usage is strictly non-blocking, and you wish to remove the minor overhead incurred by using a separate thread. In this case, you can use Runnable::run to execute the future-completion directly from within the I/O thread.

I would like to highlight the "minor overhead" part. Of course, that is just the Javadoc's opinion, and if you have measured a performance impact because of this, I don't currently see a reason why we shouldn't change it.

Here's what I think about how to fix this issue:

As far as I can tell, the (only?) way to fix this is setting the SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR option on the builder (CloudWatchAsyncClientBuilder), but right now Micrometer does not know about the builder; it is the user's responsibility to build the client. Because this is out of our reach right now, I think the simplest and fastest thing to do is to call this potential issue out in our docs. The other thing we might be able to do is accept a builder in a new constructor and set the option there. I'm also wondering whether doing that could have negative consequences for other users who do not want their builder to be modified. What do you think? I'm happy to reopen this.
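
As an illustration of that second idea, a hypothetical constructor like the one below could accept the builder and set the option before building the client. This is only a sketch of the proposal, not an existing Micrometer API:

// Hypothetical - sketch of the proposed constructor, not part of CloudWatchMeterRegistry today.
public CloudWatchMeterRegistry(CloudWatchConfig config, Clock clock,
                               CloudWatchAsyncClientBuilder clientBuilder) {
    // Delegate to the existing (config, clock, client) constructor with the option applied.
    this(config, clock, clientBuilder
            .asyncConfiguration(ClientAsyncConfiguration.builder()
                    .advancedOption(SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR, Runnable::run)
                    .build())
            .build());
}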

jonatan-ivanov · May 04 '22 19:05

There has been no interest in this issue since May 2022. We'll close it for now; if there is still interest, please comment on this issue and we'll consider re-opening it.

marcingrzejszczak · Dec 20 '23 16:12