
Step meters do not push incomplete step values on application shutdown

Open timtebeek opened this issue 5 years ago • 20 comments

Hi! I had posted this in Slack, but thought to report it here as well while awaiting a response there.

In short: I have a Spring Boot 2.2.5 application with spring-boot-starter-actuator and micrometer-registry-elastic; when I set management.metrics.export.elastic.enabled=true, my counters no longer increment.

Not sure if this should be a Micrometer issue or one for Spring Boot; ElasticMeterRegistry is in Micrometer, whereas ElasticMetricsExportAutoConfiguration is in Spring Boot. But since the latter seems to mostly delegate and is otherwise rather minimal, I thought to report it here first.

To replicate the problem:

  1. Create a new project with: https://start.spring.io/#!type=maven-project&language=java&platformVersion=2.2.5.RELEASE&packaging=jar&jvmVersion=11&groupId=com.example&artifactId=demo&name=demo&description=Demo%20project%20for%20Spring%20Boot&packageName=com.example.demo&dependencies=actuator

  2. Add a dependency on micrometer-registry-elastic

<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-registry-elastic</artifactId>
</dependency>
  3. Create a test class containing:
@Test
void counterStaysAtZero() {
	DebugApplication.main("", "--management.metrics.export.elastic.enabled=true");
	// DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
	// DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
}
@Test
void counterAddsToOne() {
	DebugApplication.main("", "--management.metrics.export.elastic.enabled=false");
	// DebugApplication : Counter: 0.0 , [Measurement{statistic='COUNT', value=0.0}]
	// DebugApplication : Counter: 1.0 , [Measurement{statistic='COUNT', value=1.0}]
}
@SpringBootApplication
@RequiredArgsConstructor
@Slf4j
public static class DebugApplication implements ApplicationRunner {
	public static void main(String... args) {
		log.info("{}", args[1]);
		SpringApplication.run(DebugApplication.class, args).close();
	}
	private final MeterRegistry registry;
	@Override
	public void run(ApplicationArguments args) throws Exception {
		Counter counter = registry.counter("someCounter");
		log.info("Counter: {} , {}", counter.count(), counter.measure());
		counter.increment();
		log.info("Counter: {} , {}", counter.count(), counter.measure());
	}
}

If you then run both tests you will get the log output that's listed in each test method.

I would expect the Counter to increment regardless of whether metrics are exported. What can be done to make the counters increment?

timtebeek avatar Mar 06 '20 11:03 timtebeek

@timtebeek Thanks for the issue!

ElasticMeterRegistry is a StepMeterRegistry, which uses StepCounter for Counter, whereas SimpleMeterRegistry, which is used when there's no backend in Spring Boot, uses CumulativeCounter. That's the reason you see the behavioral difference. StepCounter reports the number of events in the last complete interval, as mentioned in its Javadoc, so if you check the value after the step has completed, you will see the value you expected.
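To make the difference concrete, here is a minimal model of the two semantics. These classes are simplified stand-ins written for illustration only, not the real StepCounter/CumulativeCounter implementations (which differ internally):

```java
// Simplified models of the two counter semantics (NOT the real Micrometer classes).
class CumulativeCounterModel {
    private double count;

    void increment() { count++; }

    double count() { return count; } // reflects every increment immediately
}

class StepCounterModel {
    private final long stepMillis;
    private long stepStart;
    private double current;  // accumulating in the open (incomplete) step
    private double previous; // value of the last complete step

    StepCounterModel(long stepMillis, long nowMillis) {
        this.stepMillis = stepMillis;
        this.stepStart = nowMillis;
    }

    void increment(long nowMillis) {
        roll(nowMillis);
        current++;
    }

    // Reports only the last complete interval, as StepCounter's Javadoc describes.
    double count(long nowMillis) {
        roll(nowMillis);
        return previous;
    }

    private void roll(long nowMillis) {
        if (nowMillis - stepStart >= stepMillis) {
            previous = current;
            current = 0;
            stepStart = nowMillis;
        }
    }
}

public class StepSemanticsDemo {
    public static void main(String[] args) {
        StepCounterModel step = new StepCounterModel(60_000, 0);
        step.increment(0);
        System.out.println(step.count(1_000));  // prints 0.0: the step is not complete yet
        System.out.println(step.count(61_000)); // prints 1.0: visible once the step rolls over
    }
}
```

An increment is therefore invisible to readers (and to publishing) until the step in which it happened has completed, which is exactly why a process that exits mid-step loses its final increments.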

I updated your sample to demonstrate what I said here.

izeye avatar Mar 07 '20 00:03 izeye

Thanks for the quick response @izeye! I had no idea StepMeterRegistry and StepCounter were involved in producing this effect. It affects us most for short-lived processes/jobs that increment a counter at the end of their run, before a last-gasp push of metrics to Elastic. In these cases the behaviour we're seeing is that the counters are created with value 0, but aren't actually incremented before the final push, and then the process dies.

Are you suggesting we add a step sized (~1 minute) sleep to the end of our jobs to ensure the incremented counter values are pushed?

Or is there another way to force pushing all "current" values as a program is shutting down, rather than wait for a step duration?

timtebeek avatar Mar 07 '20 12:03 timtebeek

@timtebeek The step-sized sleep was just for the demonstration, but I couldn't think of any way to push a value in an incomplete step with the current implementation except waiting for the completion of the step.

izeye avatar Mar 07 '20 13:03 izeye

Can confirm adding the one minute sleep at the end of our jobs resolved the immediate issue we have with counters not showing up incremented, so thanks for that @izeye !

Now you've indicated that there's presently no way to push a value in an incomplete step, even when an application is shutting down and metrics are pushed as a result of that shutdown hook. Would it be worthwhile to explore such a use case and keep this ticket open? Or do you suggest we close this one?

timtebeek avatar Mar 09 '20 08:03 timtebeek

@timtebeek Thanks for the confirmation!

I think it would be nice if, in case of shutdown, we could push a value in an incomplete step rather than waiting for the step to be completed, although I don't know how we could achieve it at the moment.

@shakuzen Thoughts?

izeye avatar Mar 09 '20 10:03 izeye

Perfect, thanks; I've (hopefully) updated the title to reflect this change in scope (feel free to refine as needed).

As a user of Micrometer for jobs, it would be great if, on application shutdown, any values in incomplete metric steps were still pushed to Elastic or whatever else is configured.

timtebeek avatar Mar 09 '20 11:03 timtebeek

I believe we have a previous ticket along these same lines - it came up for Influx as the backend if I remember correctly, but same use case. I do think we should have a better solution for this. I'll link the issues and add it to an upcoming release once I get done with the current release crunch.

shakuzen avatar Mar 10 '20 02:03 shakuzen

Awesome, thanks for considering such a change!

timtebeek avatar Mar 10 '20 06:03 timtebeek

Would really appreciate this one.

We run a big-data processing platform with a large number of processes, both long- and short-lived (where short-lived can mean under 10 seconds and long-lived several hours, and we can't easily assume upfront whether a process will be short- or long-lived), and we essentially have the same issue here: metrics from short-lived processes are not being published, so really only a subset of our processes is being measured.

Putting in a thread sleep call is not ideal for us, as due to the high number of short-lived processes we run, this might raise our costs significantly. I can also see that there is the possibility of reducing the step size, but then I'm concerned about the potential performance impact on long-running processes.

@shakuzen I'm wondering if you could recommend a preferred approach here. Perhaps we could modify the step at runtime? I'm not sure, however, if that's realistically achievable. I'd really appreciate your feedback on this.

pnaw94 avatar Aug 27 '20 10:08 pnaw94

Dear @shakuzen

Any news about this feature? I think that Timers are in the same situation.

admoca60 avatar Jan 13 '21 13:01 admoca60

Hi @shakuzen
I have a similar use case but with the Influx meter registry. During a graceful shutdown it doesn't report metrics due to the incomplete step. I'm really interested in sending metrics in short-lived-process and restarted-service use cases. Is there any estimated date for this feature?

Regards

garridobarrera avatar Mar 11 '21 20:03 garridobarrera

Please review this PR https://github.com/micrometer-metrics/micrometer/pull/2519

garridobarrera avatar Mar 20 '21 23:03 garridobarrera

I've made a note for our 1.8 planning to holistically look at our support for metrics in short-lived processes.

shakuzen avatar Mar 24 '21 05:03 shakuzen

Thanks @shakuzen, is it possible to include this support in 1.3.x too? It would be very useful to me to have this feature. Regards

garridobarrera avatar Mar 24 '21 06:03 garridobarrera

This seems to be a bigger issue on Google Cloud Run. We are not given any guarantees about when an instance will shut down, and Google only gives you 10 seconds to execute your shutdown hooks. So if the step interval does not overlap with that 10-second window, those metrics are dropped.

This is a real problem for us because Cloud Run instances seem to shut down much more frequently if requests are failing. So we are not receiving the right number of request-failure metrics, and thus not triggering our alerts on internal production errors. This should be a very high-priority issue.

ajs1998 avatar Dec 10 '21 22:12 ajs1998

@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be OK. I ran into this with Micronaut/Micrometer and it turned out the framework wasn't calling close.
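A rough sketch of that suggestion; the registry here is a stand-in (in a real application you would call close() on your actual CompositeMeterRegistry, which performs a final publish as part of closing):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistryShutdownHook {
    static final AtomicBoolean closed = new AtomicBoolean(false);

    // Stand-in for compositeMeterRegistry.close().
    static void closeRegistry() {
        closed.set(true);
    }

    public static void main(String[] args) {
        // The hook runs on normal exit and on SIGTERM, so the final publish
        // gets a chance to run even when the platform tears the instance down
        // (Cloud Run allows roughly 10 seconds for hooks to finish).
        Runtime.getRuntime().addShutdownHook(new Thread(RegistryShutdownHook::closeRegistry));
    }
}
```

Note that on versions before the fix discussed later in this thread, close() still only published complete steps, which is why this alone may not be enough.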

tburch avatar Dec 11 '21 16:12 tburch

@ajs1998, I think if you can get the instance of the CompositeMeterRegistry and then call close on it within your shutdown hook, you should be ok.

I have tried this but it does not seem to help. I haven't done enough debugging to figure out why, though. If I run the service locally, everything works fine, but if it's run in Cloud Run, the reported metrics are very flaky.

ajs1998 avatar Dec 11 '21 16:12 ajs1998

The only reliable way I got it working in AWS Lambda, prior to fixing the bug in Micronaut, was to publish metrics around every request that was handled.

tburch avatar Dec 11 '21 16:12 tburch

Hi, I've found a related problem.

I have a batch job that only increments a counter and then exits; this reproduces the issue described here. If I make the batch wait for 1 minute after execution, I get duplicated metrics in the same minute, because the first scheduled publish and the publishSafely() call in close() in PushMeterRegistry both send the metrics for that minute. (Full PoC

This can also be fixed by PR #2519 that @garridobarrera sent, if I understand correctly. As for that fix, it requires implementations (like New Relic, etc.) using PushMeterRegistry / StepMeterRegistry to implement publishGracefulShutdown().

Has this approach been confirmed as the fix? I haven't seen the PR merged.

I have another experimental fix that adds a wait on shutdown until the next metric publishing. This makes batches run longer (by 0 to stepMillis), but it doesn't require implementations to change their code.
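For what it's worth, the wait only needs to cover the remainder of the current step, not a full step. A sketch of that idea, assuming steps align to wall-clock multiples of the step size (which is how StepMeterRegistry schedules them); names here are illustrative, not Micrometer API:

```java
import java.time.Clock;

public class AwaitNextStep {
    // Time remaining until the current step rolls over, given epoch-aligned steps.
    static long millisUntilStepRollover(long nowMillis, long stepMillis) {
        return stepMillis - (nowMillis % stepMillis);
    }

    public static void main(String[] args) {
        long step = 60_000; // must match the registry's configured step size
        long wait = millisUntilStepRollover(Clock.systemUTC().millis(), step);
        System.out.println("would sleep " + wait + " ms so final increments land in a complete step");
        // Thread.sleep(wait);  // then close the registry to trigger the final publish
    }
}
```

Worst case (an increment right at a step boundary) this still waits a full step, but on average it halves the added runtime compared to a fixed full-step sleep.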

joshua-paypay avatar Jun 14 '22 07:06 joshua-paypay

+1 on this issue. I have tried to diagnose this issue and have created a small example for this scenario here: https://github.com/pirgeo/micrometer-counter-exporter-example

It feels quite counter-intuitive to me that the values from the previous export interval are exported a second time.

Is this something you would expect to be fixed in the respective Registries, or are you considering a change to the underlying Counter instrument?

pirgeo avatar Aug 04 '22 13:08 pirgeo

@shakuzen This issue seems to have been around for about 2 years, and it really hurts short-lived batches, where the data is never reported, more than long-lived applications. I have added a PR which addresses this issue.

lenin-jaganathan avatar Nov 18 '22 05:11 lenin-jaganathan

@lenin-jaganathan I created a PoC to solve this issue but some more exploratory work is needed, will look into it more soon.

jonatan-ivanov avatar Nov 18 '22 19:11 jonatan-ivanov

Voting for this as well. This seems like a major flaw: you have to sleep at the end of each process's termination or you will lose the last batch of meter values. This really hurts things like Spark, where you launch short-lived processes to do some work and then terminate.

mbbeaubi avatar Feb 01 '23 16:02 mbbeaubi

Is this being considered for 1.11 or do you think this is further out?

pirgeo avatar Feb 03 '23 13:02 pirgeo

Is this being considered for 1.11 or do you think this is further out?

It's a high priority. We're planning on getting it into the 1.11 release. I've updated the milestone.

shakuzen avatar Feb 07 '23 06:02 shakuzen

Hey, I have a very similar problem when recording data with a DistributionSummary and publishing using ElasticMeterRegistry with a 1-minute step. So while fixing StepCounter, please take a look at StepDistributionSummary too.

PS. I created a little semi-workaround by adding my own forcePublish method, overriding the super publish method and commenting out its body. Then during every job invocation I call forcePublish (code below). It works like a charm, except for the first and second job invocations. During the first job invocation and publish, the Counter is {count=0} and the DistributionSummary is {count=0, max=(correct value), mean=0, sum=0}. The second invocation then publishes correct data from the current and previous invocations. Subsequent job invocations publish data correctly.

Note: this listing is not 100% real; it's just to show the idea.

@Slf4j
public class MicrometerElasticRegistry extends ElasticMeterRegistry {
    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        log.info("FORCE PUBLISH");
        super.publish();
    }

    @Override
    protected void publish() {
        // super.publish(); // intentionally a no-op so metrics are only sent via forcePublish()
    }
}

    public static class TestApplication {
        public static void main(String[] args) {
            MicrometerElasticRegistry registry = new MicrometerElasticRegistry();
            Counter testCounter = Counter.builder("testCounter").register(registry);
            DistributionSummary testDist = DistributionSummary.builder("testDist").register(registry);
            for (int i = 0; i < 10; i++) {
                double value = new Random().nextDouble();
                invokeJob(registry, testCounter, testDist, value);
            }
        }

        public static void invokeJob(MicrometerElasticRegistry registry, Counter testCounter,
                                     DistributionSummary testDist, double value) {
            testCounter.increment(value);
            testDist.record(value);
            registry.forcePublish();
        }
    }

ppawlowski15 avatar Feb 09 '23 15:02 ppawlowski15

We've committed a fix for this that will go out shortly in the 1.11.0-M2 milestone release. We would really appreciate it if folks affected by this could try out that milestone and give feedback. Let us know if it does not work in some cases. We can consider backporting the fix to maintenance releases, but we want to gauge how much that is needed and have users test it in the milestone version first.
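For anyone wanting to try it, the milestone should be consumable with something along these lines (this assumes Micrometer milestones are published to the Spring milestone repository, as they have been historically; adjust if your setup differs):

```xml
<repositories>
	<repository>
		<id>spring-milestones</id>
		<url>https://repo.spring.io/milestone</url>
	</repository>
</repositories>

<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-registry-elastic</artifactId>
	<version>1.11.0-M2</version>
</dependency>
```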

shakuzen avatar Mar 14 '23 04:03 shakuzen

Hey, thanks for the fix. It is better, but not ideal.

In the context of AWS Lambda, metrics are still not published on Lambda invocation shutdown.

But with this version I successfully created a workaround and now it's working correctly.

  1. As in my previous post, I have a forcePublish method which calls super.publish() in my registry class:
public class MicrometerElasticRegistry extends ElasticMeterRegistry {
    public MicrometerElasticRegistry() {
        super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }
    
    public void forcePublish() {
        log.info("FORCE PUBLISH");
        super.publish();
    }
}
  2. Lambda flow
MicrometerElasticRegistry myRegistry = new MicrometerElasticRegistry();
Counter counter = Counter.builder("counterName").register(myRegistry);
counter.increment(100);
        
if (LambdaUtil.isColdStart())
    myRegistry.close();
else
    myRegistry.forcePublish();

So, as you can see, when it is a Lambda cold start I call close() on myRegistry, and when it is a Lambda warm start I call forcePublish() on myRegistry. If I call myRegistry.forcePublish() during a Lambda cold start, metrics are sent with zeros, and similar zeroed metrics are sent if I call myRegistry.close() during a Lambda warm start. This workaround does NOT work for 1.10.4 or 1.10.5: Lambda cold starts send metrics with 0 values (the same as I described in my previous post).

Hope that at least gives you a lead on where the problem is.

Feel free to ask for more info ;)

ppawlowski15 avatar Mar 15 '23 01:03 ppawlowski15

What do you mean by "Lambda invocation shutdown": publishing metrics after each invocation, or publishing them during JVM shutdown?

What is the context of the second code block ("Lambda flow")? Is it running for every request, or once for the lifetime of that JVM (e.g. in a static initializer)?

Where is LambdaUtil coming from? Is that something you wrote?

I'm not sure what the difference is, in terms of shipping metrics, between the JVM receiving the first request (cold start) and the Nth ("warm").

Could you please try initializing your registry in a static initializer and also registering a shutdown hook (Runtime.getRuntime().addShutdownHook(...)) to close it? Here's an example: https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/tree/main/java-demo
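In other words, something along these lines; this is a sketch with a stand-in registry type, and real code would create and close the actual MeterRegistry:

```java
public class HandlerSketch {
    // Stand-in for the real meter registry.
    static class Registry implements AutoCloseable {
        volatile boolean closed;

        @Override
        public void close() {
            closed = true; // real code: the final publish happens in close()
        }
    }

    // Created once per JVM, in a static initializer...
    static final Registry REGISTRY = new Registry();

    static {
        // ...and closed from a shutdown hook, so the final publish runs whether
        // the JVM exits normally or receives SIGTERM from the platform.
        Runtime.getRuntime().addShutdownHook(new Thread(REGISTRY::close));
    }

    public static void main(String[] args) {
        // per-invocation work goes here; no close()/publish per request
    }
}
```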

jonatan-ivanov avatar Mar 15 '23 17:03 jonatan-ivanov

I've changed/extended my code a little for better understanding.

MicrometerElasticRegistry class:

public class MicrometerElasticRegistry extends ElasticMeterRegistry {

    public MicrometerElasticRegistry() {
      super(ElasticConfig.DEFAULT, Clock.SYSTEM);
    }

    public void forcePublish() {
        super.publish();
    }

    public void incrementCounter(String metricsName, long count) {
        getCounter(metricsName).increment(count);
    }

    private Counter getCounter(String metricsName) {
        try {
            return this.get(metricsName).counter();
        } catch (MeterNotFoundException e) {
            return Counter.builder(metricsName).register(this);
        }
    }
}

Lambda main handler class (ver. 1):

@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {
    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test_counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            coldStart = false;
            metricsRegistry.close();
        }
        return null;
    }
}

With that code (ver. 1), Micrometer sends the correct value of the counter 'test_counter' only during a cold start; on every Nth warm start Micrometer sends 0. I don't know why, but I think it's strongly connected to the static initialization of MicrometerElasticRegistry, which happens only on a cold start. The shutdown hook is not working for me, so I decided to close the registry in the finally block, which should work more or less the same.

If I apply my previous workaround to that code, it looks like this:

Lambda main handler class (ver. 2 - with workaround):

@Log4j2
public class MetricsLambdaConsumer implements RequestHandler<KinesisEvent, Void> {
    private static final MicrometerElasticRegistry metricsRegistry = new MicrometerElasticRegistry();
    private static boolean coldStart = true;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        try {
            metricsRegistry.incrementCounter("test.counter", new Random().nextInt(100) + 1);
        } catch (Exception ex) {
            log.error("Exception", ex);
        } finally {
            if (coldStart) metricsRegistry.close();
            else metricsRegistry.forcePublish();

            coldStart = false;
        }
        return null;
    }
}

And this code (ver. 2) makes Micrometer send correct values on both cold and warm starts.

ppawlowski15 avatar Mar 16 '23 10:03 ppawlowski15