
Help needed: Random TimeoutException

Open cdprete opened this issue 5 months ago • 6 comments

Hello.

I'm using SBA 3.5.1 with Spring Boot 3.5.3 / 3.5.4 in both servlet and reactive environments. Registration and service discovery happen through Eureka.

From time to time, and seemingly at random, I get the following STATUS_CHANGE in the UI:

{
    "statusInfo": {
        "status": "OFFLINE",
        "details": {
            "exception": "java.util.concurrent.TimeoutException",
            "message": "Did not observe any item or terminal signal within 9000ms in 'log' (and no fallback has been configured)"
        }
    }
}

which basically puts the affected app offline, only for it to come back online on the next run of the check. Can someone please provide some support on this? I opened https://stackoverflow.com/questions/79655549/spring-boot-admin-tons-of-timeouts in the past, but I got no replies there at all.

cdprete avatar Jul 29 '25 05:07 cdprete

Hello. I checked the docs and set the default-timeout to 30s, but then I got the warning

2025-07-29T06:53:27.193Z WARN 1 --- [Zookeepers Monitoring] [ main] d.c.b.a.s.c.AdminServerAutoConfiguration : Default timeout (PT30S) is larger than status interval (PT10S), hence status interval will be used as timeout.

in the logs and I see 2 issues with this:

  1. how long a request may take (timeout) and how often a request is made (interval) are, to me, two different things with different scopes.
  2. looking at the code that emits the warning, the default timeout seems to be ignored anyway:
@Bean(initMethod = "start", destroyMethod = "stop")
@ConditionalOnMissingBean
public StatusUpdateTrigger statusUpdateTrigger(StatusUpdater statusUpdater, Publisher<InstanceEvent> events) {
	AdminServerProperties.MonitorProperties monitorProperties = this.adminServerProperties.getMonitor();

	Duration defaultTimeout = monitorProperties.getDefaultTimeout();
	Duration statusInterval = monitorProperties.getStatusInterval();

	if (defaultTimeout.compareTo(statusInterval) > 0) {
		log.warn(
				"Default timeout ({}) is larger than status interval ({}), hence status interval will be used as timeout.",
				defaultTimeout, statusInterval);
	}

	return new StatusUpdateTrigger(statusUpdater, events, monitorProperties.getStatusInterval(),
			monitorProperties.getStatusLifetime(), monitorProperties.getStatusMaxBackoff());
}

@erikpetzold @SteKoe @ulischulte what are your thoughts about this? Am I missing something?

cdprete avatar Jul 29 '25 07:07 cdprete

Hi @cdprete ,

have a look at issue 3184. Interval-based tasks like statusUpdate have some limitations: the timeout cannot be longer than the interval, so the interval is the upper bound for the timeout. The default timeout here is only used to log a warning when it is larger than the interval. Since we always use the statusInterval as the timeout here, the log message might be a bit misleading.
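To illustrate the upper-bound rule only, here is a minimal sketch. The class and method names are hypothetical and this is not the code Spring Boot Admin runs; it just shows what "the interval caps the timeout" means for two Duration values.

```java
import java.time.Duration;

class TimeoutBound {

    // Hypothetical helper (not part of Spring Boot Admin): returns the configured
    // timeout unless it exceeds the polling interval, in which case the interval wins.
    static Duration effectiveTimeout(Duration configuredTimeout, Duration interval) {
        return configuredTimeout.compareTo(interval) > 0 ? interval : configuredTimeout;
    }

    public static void main(String[] args) {
        // A 30s timeout with a 10s interval is capped at the interval.
        System.out.println(effectiveTimeout(Duration.ofSeconds(30), Duration.ofSeconds(10))); // prints "PT10S"
        // A 5s timeout with a 10s interval is left untouched.
        System.out.println(effectiveTimeout(Duration.ofSeconds(5), Duration.ofSeconds(10)));  // prints "PT5S"
    }
}
```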

@SteKoe , @erikpetzold any thoughts on this?

ulischulte avatar Aug 01 '25 07:08 ulischulte

Hi @ulischulte.

I saw that ticket and, while it kind of explains the rationale behind it, I must say I disagree with it. Personally, I would expect these interval-like jobs to behave like Java's scheduleWithFixedDelay rather than scheduleAtFixedRate.
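To make the difference concrete, here is a rough sketch using the JDK's ScheduledExecutorService (where the fixed-delay method is named scheduleWithFixedDelay). The 150 ms task and 100 ms period are made-up numbers standing in for a health check that takes longer than its interval; none of this is Spring Boot Admin code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SchedulingComparison {

    // Runs a 150 ms task three times with a 100 ms period (or delay) and returns
    // the start time of each run in milliseconds, relative to the first run.
    static List<Long> startOffsets(boolean fixedDelay) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        List<Long> starts = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(3);

        Runnable slowCheck = () -> {
            if (done.getCount() == 0) {
                return; // ignore any run scheduled after the third one
            }
            starts.add(System.nanoTime());
            try {
                Thread.sleep(150); // simulated check that is slower than the period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        };

        if (fixedDelay) {
            // The next run starts 100 ms after the previous run has finished.
            scheduler.scheduleWithFixedDelay(slowCheck, 0, 100, TimeUnit.MILLISECONDS);
        } else {
            // Runs are due every 100 ms; a late run starts as soon as the previous one ends.
            scheduler.scheduleAtFixedRate(slowCheck, 0, 100, TimeUnit.MILLISECONDS);
        }

        done.await();
        scheduler.shutdownNow();

        List<Long> offsets = new ArrayList<>();
        for (long start : starts) {
            offsets.add((start - starts.get(0)) / 1_000_000);
        }
        return offsets;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fixed rate:  " + startOffsets(false));
        System.out.println("fixed delay: " + startOffsets(true));
    }
}
```

With fixed rate, consecutive runs start roughly 150 ms apart because the task overruns its period; with fixed delay, they start roughly 250 ms apart because the delay is counted from the end of the previous run. In the second mode a timeout longer than the interval would not cause overlapping checks.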

The other part of the story would be to understand:

  1. why those requests time out. When I call them myself from the browser they return in less than one second.
  2. whether it would make more sense to set a read timeout at the client level (I think there is a PR open for it) rather than at the Reactor level. Maybe, and I repeat maybe, everything is fine connection-wise and it's just Reactor that's struggling.
  3. whether those timeout errors should be implicitly retried (I would say so).
  4. whether a timeout error should really put an instance offline. I would say yes if it still timed out after all the retries, but otherwise RESTRICTED would be a better option, I think.
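For points 3 and 4, something along these lines is what I have in mind. It is only a sketch using plain JDK classes: checkStatus, the attempt count, and the timeouts are hypothetical, it is not how StatusUpdater is implemented, and it does not model the RESTRICTED status.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class RetryingStatusCheck {

    // Hypothetical helper (not part of Spring Boot Admin): runs the health check up to
    // maxAttempts times and reports OFFLINE only if every attempt times out.
    static String checkStatus(Callable<String> healthCheck, Duration timeout, int maxAttempts)
            throws InterruptedException {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<String> result = executor.submit(healthCheck);
                try {
                    return result.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    result.cancel(true); // abandon the slow attempt and try again
                } catch (ExecutionException e) {
                    return "OFFLINE"; // non-timeout failures are not retried in this sketch
                }
            }
            return "OFFLINE"; // all attempts timed out
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated health endpoint: the first call hangs, later calls answer immediately.
        Callable<String> flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                Thread.sleep(1_000);
            }
            return "UP";
        };
        System.out.println(checkStatus(flaky, Duration.ofMillis(200), 3)); // prints "UP"
    }
}
```

Since the health check is an idempotent GET, a single slow response is absorbed by the retry and the instance is only reported OFFLINE when every attempt times out.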

cdprete avatar Aug 01 '25 07:08 cdprete

Any updates on this?

cdprete avatar Sep 20 '25 08:09 cdprete

Hello @ulischulte.

These random timeouts are becoming more and more problematic for the reliability of the solution. I now have an application that is reported as up for 1 hour when it has actually been up for 19+ days. Checking the journal, and leaving aside the fact that I can only see events from the 29th of Sept (basically yet another confirmation of https://github.com/codecentric/spring-boot-admin/issues/4389), I can see that the status changes as UP → OFFLINE (due to the timeout) → UP.

cdprete avatar Sep 29 '25 12:09 cdprete

@ulischulte could we/you introduce retries in the StatusUpdater, maybe? It's just a GET call fetching the health of the instance, therefore it's safe to retry.

cdprete avatar Oct 25 '25 13:10 cdprete