spring-cloud-commons icon indicating copy to clipboard operation
spring-cloud-commons copied to clipboard

HealthCheckServiceInstanceListSupplier.healthCheckFlux emitting multiple values in-between the given intervals.

Open brizo001 opened this issue 3 years ago • 0 comments

This issue is present in version 3.1.3 and others.

My use case is simple. In my custom load-balancer I use the HealthCheckServiceInstanceListSupplier to grab a list of available instances, compare that list to all known instances and make a determination on whether to route to that instance or the next available one. I'm making assumptions that this list will be generated every spring.cloud.loadbalancer.health-check.interval

The issue I'm having is that the flux emits values of this list multiple times during that interval. It seems like it's every time isAlive mono reports a value. The problem is that there are points in time which the list still does not have all the alive values which means we are making the determination that the instance is down on incomplete data, this is leading us to send incorrect metrics and re-routing requests when they should not be. What I'm expecting is that a the resulting list will be generated at most every interval.

Sample I've added additional logging to identify the issue. I've subscribed to the flux and i'm outputting the difference between all the emitted values and the time since the last emitted value. For this case I've set spring.cloud.loadbalancer.health-check.interval to 25s

#TESTING #ERROR #INSTANCEDIFF 25 Seconds since Last Update - KnownInstances: [0, 1, 2], CurrentInstances: [1, 0, 2], IncomingInstances: [1], DownInstances: [0, 2] Diff: + [] | - [0, 2]

  • The value is emitted at the expected interval but reports only 1 of the 3 known instances as alive. Shortly after we get updates to the list.

#TESTING #ERROR #INSTANCEDIFF 0 Seconds since Last Update - KnownInstances: [0, 1, 2], CurrentInstances: [1], IncomingInstances: [1, 0], DownInstances: [2] Diff: + [0] | - []

  • After the initial value is emitted we can see other values are quick to follow.

#TESTING #ERROR #INSTANCEDIFF 0 Seconds since Last Update - KnownInstances: [0, 1, 2], CurrentInstances: [1, 0], IncomingInstances: [1, 0, 2], DownInstances: [] Diff: + [2] | - []

  • Finally we say that all known instances are up but by this point we could have processed requests and routed them incorrectly.

In the time between all these reports there could been many requests and therefor those requests would have been routed incorrectly. (This issue is more apparent during PSR or high load scenarios)

The good news.. I believe i found the exact point where the bug is and have a fix. I have tested this change locally and it did the trick.

What i believe is happening here is that it's not waiting for all results to be completed before emitting a value. ** Bug https://github.com/spring-cloud/spring-cloud-commons/blob/main/spring-cloud-loadbalancer/src/main/java/org/springframework/cloud/loadbalancer/core/HealthCheckServiceInstanceListSupplier.java#L140-L144

			List<ServiceInstance> result = new ArrayList<>();
			return Flux.merge(checks).map(alive -> {
				result.add(alive);
				return result;
			}).defaultIfEmpty(result);

** Fix return Flux.merge(checks).collectList();

brizo001 avatar Jun 13 '22 19:06 brizo001