
Discoverer should not retry deleting a service/endpoint that does not exist

Open alexbrand opened this issue 6 years ago • 12 comments

While doing performance testing, I noticed that the discovery components are wasting cycles trying to delete services/endpoints that no longer exist. This is currently impacting other tests (which can be thought of as other gimbal teams) because the discoverer is busy trying to delete non-existent services, instead of discovering new services for other teams.

How does this happen? The performance test creates a namespace where it creates hundreds of services and endpoints in the backend cluster. These are then discovered by Gimbal and everything is fine.

Once the test is done, instead of deleting services/endpoints one by one, the performance test script destroys the test namespace on both the gimbal cluster and the backend cluster. The deletion of the namespace in the backend cluster results in watch notifications sent to the discoverers about the deletion of services. The problem is that the services/endpoints have already been deleted in the gimbal cluster, because the entire namespace was deleted.

alexbrand avatar Jun 01 '18 19:06 alexbrand

How many worker threads are set up on your discoverer? Could you retest with a higher number? Just curious whether that might have a (positive) effect on the performance of the queue.

stevesloka avatar Jun 01 '18 19:06 stevesloka

Originally had 2, bumped it up to 16 but still seeing this issue. Will try bumping up even higher, but not sure how much it'll help.

alexbrand avatar Jun 01 '18 21:06 alexbrand

Having more workers should process the queue faster, since they operate in parallel. Just a thought before going down a design route to resolve this.

stevesloka avatar Jun 01 '18 21:06 stevesloka
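
For reference, this is roughly how a discoverer's thread count maps to parallel consumers of a single client-go workqueue. It is a generic sketch of the standard controller pattern, not the actual Gimbal code; type and method names are illustrative:

```go
package discovery

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

type discoverer struct {
	queue workqueue.RateLimitingInterface
}

// Run starts `threadiness` goroutines that all drain the same queue, so a
// higher thread count processes backlogged keys in parallel.
func (d *discoverer) Run(threadiness int, stopCh <-chan struct{}) {
	defer d.queue.ShutDown()
	for i := 0; i < threadiness; i++ {
		// Each goroutine repeatedly drains the shared queue until stopCh closes.
		go wait.Until(d.runWorker, time.Second, stopCh)
	}
	<-stopCh
}

func (d *discoverer) runWorker() {
	for d.processNextItem() {
	}
}

func (d *discoverer) processNextItem() bool {
	key, quit := d.queue.Get()
	if quit {
		return false
	}
	defer d.queue.Done(key)
	// ... sync the service/endpoint identified by key against the Gimbal cluster ...
	return true
}
```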

I haven't thought through the implications, but I think we should be able to swallow the "does not exist" error when we are trying to delete, instead of raising it up to the queue for re-queuing.

What's unclear to me is if we can naturally get into this state by somehow processing a delete notification before processing the corresponding add, and thus end up with an extra service that should have been deleted.

alexbrand avatar Jun 01 '18 21:06 alexbrand
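
A minimal sketch of what swallowing the "does not exist" error on deletion could look like, assuming a recent client-go clientset (illustrative only, not the actual Gimbal code):

```go
package discovery

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteService deletes the mirrored service in the Gimbal cluster, treating
// "already deleted" as success so the key is not handed back to the queue.
func deleteService(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	err := client.CoreV1().Services(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// Nothing to do: the service is already gone (e.g. its namespace was deleted).
		return nil
	}
	return err
}
```

With something like this in place, a delete notification for an already-deleted service completes immediately instead of being re-queued.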

These charts are interesting.

We can see when the namespace is deleted and all services are gone, but the discoverer is still trying to delete them, and thus accumulating errors.

Looks like it took approximately 20 minutes to drain the queue.

[screenshot: discoverer queue and error charts]

alexbrand avatar Jun 01 '18 21:06 alexbrand

Memory usage of discoverer:

[screenshot: discoverer memory usage chart]

alexbrand avatar Jun 01 '18 21:06 alexbrand

Ohh, I understand now; I didn't realize that we were re-queuing each time on error. I think we could handle the "does not exist" error, or put a max retry on the queue: once an item has been attempted 2 times, just drop it from the queue. That would also cover the scenario where an update or add errors (for whatever reason) and we keep retrying.

stevesloka avatar Jun 02 '18 00:06 stevesloka
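
A minimal sketch of the max-retry idea using client-go's rate-limiting workqueue, where a key is dropped via Forget after a couple of failed attempts (the function name and retry limit are illustrative, not Gimbal's implementation):

```go
package discovery

import "k8s.io/client-go/util/workqueue"

// maxRetries is illustrative; a real discoverer would make this configurable.
const maxRetries = 2

// handleErr decides whether a failed key is retried or dropped.
func handleErr(queue workqueue.RateLimitingInterface, err error, key interface{}) {
	if err == nil {
		// Success: clear the rate limiter's memory of this key.
		queue.Forget(key)
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		// Retry later with back-off.
		queue.AddRateLimited(key)
		return
	}
	// Too many failures: stop retrying and drop the item.
	queue.Forget(key)
}
```

Combined with treating NotFound on delete as success, this bounds how long a burst of stale deletes can occupy the workers.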

Yeah, I actually think we need both... It doesn't look like we have a max retry on the queue item. I'll pick this up and look into it more.

alexbrand avatar Jun 04 '18 13:06 alexbrand

Not sure if we'd want a retry metric or not, to track how many items are retried or thrown out because of errors.

stevesloka avatar Jun 04 '18 13:06 stevesloka

Now that we know we were not actually retrying before #141, the issue title was somewhat misleading.

What was actually going on is that the namespace was deleted out from under the discoverer on both clusters, resulting in the backend cluster's API server sending a long list of DELETE watch notifications to the discoverer. This is the expected behavior.

Now that we have added retries though, we should probably not attempt to retry a deletion for a service or endpoint that does not exist.

Another idea (which needs a bit more thought): Have a separate queue per namespace, so that the operations performed by one team do not affect other teams. This would be a larger change that requires discussion and design though (cc @rosskukulinski @stevesloka).

alexbrand avatar Jun 07 '18 18:06 alexbrand
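
A rough sketch of the per-namespace queue idea: route each key to a workqueue dedicated to its namespace so a burst of work for one team cannot starve the others. This is a hypothetical design, not existing Gimbal code, and all names are illustrative:

```go
package discovery

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// namespacedQueues routes keys to one workqueue per namespace.
type namespacedQueues struct {
	mu     sync.Mutex
	queues map[string]workqueue.RateLimitingInterface
	// startWorkers launches the goroutines that drain a newly created queue.
	startWorkers func(q workqueue.RateLimitingInterface)
}

// enqueue adds a key to the queue for its namespace, creating the queue
// (and its workers) on first use.
func (n *namespacedQueues) enqueue(namespace, key string) {
	n.mu.Lock()
	q, ok := n.queues[namespace]
	if !ok {
		q = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		n.queues[namespace] = q
		n.startWorkers(q)
	}
	n.mu.Unlock()
	q.Add(key)
}
```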

The more I think about this, the more I am second guessing whether we should do it. It seems like the deletion of a namespace with thousands of services is an unlikely scenario when running Gimbal in production. Is that a reasonable statement? @stevesloka @rosskukulinski

alexbrand avatar Jun 07 '18 19:06 alexbrand

My guess is that thousands of services within a namespace is unlikely. I would think, though, that bringing up a new cluster could see a high influx of data, as could any time the re-sync window is hit on a large cluster.

This was my reason for the re-queue param: you may not want to retry more than once if something fails.

stevesloka avatar Jun 07 '18 19:06 stevesloka