
resource_version set incorrectly while watching list_* functions

JacobHenner opened this issue 3 years ago · 9 comments

Watch objects maintain a resource_version field to allow the watch to resume at the last observed point after a timeout. However, there is a bug preventing this from working as intended. Instead of tracking the resource_version of the list (from ListMeta), the field is updated with the resource_version of the resource object included in each event:

https://github.com/tomplus/kubernetes_asyncio/blob/b33cde2519a5aebf8db22f3308af79885ad16d11/kubernetes_asyncio/watch/watch.py#L106-L108

When the library later attempts to use an object's resource_version as a parameter to the list_* function the watcher is operating on, the list function will return a 410 Gone, since that resource_version refers to an individual object and not the list.
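To make the distinction concrete, here is a minimal illustration (the namespace and helper name are ours, for illustration only):

```python
from kubernetes_asyncio import client

async def show_versions(v1: client.CoreV1Api) -> None:
    pods = await v1.list_namespaced_pod(namespace="default")
    # ListMeta: the collection-level version; a safe point to resume a watch from.
    print("list:", pods.metadata.resource_version)
    # ObjectMeta: per-object versions; these are what the watcher currently
    # tracks, and resuming a watch from a stale one is what yields 410 Gone.
    for pod in pods.items:
        print("object:", pod.metadata.name, pod.metadata.resource_version)
```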

I believe there are two actions to take here:

  1. Enable watch bookmarks and update the value of resource_version using those bookmark responses instead of the resource_version values from the ordinary event objects (see the sketch after this list).
  2. Incorporate automatic retries for 410 errors. I'll open a separate issue for this.
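A sketch of what bookmark-based tracking could look like from the caller's side, assuming the API server supports bookmarks and that the library deserializes bookmark objects like regular events (`allow_watch_bookmarks` is the generated list_* parameter for the Kubernetes `allowWatchBookmarks` option; the surrounding loop is illustrative only):

```python
import asyncio
from kubernetes_asyncio import client, config, watch

async def main():
    await config.load_kube_config()
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        resource_version = None
        w = watch.Watch()
        async for event in w.stream(v1.list_namespaced_pod,
                                    namespace="default",
                                    allow_watch_bookmarks=True):
            if event["type"] == "BOOKMARK":
                # Bookmark objects carry only metadata.resourceVersion; this
                # is a server-provided point to resume from after a timeout.
                resource_version = event["object"].metadata.resource_version
                continue
            print(event["type"], event["object"].metadata.name)

asyncio.run(main())
```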

JacobHenner · Feb 13 '21 22:02

Hi @JacobHenner

Using the resource version was also discussed in the official library: https://github.com/kubernetes-client/python/issues/819

We can use the resource version as you suggest and implement the list-and-watch pattern, but the question is: can we use a resourceVersion from incoming events to implement a list-watch-watch-... pattern?
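For reference, the list-and-watch pattern at the caller level looks roughly like this (a sketch; the ListMeta resourceVersion from the initial list seeds the watch, which is what client-go's reflector does):

```python
from kubernetes_asyncio import client, config, watch

async def list_then_watch():
    await config.load_kube_config()
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        # 1. List once; ListMeta's resourceVersion is a valid starting point.
        pods = await v1.list_namespaced_pod(namespace="default")
        resource_version = pods.metadata.resource_version
        # 2. Watch from that point onward.
        w = watch.Watch()
        async for event in w.stream(v1.list_namespaced_pod,
                                    namespace="default",
                                    resource_version=resource_version):
            print(event["type"], event["object"].metadata.name)
```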

IMO bookmarks can be used as an optimization only. Watch should work even if the API server doesn't support them.

410 handling definitely has to be ported from the official repository. Are you going to work on it?

Thanks for the reported issues :+1:

tomplus · Feb 14 '21 01:02

I'll try to look at these three issues tomorrow.

> We can use the resource version as you suggest and implement the list-and-watch pattern, but the question is: can we use a resourceVersion from incoming events to implement a list-watch-watch-... pattern?

I'll take a look at the Go library to see how it is handled there. Perhaps that'll be helpful.

> IMO bookmarks can be used as an optimization only. Watch should work even if the API server doesn't support them.

Indeed, I will keep this in mind.

JacobHenner · Feb 14 '21 01:02

So after conducting some more research, it seems as if the client-go library maintains a distinction between watches and informers. The former is the low-level operation; the latter is a high-level abstraction that wraps a watch and provides a cache, handles watcher reconnection, etc. This concept does not exist in the OpenAPI-derived libraries by default, although some of them (such as the Go one) have decided to implement a similar feature.

For Python, there is an open issue for supporting informers in the official library: https://github.com/kubernetes-client/python/issues/868. It seems to be stalled, and there's an outstanding question of whether higher-level concepts belong in the OpenAPI-derived libraries (or if they should be packaged separately).

To resolve this issue, I first propose:

  1. The internal resource_version of the watcher object is set to the resource_version of the first returned list (sketched just below). This will prevent reconnection issues when timeouts occur, as long as the resource_version is still valid.
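In pseudocode, the intent is roughly the following (the parameter names stand in for Watch internals and are assumptions for illustration, not the actual implementation):

```python
def seed_resource_version(watcher, first_response):
    """Hypothetical helper: seed the watcher once from ListMeta.

    'watcher' stands in for the Watch instance, 'first_response' for the
    first *List object returned by the wrapped list_* call.
    """
    if watcher.resource_version is None:
        # Seed from ListMeta's collection-level resourceVersion, and do not
        # overwrite it later with each event object's ObjectMeta version.
        watcher.resource_version = first_response.metadata.resource_version
```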

Then, we consider whether we should:

  1. Close #136 without addressing it at the watcher level, and let callers implement the higher-level behavior they need. Once the watch receives the 410 Gone response, retrying it with the same resource_version seems unproductive. Looking at it again, I'm unsure what the PR I linked to in #136 accomplished; maybe it was an attempt to accommodate eventual consistency? It looks like one of the reviewers had concerns about whether this type of retry should be built-in.
  2. Handle #136 partially, by adding support for watch bookmarks. If available, the watch's internal resource_version would be set to the resourceVersion specified in bookmark events, and the watch would retry with that updated resourceVersion upon receiving a 410 Gone (a sketch follows at the end of this comment). If no bookmark events were received, the retry would fail, and the caller would need to handle the failure.
  3. Close #136 without addressing it at the watcher level, and implement an informer abstraction that would handle watch retries, bookmarks, caching, etc. Determine whether this is a feature that should be specific to this async library, or something that should be shared with (or kept similar to) the official one.

Once a decision is made, I'll start working on a PR. Options 1 and 2 would be straightforward to handle. Option 3 would require some additional discussion.
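For option 2, a sketch of what the retry behavior could look like, written here as a caller-level wrapper (ApiException and allow_watch_bookmarks exist in the library and its generated API; the wrapper itself is illustrative, and a 410 may also surface as an ERROR event inside the stream, which this sketch does not handle):

```python
from kubernetes_asyncio import watch
from kubernetes_asyncio.client import ApiException

async def watch_with_bookmark_retry(list_func, **kwargs):
    """Illustrative: track bookmarks, retry once per bookmark on 410 Gone."""
    resource_version = kwargs.pop("resource_version", None)
    retried_from = object()  # sentinel: no retry attempted yet
    while True:
        stream_kwargs = dict(kwargs, allow_watch_bookmarks=True)
        if resource_version is not None:
            stream_kwargs["resource_version"] = resource_version
        w = watch.Watch()
        try:
            async for event in w.stream(list_func, **stream_kwargs):
                if event["type"] == "BOOKMARK":
                    # Latest server-provided safe resume point.
                    resource_version = event["object"].metadata.resource_version
                    continue
                yield event
        except ApiException as exc:
            # Retry only on 410, and only if a bookmark has moved us forward
            # since the last failed attempt; otherwise the caller handles it.
            if exc.status != 410 or resource_version == retried_from:
                raise
            retried_from = resource_version
```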

JacobHenner · Feb 14 '21 19:02

I've written a small proof-of-concept informer-like wrapper. I have to finish adding some features and testing the implementation before sharing, but perhaps it'd be useful here. So far it has support for bookmarks, reconnections on 410s, and caching. The cache is used to avoid replaying or missing events that occurred between the time a watch went invalid and the time a replacement watch was started.
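As a sketch of that cache idea (a guess at the approach, not the PoC's actual code): keep the last object seen per UID, and after a re-list synthesize only the events the consumer missed.

```python
def diff_relist(cache, fresh_items):
    """Synthesize the events missed while the watch was invalid.

    cache: dict mapping object UID -> last object delivered to the consumer
    fresh_items: the .items of a fresh list made after the watch broke
    Returns (event_type, object) pairs; updates the cache in place.
    """
    events = []
    seen = set()
    for obj in fresh_items:
        uid = obj.metadata.uid
        seen.add(uid)
        if uid not in cache:
            events.append(("ADDED", obj))
        elif cache[uid].metadata.resource_version != obj.metadata.resource_version:
            events.append(("MODIFIED", obj))
        cache[uid] = obj
    for uid in list(cache):
        if uid not in seen:
            # The object disappeared while no watch was running.
            events.append(("DELETED", cache.pop(uid)))
    return events
```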

Assuming it is useful, I have two follow-up questions:

  1. Would that kind of wrapper be considered for inclusion in this library, or does it belong in a separate library? (This question was asked for informers in the official Python client as well, see link above).
  2. If this were to be included, would #134 and #136 still need to be handled in the current implementation, or would the advice be to use the higher-level abstraction unless the user intends to handle 410s and error events on their own?

JacobHenner · Feb 15 '21 04:02

An informer sounds like a really good idea. It looks like it's the only way to handle reconnection and use the resource version properly without notifying about the same events twice. IMO it should be part of the library, as client-go has it. I'm also not sure it's possible to create one solution for both libraries; I'm afraid it has to be ported/synced manually from one to the other, as we do with Watch() and other components.

We can advise on how to deal with reconnection in Watch() using some examples, and recommend your informer as an efficient way to observe K8s objects.

Let me know if I can help you with this feature. Thanks.

tomplus · Feb 15 '21 11:02

So far I've implemented an informer-like wrapper which handles 410 Gone reconnections, bookmarking, and basic event handling. I'm still working to figure out how closely to align it with the Informer implementations in client-go and the Java and OpenAPI-generated Go libraries. Both client-go and the Java library have complex implementations with many abstractions. I'm not yet sure to what extent that's necessary here, and I'm thinking about how to make the implementation extensible so it can be published incrementally.

As for the original topic presented in this issue, I will provide feedback after thinking about it some more. I believe that some conditions should be handled within the watch itself and others should be handled in higher-level abstractions, but I haven't yet figured out which features fall into which category. When my changes are ready for PR, perhaps the location of some of the reliability measures can be discussed.

JacobHenner · Feb 15 '21 22:02

I didn't have as much time as I hoped to work on this over the past week. I will send something over once ready. My initial implementation has been focused on increasing reliability and providing an informer-like object with semantics similar to, but not exactly the same as, client-go's. It's possible that some of the changes introduced should be moved into the watcher itself. I'll comment on the PR in those places once it's available.

I'm now looking through client-go to determine if I should modify my current implementation to be more similar, or if I should continue as-is. I want to make sure I haven't ignored any intentional stability or "correctness" decisions made by the client-go devs.

JacobHenner · Feb 22 '21 01:02

Apologies for not providing an implementation sooner, my priorities have changed and I haven't had much time to work on this.

I've published an initial implementation of my idea in a separate project, since I'm not certain which parts should be included in the library's Watch class and which should remain separate. It can be reviewed here, with an example here.

I've commented on the sections that might be considered for inclusion in the Watch class. After that's settled, we can determine whether the informer belongs within this library or in a separate one.

JacobHenner · Apr 17 '21 22:04

Thanks for your reply.

I agree with you: the parts of your library related to the Watch can be moved here to fix the base library (better reconnect handling, bookmark support, etc.). The informer implementation can be developed as a separate library.

Let me know if I can help in any way.

tomplus · May 03 '21 23:05