Old events from the past yielded due to remembered `resource_version`
Current behaviour
The `list...()` operation (actually, the Kubernetes API itself) returns the resources in arbitrary order. It can happen that very old resources (e.g. custom resources a month old) come last in the list.
In the current implementation, the watcher-streamer remembers the `resource_version` of the last seen resource, so such an old resource version (a month old) can be remembered as the last seen one. And when the HTTP call is disconnected for any reason, the watcher-streamer starts a new one, using that remembered old `resource_version` as the base.
As a result, all the changes of all the resources of that resource kind are yielded again: i.e. everything that happened in the past month, despite having already been yielded before (and presumably handled by the consumer). For objects created since that old timestamp, it yields the ADDED event, all the MODIFIED events, and even the DELETED events.
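The behaviour can be illustrated with a toy model (a hypothetical sketch, NOT the real `kubernetes.watch.Watch` source; `fake_server` and the sample versions are invented for illustration):

```python
def fake_server(resource_version=None):
    # The initial burst of ADDED events, deliberately NOT sorted by version,
    # as observed on a real cluster:
    listing = [("ADDED", "a", "300"), ("ADDED", "b", "900"), ("ADDED", "c", "100")]
    if resource_version is None:
        yield from listing
    else:
        # Replays every change newer than the given version:
        yield from (e for e in listing if int(e[2]) > int(resource_version))

def naive_watch():
    rv = None
    for _ in range(2):  # two "connections": initial watch + one reconnect
        for event in fake_server(rv):
            rv = event[2]  # remembers the LAST seen version, not the maximum
            yield event

events = list(naive_watch())
# The reconnect replays "a" and "b" again, because the last item of the
# first listing happened to be the old "c" (version 100).
```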
Example
Example for my custom resource kind:
In [59]: kubernetes.config.load_kube_config() # developer's config files
In [60]: api = kubernetes.client.CustomObjectsApi()
In [61]: api_fn = api.list_cluster_custom_object
In [62]: w = kubernetes.watch.Watch()
In [63]: stream = w.stream(api_fn, 'example.com', 'v1', 'mycrds')
In [64]: for ev in stream: print((ev['type'], ev['object'].get('metadata', {}).get('name'), ev['object'].get('metadata', {}).get('resourceVersion'), ev['object'] if ev['type'] == 'ERROR' else None))
('ADDED', 'mycrd-20190328073027', '213646032', None)
('ADDED', 'mycrd-20190404073027', '222002640', None)
('ADDED', 'mycrd-20190408065731', '222002770', None)
('ADDED', 'mycrd-20190409073007', '222002799', None)
('ADDED', 'mycrd-20190410073012', '222070110', None)
('ADDED', 'mycrd-20190412073005', '223458915', None)
('ADDED', 'mycrd-20190416073028', '226128256', None)
('ADDED', 'mycrd-20190314165455', '233262799', None)
('ADDED', 'mycrd-20190315073002', '205552290', None)
('ADDED', 'mycrd-20190321073022', '209509389', None)
('ADDED', 'mycrd-20190322073027', '209915543', None)
('ADDED', 'mycrd-20190326073030', '212318823', None)
('ADDED', 'mycrd-20190402073005', '222002561', None)
('ADDED', 'mycrd-20190415154942', '225660142', None)
('ADDED', 'mycrd-20190419073010', '228579290', None)
('ADDED', 'mycrd-20190423073032', '232894099', None)
('ADDED', 'mycrd-20190424073015', '232894129', None)
('ADDED', 'mycrd-20190319073031', '207954735', None)
('ADDED', 'mycrd-20190403073019', '222002615', None)
('ADDED', 'mycrd-20190405073040', '222002719', None)
('ADDED', 'mycrd-20190415070301', '225374502', None)
('ADDED', 'mycrd-20190417073005', '226917625', None)
('ADDED', 'mycrd-20190418073023', '227736631', None)
('ADDED', 'mycrd-20190327073030', '212984265', None)
('ADDED', 'mycrd-20190422061326', '230661413', None)
('ADDED', 'mycrd-20190318070654', '207313230', None)
('ADDED', 'mycrd-20190401101414', '216222726', None)
('ADDED', 'mycrd-20190320073041', '208884644', None)
('ADDED', 'mycrd-20190326165718', '212611027', None)
('ADDED', 'mycrd-20190329073007', '214304201', None)
('ADDED', 'mycrd-20190325095839', '211712843', None)
('ADDED', 'mycrd-20190411073018', '223394843', None)
^C
Please note the random order of resource_versions. Depending on your luck and the current state of the cluster, you can get either a new-enough or the oldest resource in the last line.
Let's use the latest resource_version 223394843 with a new watch object:
In [76]: w = kubernetes.watch.Watch()
In [79]: stream = w.stream(api_fn, 'example.com', 'v1', 'mycrds', resource_version='223394843')
In [80]: for ev in stream: print((ev['type'], ev['object'].get('metadata', {}).get('name'), ev['object'].get('metadata', {}).get('resourceVersion'), ev['object'] if ev['type'] == 'ERROR' else None))
('ERROR', None, None, {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 223394843 (226210031)', 'reason': 'Gone', 'code': 410})
('ERROR', None, None, {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 223394843 (226210031)', 'reason': 'Gone', 'code': 410})
('ERROR', None, None, {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 223394843 (226210031)', 'reason': 'Gone', 'code': 410})
('ERROR', None, None, {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 223394843 (226210031)', 'reason': 'Gone', 'code': 410})
……… repeated infinitely ………
Well, okay, let's try the recommended resource_version, which is at least known to the API:
In [83]: w = kubernetes.watch.Watch()
In [84]: stream = w.stream(api_fn, 'example.com', 'v1', 'mycrds', resource_version='226210031')
In [85]: for ev in stream: print((ev['type'], ev['object'].get('metadata', {}).get('name'), ev['object'].get('metadata', {}).get('resourceVersion'), ev['object'] if ev['type'] == 'ERROR' else None))
('ADDED', 'mycrd-expr1', '226370109', None)
('MODIFIED', 'mycrd-expr1', '226370111', None)
('MODIFIED', 'mycrd-expr1', '226370116', None)
('MODIFIED', 'mycrd-expr1', '226370127', None)
('MODIFIED', 'mycrd-expr1', '226370549', None)
('DELETED', 'mycrd-expr1', '226370553', None)
('ADDED', 'mycrd-20190417073005', '226917595', None)
('MODIFIED', 'mycrd-20190417073005', '226917597', None)
('MODIFIED', 'mycrd-20190417073005', '226917605', None)
('MODIFIED', 'mycrd-20190417073005', '226917614', None)
('MODIFIED', 'mycrd-20190417073005', '226917625', None)
('ADDED', 'mycrd-20190418073023', '227736612', None)
('MODIFIED', 'mycrd-20190418073023', '227736613', None)
('MODIFIED', 'mycrd-20190418073023', '227736618', None)
('MODIFIED', 'mycrd-20190418073023', '227736629', None)
('MODIFIED', 'mycrd-20190418073023', '227736631', None)
('ADDED', 'mycrd-20190419073010', '228579268', None)
('MODIFIED', 'mycrd-20190419073010', '228579269', None)
('MODIFIED', 'mycrd-20190419073010', '228579276', None)
('MODIFIED', 'mycrd-20190419073010', '228579286', None)
('MODIFIED', 'mycrd-20190419073010', '228579290', None)
('ADDED', 'mycrd-20190422061326', '230661394', None)
('MODIFIED', 'mycrd-20190422061326', '230661395', None)
('MODIFIED', 'mycrd-20190422061326', '230661399', None)
('MODIFIED', 'mycrd-20190422061326', '230661411', None)
('MODIFIED', 'mycrd-20190422061326', '230661413', None)
('ADDED', 'mycrd-20190423073032', '231459008', None)
('MODIFIED', 'mycrd-20190423073032', '231459009', None)
('MODIFIED', 'mycrd-20190423073032', '231459013', None)
('MODIFIED', 'mycrd-20190423073032', '231459025', None)
('MODIFIED', 'mycrd-20190423073032', '231459027', None)
('MODIFIED', 'mycrd-20190423073032', '232128498', None)
('MODIFIED', 'mycrd-20190423073032', '232128514', None)
('MODIFIED', 'mycrd-20190423073032', '232128518', None)
('ADDED', 'mycrd-20190424073015', '232198227', None)
('MODIFIED', 'mycrd-20190424073015', '232198228', None)
('MODIFIED', 'mycrd-20190424073015', '232198235', None)
('MODIFIED', 'mycrd-20190424073015', '232198247', None)
('MODIFIED', 'mycrd-20190424073015', '232198249', None)
('MODIFIED', 'mycrd-20190423073032', '232894049', None)
('MODIFIED', 'mycrd-20190423073032', '232894089', None)
('MODIFIED', 'mycrd-20190424073015', '232894093', None)
('MODIFIED', 'mycrd-20190423073032', '232894099', None)
('MODIFIED', 'mycrd-20190424073015', '232894119', None)
('MODIFIED', 'mycrd-20190424073015', '232894129', None)
('ADDED', 'mycrd-20190425073032', '232973618', None)
('MODIFIED', 'mycrd-20190425073032', '232973619', None)
('MODIFIED', 'mycrd-20190425073032', '232973624', None)
('MODIFIED', 'mycrd-20190425073032', '232973635', None)
('MODIFIED', 'mycrd-20190425073032', '232973638', None)
('MODIFIED', 'mycrd-20190314165455', '233190859', None)
('MODIFIED', 'mycrd-20190314165455', '233190861', None)
('MODIFIED', 'mycrd-20190314165455', '233254055', None)
('MODIFIED', 'mycrd-20190314165455', '233254057', None)
('MODIFIED', 'mycrd-20190314165455', '233262797', None)
('MODIFIED', 'mycrd-20190314165455', '233262799', None)
^C
All this is dumped immediately; nothing happens in the cluster during these operations. All these changes are old, i.e. not expected, as they were already processed before doing `list...()`.
Please note that even the deleted, no-longer-existing resource is yielded ("expr1").
Dilemma
See https://github.com/kubernetes-client/python-base/pull/131 for a suggested implementation of the monotonically increasing resource_version as remembered by the watcher.
However, one of the unit tests says about the resource version:

> rv must be treated as an opaque value we cannot interpret it and order it so rely on k8s returning the events completely and in order

Kubernetes does not keep that promise, and returns the events in random order.
Way A: If the client library starts interpreting the resource versions and remembering the maximum value seen, it can break its compatibility with Kubernetes.
Way B: If the client library decides to treat the resource version as opaque and non-interpretable, it should also stop remembering it, as remembering leads to re-yielding events from the (long-ago) past, as demonstrated above.
In the latter case, all `resource_version` support should be removed from `Watch.stream()` entirely, and only the users of the watcher-streamer should decide whether they interpret the resource version or not, and track it by their own rules (at their own risk).
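If Way B were taken, a consumer who chooses to interpret the versions could track them outside the watcher, e.g. by keeping the maximum value seen. A hypothetical sketch (the integer comparison is exactly the interpretation risk that Way A warns about):

```python
def track_max_rv(events, initial_rv=None):
    """Yield (max_rv_so_far, event); the caller, not the library,
    decides to interpret the opaque resource version as an integer."""
    rv = initial_rv
    for event in events:
        seen = event["object"]["metadata"]["resourceVersion"]
        if rv is None or int(seen) > int(rv):
            rv = seen
        yield rv, event

# With out-of-order versions, the remembered value never goes backwards:
fake = [{"object": {"metadata": {"resourceVersion": v}}} for v in ("300", "100", "900")]
print([rv for rv, _ in track_max_rv(fake)])  # ['300', '300', '900']
```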
Possibly related
- https://github.com/kubernetes-client/python/issues/693
- https://github.com/kubernetes-client/python/issues/700
- https://github.com/max-rocket-internet/newrelic-controller/issues/4
> The list...() operation (actually, the Kubernetes API) returns the resources in random arbitrary order. It can happen so that the very old resources (e.g. custom resources 1-month old) go last in the list.
Do you have any link to confirm that this is really designed behavior?
@mitar I couldn't find anything on this topic (though I didn't research it deeply). But this is the observed behaviour on a real cluster.
The only official description is here: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
The phrasing is such that it does not specify how the initial list must be fetched, or what its sort order is. It only says that a resourceVersion will be returned with the list, and, separately, that the "last returned" resourceVersion can be used to restore the watch. But the latter sentence means that the "last returned" version is from another, interrupted watch, not from the original list.
What can be useful is this: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
Here, it explains the nature of the resourceVersion and potential future changes. The last paragraph there hints that in the initial GET operation, the resourceVersion of the list must be used, not of the objects in the list:

> This may be used to ensure that no mutations are missed between a GET of a resource (or list of resources) and a subsequent Watch, even if the current version of the resource is more recent. This is currently the main reason that list operations (GET on a collection) return resourceVersion.
Perhaps, this is the resource version to be used for watch-stream's initial seeding:
In [2]: kubernetes.config.load_kube_config() # developer's config files
In [3]: api = kubernetes.client.CustomObjectsApi()
In [4]: api_fn = api.list_cluster_custom_object
In [5]: rsp = api_fn(group='example.com', version='v1', plural='mycrds')
In [6]: rv = rsp['metadata']['resourceVersion'] # <<< of the list
In [7]: rv
In [10]: w = kubernetes.watch.Watch()
In [11]: stream = w.stream(api_fn, 'example.com', 'v1', 'mycrds', resource_version=rv)
In [12]: for ev in stream: print((ev['type'], ev['object'].get('metadata', {}).get('name'), ev['object'].get('metadata', {}).get('resourceVersion'), ev['object'] if ev['type'] == 'ERROR' else None))
I can implement this logic — if somebody more proficient with Kubernetes (than me) can confirm I understood it right.
I suggest you look up the Go implementation and how they do it. I think there is a list+watch operation available in one utility function there. Python does not have such a utility function, so I think the current watch implementation in Python is correct to only operate on values from the watch itself.
I think this is actually related to #700 and #693 and is expected. Just ran into this myself... the correct solution is outlined in those issues' history.
This document describes how one should use watch: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
One line in that doc says etcd is by default set to remember 5 minutes of resourceVersions... I bet this may no longer be the case on GKE (at least in my case).
@nolar Thanks for your investigation. I did some tests and I can confirm that it works as you write. 'Watch' should get a list of objects to start watching with the proper version, the same way kubectl's watch works. Let me know if you need help with implementing it.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Just to be clear: it's not just the List action response that's ordered arbitrarily, but also the initial burst of fake ADDED events when doing a Watch without specifying resourceVersion?
Indeed, this scenario violates the documented guidance in api-concepts that "If the client watch is disconnected they can restart a new watch from the last returned resourceVersion, or ..."
Things I hope are true but can't find a clear promise of in the k8s docs :pray:
- [ ] Does starting a Watch from the resourceVersion returned for the List's collection always avoid the initial burst of past ADDED events?
- [ ] Are the events then always ordered?
- [ ] Even if you start with an old version?
- [ ] The docs explain that the List + Watch pattern avoids "missing any updates" / "ensure[s] that no mutations are missed". But does a List + Watch + Watch... sequence also guarantee not processing the same event multiple times? (Whether yes or no, this would be good to have explained.)
The API reference only says this about specifying resourceVersion:

> When specified with a watch call, shows changes that occur after that particular version of a resource. Defaults to changes from the beginning of history.
When specified for list:
- if unset, then the result is returned from remote storage based on quorum-read flag;
- if it's 0, then we simply return what we currently have in cache, no guarantee;
- if set to non zero, then the result is at least as fresh as given rv.
I recommend opening an issue in kubernetes/community with these questions. This is a good find of subtle points the k8s docs should explain better :clap:
The issue is that k8s gives no meaning to the resource version, but the resource version is just the exported modification index of its underlying datastore, etcd.
etcd explicitly does not guarantee linearizability of watch events https://github.com/kubernetes-client/python/issues/609#issuecomment-446302635

> etcd does not ensure linearizability for watch operations. Users are expected to verify the revision of watch responses to ensure correct ordering.

but it provides the modification index to allow clients to reorder the events again.
Imo the only fix is for the client libraries to interpret the resource version as what it really is, and not as what Kubernetes currently wants it to be: an opaque value, probably in order to allow changing the underlying datastore in the future.
I prepared a branch for that a while ago but have not posted it, as I had no reproducing example. This branch reorders the events in the client via a priority queue: https://github.com/juliantaylor/python-base/commits/unordered-events
I assume client-go does something very similar, as its watch functions are based on a local cache anyway and can then return values in resource-index order.
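The reordering idea can be sketched roughly as follows (a simplified illustration, not the actual code from the linked branch): buffer incoming events in a min-heap keyed by the integer resource version, and release them in version order once the buffer is deep enough to absorb the observed disorder.

```python
import heapq
import itertools

def reorder(events, buffer_size=8):
    """Re-yield watch events in resourceVersion order, buffering up to
    buffer_size events to absorb out-of-order delivery."""
    heap, tiebreak = [], itertools.count()  # counter avoids comparing dicts on ties
    for event in events:
        rv = int(event["object"]["metadata"]["resourceVersion"])
        heapq.heappush(heap, (rv, next(tiebreak), event))
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)[2]
    while heap:  # drain the buffer at the end of the stream
        yield heapq.heappop(heap)[2]

fake = [{"object": {"metadata": {"resourceVersion": v}}} for v in ("3", "1", "2")]
ordered = [e["object"]["metadata"]["resourceVersion"] for e in reorder(fake, buffer_size=2)]
print(ordered)  # ['1', '2', '3']
```

Note that events farther apart than the buffer depth can still come out misordered, so the buffer size must exceed the worst-case disorder of the stream.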
Hm, but if etcd returns versions per resource, that is, per object, no? So what is then the version returned by Kubernetes when you list multiple resources? How can you then watch a list of multiple resources, and how can you resume that? Is the version of a list of multiple resources simply max(version of every resource in the list)? Is this how Kubernetes does it? etcd revisions are monotonically increasing, but I am not sure if this is per resource or across all resources:

> Revisions are monotonically increasing over the lifetime of a cluster.

Would it not then simply be enough to always take the max of the currently known latest version and the new version information? So instead of just setting the currently known latest version to whatever Kubernetes returned, you compare and set it to the max?
A list operation returns a single object, and I assume Kubernetes sorts the entries in that list itself, so there should be no issue there.
The problem is only with watch operations: if they are unordered, simply storing the maximum for a reset would not work.
But I now found this in the documentation of etcd: https://github.com/etcd-io/etcd/blob/master/Documentation/learning/api.md#watch-streams
Watches make three guarantees about events:

> Ordered - events are ordered by revision; an event will never appear on a watch if it precedes an event in time that has already been posted.

This means that the results should not be unordered, but then I do not know where the problem is. Maybe a buggy etcd version was used, or Kubernetes does not keep this guarantee in its API abstraction.
I was not the only one confused by the old documentation, see https://github.com/etcd-io/etcd/issues/5592. Maybe it was only an issue in the past.
A little addition on this issue:
I have "solved" this by using the list-then-watch strategy. The resource version string seems to be scoped to the custom resource type, or maybe to the whole cluster and all resources. You can treat it as a timestamp of the resulting dataset. For the list operation, the API returns the resource version of the whole list: not necessarily that of the latest item in the list, but perhaps the current value of the version counter (as if it were "now"). The watch stream then continues from that point in the versioning history.
However, when listing, the objects in the list are modified: they have no `kind` & `apiVersion` fields, unlike in the individual GETs or in the watch-streams.
We could in theory take those from the `kind` & `apiVersion` fields of the list itself, but they are also different: `BlahBlahList` instead of `BlahBlah` as in the individual resources. The `…List` suffix has to be removed, and I just assumed that the list's kind is generated and not configurable, which can be wrong: https://github.com/zalando-incubator/kopf/blob/0.25/kopf/clients/fetching.py#L107-L110
If the "watching-with-no-starting-point" feature is going to be (re)implemented in this client library, these aspects should be taken into account.
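The back-filling described above can be sketched like this (a hypothetical helper; it bakes in the same possibly-wrong assumption that the list's kind is always `<Kind>List`):

```python
def fixup_list_items(listing):
    """Back-fill the kind/apiVersion fields that list items lack,
    assuming the list's kind is generated as '<Kind>List'."""
    kind = listing["kind"]
    item_kind = kind[: -len("List")] if kind.endswith("List") else kind
    for item in listing["items"]:
        item.setdefault("kind", item_kind)
        item.setdefault("apiVersion", listing["apiVersion"])
        yield item

listing = {"kind": "MycrdList", "apiVersion": "example.com/v1",
           "items": [{"metadata": {"name": "mycrd-1"}}]}
items = list(fixup_list_items(listing))
print(items[0]["kind"], items[0]["apiVersion"])  # Mycrd example.com/v1
```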
The list issue is https://github.com/kubernetes-client/python/issues/745. This is something that comes from the code generator, and I have not found an easy way to fix it.
kubernetes discussion: https://github.com/kubernetes/kubernetes/issues/74022. They decided they're not gonna fix Watch with RV "" or "0", people should move to List+Watch. Note interesting discussion starting https://github.com/kubernetes/kubernetes/issues/74022#issuecomment-465055778 about watch from RV "0" not being correctly restartable even if the initial list-pretending-to-be-ADDED were sorted.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten