kubeclient
watching stops without any notification
I've already seen multiple times that the watcher just stops ... without crashing or notifying ... I don't know how to reproduce it, but it happens regularly ... and it does not happen for kube-proxy, so either there is a bug in this library or kube-proxy's Go code has some smart disconnect handling.
atm I'm using the below and calling .restart every x minutes:
class KubernetesWatcher
  def initialize(kuber_client, namespace)
    @kuber_client = kuber_client
    @namespace = namespace
  end

  def watch(&block)
    loop do
      @watcher = @kuber_client.watch_endpoints(namespace: @namespace)
      @watcher.each(&block)
    end
  end

  def restart
    @watcher.finish
  end
end
I don't know how to fix/debug this further but wanted to raise awareness.
cc @cben @agrare @moolitayer
@grosser when you say "just stops", do you mean .each { block... } returns at some point, or keeps running but never calls the block?
Which Kubernetes version are you connecting to? And which Ruby are you running?
Don't know if it's related, but I ran oc get pod --watch yesterday and noticed it just exited after some time.
I don't know if the block stops or it just idles ... I'll try to find out ... if that's the problem then it's easy to fix :D
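One cheap way to tell the two apart (rough sketch, just wrapping the existing watch with logging):

watcher = kuber_client.watch_endpoints(namespace: namespace)
watcher.each { |notice| puts "#{Time.now} #{notice.type}" }
puts "#{Time.now} each returned" # only reached if the watch really ended

If "each returned" shows up while the cluster is still producing changes, the block really stops; if nothing prints at all, it's just idling.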
confirmed that the block just stops ... so having a reconnect by default or optional would be nice ... atm I'm just doing loop do ... watch ... end
Actually I am not sure we want to add this to the lib. How would you know that there was a disconnection? (Maybe some message was lost, etc.) I would keep this as is.
then we should call that out in the readme ... "watcher will stop when it is disconnected, run it in a loop to make sure it stays running" or so
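Something this small in the README would probably be enough (sketch; handle_notice is a placeholder):

loop do
  client.watch_pods(namespace: "default").each { |notice| handle_notice(notice) }
  # each returns whenever the connection is dropped, so just go around again
end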
Definitely :+1:
I wrote a service which watches namespaces, for the purpose of emitting a continuous stream of events so a person can see (the next day) when their pod restarted and why. My complaint here is: yes, my watch loop is in an infinite loop, but when the loop starts again you end up getting the same data you previously emitted, because the watches overlap. This stinks. To work around it, I'm generating a hash and dropping the "already read" events. Anyone have a better idea?
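The hash-based workaround is roughly this shape (simplified sketch; watch_events stands in for whatever is actually being watched, and emit is a placeholder):

require "digest"
require "json"
require "set"

seen = Set.new # grows without bound; cap or expire it in practice
loop do
  kuber_client.watch_events(namespace: namespace).each do |notice|
    # fingerprint the whole notice and drop anything already emitted
    key = Digest::SHA256.hexdigest(JSON.dump(notice.to_h))
    next if seen.include?(key)
    seen << key
    emit(notice)
  end
end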
- ignore the first events after a restart ... not perfect, but maybe better if missing an event is not that critical:
loop do
  ignore_until = Time.now.to_f + 0.5 # re-processing happens in the first 0.4s
  kuber_client.watch_endpoints.each do |e|
    next if Time.now.to_f < ignore_until
    puts e
  end
end
- store the timestamp of the last received event and reject anything older on restart (rough sketch below)
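A rough sketch of that second option, assuming the watched objects are Events (which carry a lastTimestamp; other kinds would need a different field, and emit is again a placeholder):

require "time"

last_seen = Time.at(0)
loop do
  kuber_client.watch_events(namespace: namespace).each do |notice|
    stamp = Time.parse(notice.object.lastTimestamp)
    next if stamp <= last_seen # already handled before the restart
    last_seen = stamp
    emit(notice)
  end
end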
I've seen this happening with curl after a little over an hour on Kubernetes v1.7.6+a08f5eeb62:
$ time curl -k -s -H "Authorization: Bearer $OSH_TOKEN" https://$OSH_HOST:8443/api/v1/pods?watch=139024
...
...
...
real 71m49.189s
user 0m0.336s
sys 0m0.332s
This suggests it is caused by a k8s issue.
@agrare @ladas FYI the pod watch in ManageIQ will probably encounter this issue and restart the collection worker (which sounds asymptomatic).
Thanks @moolitayer, we're okay because if the thread exits it will be restarted and currently we don't process the initial list of pods.
This is a little concerning when we move to using watches as the primary mechanism for refresh because then we will be processing the entire initial set every hour. Basically the same scheduled full refresh we do today :worried:
Hey, so I just wanted to add my latest findings here, following a k8s upgrade.
The logs are below, but long story short: the latest resourceVersion is 44924022, but if you use that as your starting point, k8s returns 410 GONE (because these particular resources haven't been updated in quite some time).
The only way to get a watcher started then is to use 0, which returns you ALL the objects.
You'll then need to filter the returned objects to be >= 44924022.
It's quite shit really, as you're potentially returning a lot of objects from the k8s API, especially when the connection times out so frequently (seemingly every 90 seconds or so for CRDs in particular).
I, [2018-10-16T22:01:26.119618 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:01:26.882520 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset. Some events may have been missed!
I, [2018-10-16T22:01:26.882659 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:01:27.160224 #98517] INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:01:27.763382 #98517] INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:01:27.763739 #98517] INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:01:28.021200 #98517] INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:03:00.473269 #98517] WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44891278
I, [2018-10-16T22:03:00.473339 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44891278
E, [2018-10-16T22:03:00.916934 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44891278, will restart from 44924022 which is the next known offset. Some events may have been missed!
I, [2018-10-16T22:03:00.917068 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:03:01.411459 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset. Some events may have been missed!
I, [2018-10-16T22:03:01.411611 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:03:01.641530 #98517] INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:03:01.978240 #98517] INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:03:01.978561 #98517] INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:03:02.520127 #98517] INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:04:30.996193 #98517] WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44924022
I, [2018-10-16T22:04:30.996243 #98517] INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:04:31.654225 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset. Some events may have been missed!
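Put together, the workaround described above looks roughly like this (a sketch; watch_pods stands in for the CRD watch in the logs, process is a placeholder, it assumes the client's watch calls accept resource_version, and it compares versions as integers, which the API does not strictly guarantee):

known = "44924022" # last version actually processed
start = known      # where to ask the watch to start
loop do
  client.watch_pods(resource_version: start).each do |notice|
    if notice.type == "ERROR" && notice.object.code == 410
      start = "0" # our version fell out of history: replay everything, filter below
      break
    end
    version = notice.object.metadata.resourceVersion
    next if version.to_i <= known.to_i # drop replays of objects already handled
    process(notice)
    known = version
    start = version
  end
end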
Does anyone know if resourceVersion is unique to the kind, eg a v1/secret, or is it incrementing across all objects?
Reading https://github.com/kubernetes/kubernetes/issues/55230 it looks like you can specify a timeoutSeconds on the watch API call?
Does anyone know if the resourceVersion ever rolls over? At the moment quite a bit of my de-dupe code is basically resourceVersion > lastStoredResourceVersion.
If that value doesn't just continually increment then eventually my app will break.
The answers in this discussion: https://github.com/kubernetes/website/pull/6540#discussion_r160742783 explain some things well.
Does anyone know if resourceVersion is unique to the kind, eg a v1/secret, or is it incrementing across all objects?
By default it is shared across many collections, but it might be separate; for example, Events have an independent version by default. Do not assume any relation between versions of different collections.
resourceVersion > lastStoredResourceVersion
If that value doesn't just continually increment then eventually my app will break
So this presently works AFAIK, but k8s devs are very insistent that versions are opaque strings, which might not even be integers in the future, and clients shouldn't compare them for order, only equality :-(
The only way to get a watcher started then is to use 0, which returns you ALL the objects.
~~I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?~~
You can also do a list request (get_foos) and take that version. That also loads all existing objects, but at least you get them separately from future versions (rough sketch below).
- If you don't care about the list of existing objects, there are a few API options to reduce the traffic, but nothing yet in kubeclient I think: chunking #283 #356, PartialObjectMetadata https://github.com/kubernetes/kubernetes/issues/53224
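A rough sketch of the list-then-watch variant (assuming a kubeclient version whose watch methods accept resource_version):

pods = client.get_pods(namespace: "default")           # the existing objects, as an EntityList
version = pods.resourceVersion                          # resourceVersion of the list itself
client.watch_pods(namespace: "default", resource_version: version).each do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"  # only changes after that list
end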
I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?
Nah, my bad, that would be too convenient to be true :wink: API docs say: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#watch-list-pod-v1-core
resourceVersion — When specified with a watch call, shows changes that occur after that particular version of a resource. Defaults to changes from the beginning of history. When specified for list:
- if unset, then the result is returned from remote storage based on quorum-read flag;
- if it's 0, then we simply return what we currently have in cache, no guarantee;
- if set to non zero, then the result is at least as fresh as given rv.
So there is a slight difference between 0 and omitted, but both give you old data...
FYI code I'm currently using to have cheap restarts and no re-plays on restart:
# frozen_string_literal: true
module Watching
  def initialize(*)
    super
    @started = latest_resource_version
  end

  private

  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410 # version was too old, watch will stop
          @started = latest_resource_version
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end

  # get something cheap to get latest resourceVersion from the List object that is returned
  def latest_resource_version
    get_json(:namespaces, limit: 1, raw: true).dig(:metadata, :resourceVersion)
  end
end
Is there no way to accomplish this "persistent" connection without using a loop? 😞
@grosser isn't there a race condition in the Watching module above, when the latest resourceVersion changes in between and you therefore miss updates?
Yes, but that would happen very rarely (when things break down).
Could improve it by comparing new versions to the previous @started.