
watching stops without any notification

Open grosser opened this issue 7 years ago • 22 comments

Already saw multiple times that the watcher just stops ... without crashing / notifying ... Idk how to reproduce that, but it happens regularly ... and it does not happen for kube-proxy so either there is a bug in this library or kube-proxy go-lang code has some smart disconnect handling

atm using below and calling .restart every x minutes

class KubernetesWatcher
  def initialize(kuber_client, namespace)
    @kuber_client = kuber_client
    @namespace = namespace
  end

  def watch(&block)
    loop do
      @watcher = @kuber_client.watch_endpoints(namespace: @namespace)
      @watcher.each(&block)
    end
  end

  def restart
    @watcher.finish
  end
end
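
For illustration, it might be driven something like this (the timer thread, interval, and namespace below are placeholders, not part of the original setup):

watcher = KubernetesWatcher.new(kuber_client, 'default')

# periodically finish the underlying HTTP watch; the loop in #watch reconnects
Thread.new do
  loop do
    sleep 10 * 60 # "every x minutes"
    watcher.restart
  end
end

watcher.watch { |notice| puts "#{notice.type} #{notice.object.metadata.name}" }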

idk how to fix/debug this further but wanted to raise awareness.

grosser avatar Nov 02 '17 21:11 grosser

cc @cben @agrare @moolitayer

simon3z avatar Nov 02 '17 21:11 simon3z

@grosser when you say "just stops" do you mean .each { block... } returns at some point? or keeps running but never calls the block?

Which Kubernetes version are you connecting to? And which Ruby are you running?

Don't know if it's related, but I ran oc get pod --watch yesterday and noticed it just exited after some time.

cben avatar Nov 02 '17 22:11 cben

idk if the block stops or it just idles ... I'll try to find out ... if that's the problem then it's easy to fix :D


grosser avatar Nov 02 '17 22:11 grosser

confirmed that the block just stops ... so having a reconnect by default or optional would be nice ... atm I'm just doing loop do ... watch ... end

grosser avatar Nov 03 '17 00:11 grosser

Actually I am not sure we want to add this in the lib. How would you know that there was a disconnection? (Maybe some message was lost, etc.). I would keep this as is.

simon3z avatar Nov 03 '17 12:11 simon3z

then we should call that out in the readme ... "watcher will stop when it is disconnected, run it in a loop to make sure it stays running" or so


grosser avatar Nov 03 '17 14:11 grosser

then we should call that out in the readme ... "watcher will stop when it is disconnected, run it in a loop to make sure it stays running" or so

Definitely :+1:

simon3z avatar Nov 06 '17 10:11 simon3z

I wrote a service which watches namespaces, for the purpose of emitting a continuous stream of events so a person (the next day) can see when their pod restarted, and for what reason. My complaint here is that yes, my watch loop is in an infinite loop, but when the loop starts again you end up getting the same data you previously emitted, because the iterations overlap. This stinks; to work around it, I'm generating a hash and dropping the "already read" events. Does anyone have a better idea?

Eric-Fontana-Bose avatar Nov 15 '17 18:11 Eric-Fontana-Bose

  • ignore first events after restart ... not perfect but maybe better if missing an event is not that critical

  loop do
    ignore_until = Time.now.to_f + 0.5 # re-processing happens in the first 0.4s
    kuber_client.watch_endpoints.each do |e|
      next if Time.now.to_f < ignore_until
      puts e
    end
  end

  • store the timestamp of the last received event and reject all before that on restart (see the sketch below)
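
A rough sketch of the second option (drop replayed notices on reconnect using a marker from the last processed event). It assumes the watched objects are core v1 Events, which carry a lastTimestamp field; other kinds would need a different marker, e.g. remembering already-seen resourceVersions.

require 'time'

last_seen = Time.at(0)

loop do
  kuber_client.watch_events(namespace: 'default').each do |notice|
    event_time = Time.parse(notice.object.lastTimestamp)
    # drop anything replayed after a reconnect; note that distinct events
    # within the same second as the last processed one get skipped too
    next if event_time <= last_seen

    last_seen = event_time
    puts "#{notice.type} #{notice.object.reason} #{notice.object.message}"
  end
end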

grosser avatar Nov 15 '17 19:11 grosser

I've seen this happening with curl after a little over an hour in kubernetes v1.7.6+a08f5eeb62

$ time curl -k -s -H "Authorization: Bearer $OSH_TOKEN"     https://$OSH_HOST:8443/api/v1/pods?watch=139024
...
...
...
real	71m49.189s
user	0m0.336s
sys	0m0.332s

This suggests this is caused by a k8s issue.

moolitayer avatar Nov 16 '17 14:11 moolitayer

@agrare @ladas FYI the pod watch in ManageIQ will probably encounter this issue and restart the collection worker (which sounds asymptomatic)

moolitayer avatar Nov 16 '17 14:11 moolitayer

Thanks @moolitayer, we're okay because if the thread exits it will be restarted and currently we don't process the initial list of pods.

This is a little concerning when we move to using watches as the primary mechanism for refresh because then we will be processing the entire initial set every hour. Basically the same scheduled full refresh we do today :worried:

agrare avatar Nov 16 '17 14:11 agrare

Hey, so I just wanted to add my latest finding here, following a k8s upgrade.

The logs are below, but long story short: the latest resourceVersion is 44924022, however if you use that as your starting point, k8s returns 410 GONE (because these particular resources haven't been updated in quite some time).

The only way to get a watcher started then is to use 0, which returns you ALL the objects.

You'll then need to filter the returned objects to be >= 44924022.

It's quite shit really, as you're potentially returning a lot of objects from the k8s API, especially when the connection times out so frequently (seemingly every 90 seconds or so for CRDs in particular).

I, [2018-10-16T22:01:26.119618 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:01:26.882520 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:01:26.882659 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:01:27.160224 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:01:27.763382 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:01:27.763739 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:01:28.021200 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:03:00.473269 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44891278
I, [2018-10-16T22:03:00.473339 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44891278
E, [2018-10-16T22:03:00.916934 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44891278, will restart from 44924022 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:00.917068 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:03:01.411459 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:01.411611 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:03:01.641530 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:03:01.978240 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:03:01.978561 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:03:02.520127 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:04:30.996193 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44924022
I, [2018-10-16T22:04:30.996243 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:04:31.654225 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
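
For illustration, a minimal sketch of the filter-on-restart workaround described above (the client, core v1 services, and namespace-less call are stand-ins for the CRD watch in the logs). Keep in mind that k8s documents resourceVersion as an opaque string, so numeric ordering comparisons like this are not guaranteed to keep working.

# on 410 GONE we restart from "0" (full replay) and drop objects whose
# resourceVersion is not newer than the last one handled
def watch_filtering_replays(client, last_seen = 0)
  loop do
    client.watch_services(resource_version: '0').each do |notice|
      version = notice.object.metadata.resourceVersion.to_i
      next if version <= last_seen # replayed object from the full listing

      last_seen = version
      puts "#{notice.type} #{notice.object.metadata.name}"
    end
  end
end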

Stono avatar Oct 16 '18 21:10 Stono

Does anyone know if resourceVersion is unique to the kind, eg a v1/secret, or is it incrementing across all objects?

Stono avatar Oct 16 '18 21:10 Stono

reading https://github.com/kubernetes/kubernetes/issues/55230 it looks like you can specify a timeoutSeconds on the watch api call?

Stono avatar Oct 16 '18 22:10 Stono

Does anyone know if the resourceVersion ever rolls over (wraps around)? At the moment quite a bit of my de-dupe code is basically resourceVersion > lastStoredResourceVersion.

If that value doesn't just continually increment then eventually my app will break

Stono avatar Oct 17 '18 07:10 Stono

The answers in this discussion: https://github.com/kubernetes/website/pull/6540#discussion_r160742783 explain some things well.

Does anyone know if resourceVersion is unique to the kind, eg a v1/secret, or is it incrementing across all objects?

By default it's shared across many collections, but it might be separate; for example, Events have an independent version by default. Do not assume any relation between versions of different collections.

resourceVersion > lastStoredResourceVersion

If that value doesn't just continually increment then eventually my app will break

So this presently works AFAIK but k8s devs are very insistent that versions are opaque strings, which might not even be integers in future, and clients shouldn't compare them for order, only equality :-(

The only way to get a watcher started then is to use 0, which returns you ALL the objects.

~~I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?~~

You can also do a list request (get_foos) and take that version. That also loads all existing objects, but at least you get them separately from future versions.

  • If you don't care about the list of existing objects, there are a few api options to reduce the traffic but nothing yet in kubeclient I think: chunking #283 #356, PartialObjectMetadata https://github.com/kubernetes/kubernetes/issues/53224
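
A minimal sketch of the list-then-watch approach above, assuming a configured Kubeclient::Client in client: the list returns the existing objects once along with the collection's current resourceVersion, and a watch started from that version should only yield later changes.

pods = client.get_pods(namespace: 'default')
version = pods.resourceVersion # the returned EntityList carries the collection's resourceVersion

client.watch_pods(namespace: 'default', resource_version: version).each do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"
end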

cben avatar Oct 18 '18 09:10 cben

I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?

Nah, my bad, that would be too convenient to be true :wink: API docs say: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#watch-list-pod-v1-core

resourceVersion — When specified with a watch call, shows changes that occur after that particular version of a resource. Defaults to changes from the beginning of history. When specified for list: - if unset, then the result is returned from remote storage based on quorum-read flag; - if it's 0, then we simply return what we currently have in cache, no guarantee; - if set to non zero, then the result is at least as fresh as given rv.

So there is a slight difference between 0 and omitted, but both give you old data...

cben avatar Oct 18 '18 09:10 cben

FYI code I'm currently using to have cheap restarts and no re-plays on restart:

# frozen_string_literal: true

module Watching
  def initialize(*)
    super
    @started = latest_resource_version
  end

  private

  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410 # version was too old, watch will stop
          @started = latest_resource_version
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end

  # get something cheap to get latest resourceVersion from the List object that is returned
  def latest_resource_version
    get_json(:namespaces, limit: 1, raw: true).dig(:metadata, :resourceVersion)
  end
end
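
(For context: get_json above is a helper from a wrapper rather than part of kubeclient. A rough kubeclient-only sketch of the same idea, with illustrative names and endpoints as the example resource, might look like this.)

require 'kubeclient'

class ResumableWatch
  def initialize(client)
    @client = client
    @version = latest_resource_version
  end

  # watch, resuming from the last seen resourceVersion; on a 410 GONE error
  # notice, re-list to pick up a fresh version and keep going
  def each
    loop do
      @client.watch_endpoints(resource_version: @version).each do |notice|
        if notice.type == 'ERROR' && notice.object.code == 410 # version was too old
          @version = latest_resource_version
        else
          @version = notice.object.metadata.resourceVersion
          yield notice
        end
      end
    end
  end

  private

  # any cheap list works: the returned EntityList carries the collection's
  # current resourceVersion
  def latest_resource_version
    @client.get_namespaces.resourceVersion
  end
end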

grosser avatar Dec 20 '19 03:12 grosser

Already saw multiple times that the watcher just stops ... without crashing / notifying ... Idk how to reproduce that, but it happens regularly ... and it does not happen for kube-proxy so either there is a bug in this library or kube-proxy go-lang code has some smart disconnect handling

atm using below and calling .restart every x minutes

class KubernetesWatcher
  def initialize(kuber_client, namespace)
    @kuber_client = kuber_client
    @namespace = namespace
  end

  def watch(&block)
    loop do
      @watcher = @kuber_client.watch_endpoints(namespace: @namespace)
      @watcher.each(&block)
    end
  end

  def restart
    @watcher.finish
  end
end

idk how to fix/debug this further but wanted to raise awareness.

Is there no way to accomplish this "persistent" connection without using a loop? 😞

mecampbellsoup avatar May 07 '20 03:05 mecampbellsoup

FYI code I'm currently using to have cheap restarts and no re-plays on restart:

# frozen_string_literal: true

module Watching
  def initialize(*)
    super
    @started = latest_resource_version
  end

  private

  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410 # version was too old, watch will stop
          @started = latest_resource_version
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end

  # get something cheap to get latest resourceVersion from the List object that is returned
  def latest_resource_version
    get_json(:namespaces, limit: 1, raw: true).dig(:metadata, :resourceVersion)
  end
end

@grosser isn't there a race condition here when the latest resourceVersion changes in between and therefore you miss updates?

aantn avatar Mar 19 '21 15:03 aantn

Yes, but that would happen very rarely (when things break down); it could be improved by comparing new versions to the previous @started.

grosser avatar Mar 28 '21 21:03 grosser