document that the watcher stops (see #273) and add each_with_retry
A nicer alternative to the current `loop { watch }` we are using. Could also make `retry: true` an option for `each`, but I wanted to keep it clean.
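For context, a rough sketch of the pattern this replaces and of the shape such a helper could take. This is not the actual diff; the `done` return-value convention and `watch_pods` as the example watcher are assumptions based on the discussion below.

```ruby
# What callers do today: restart the watch manually, forever. This retries
# even when the caller deliberately stopped watching, so it can never finish.
loop do
  client.watch_pods.each { |notice| handle(notice) }
end

# Sketch of the helper: keep watching until the block exits deliberately.
# Assumes `each` returns a truthy value only when the stream ended on its own.
def each_with_retry(&block)
  loop do
    done = each(&block)
    break unless done # falsy: the caller broke out or finished, so stop retrying
  end
end
```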
@simon3z
/cc @jonmoter
As mentioned on issue #273, I don't think that logic like each_with_retry belongs in kubeclient.
:+1: for the documentation.
@grosser on the specifics of the code: what is the new done flag used for?
cc @moolitayer @cben
It will be false if the user breaks or returns inside the yield. To tell the difference between finished and aborted, the code has to be inside the watcher. The simple loop { } I did previously would never allow it to finish; that's why I added each_with_retry :)
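A simplified illustration of that flag (not the real WatchStream code; the line-by-line body iteration is assumed just for the example):

```ruby
require 'json'

def each
  @response.body.each_line do |line|
    # If the caller does `break` or `return` inside this block, `each` exits
    # right here and never reaches the `true` below, so its result is falsy.
    yield JSON.parse(line)
  end
  true # reached only when the stream ran out, i.e. the server closed it
end
```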
On Thu, Nov 9, 2017 at 2:18 AM, Beni Cherniavsky-Paskin wrote:
@cben commented on this pull request, in lib/kubeclient/watch_stream.rb (https://github.com/abonas/kubeclient/pull/275#discussion_r149917529):
> `done`: the only way I see this can be false is if the connection was interrupted before at least one complete line was received?
ok to merge this ?
I'd think having a stable/reliable watcher is a very common use case; there can be many reasons a connection gets interrupted, and blindly retrying breaks when also using .finish.
@simon3z I can haz merge ?
cc @moolitayer @cben
Hi, I'm very much aware I owe a review here. Meanwhile I've been trying to figure out how to restart precisely. I recently asked on https://github.com/kubernetes/website/pull/6540#discussion_r160745657 and, IIUC the answers, it's officially possible to take the last-seen resourceVersion of an individual object in the collection during a watch and start a new collection watch from that version. (One should still deal with the possibility of a 410 error meaning the version is too old.)
Also, @pkliczewski just told me that the official Go client's "informer framework" apparently solves this; I haven't checked it yet.
So, about this PR:
- WatchStream here remains version-agnostic. If the user created it via the `watch_*` functions with a `resource_version:` arg, it will have the version baked into the URL; if not, it won't. `each_with_retry` with a specific starting version is not good: it means you'll get the same data again. Generally, a restart with overlap is problematic to consume: assuming you want to ignore out-of-date updates, those are hard to recognize, because comparing two versions is not allowed.
IMHO restart logic does belong in kubeclient, but we'd need something more version-aware. @grosser what do you think, how would you like to continue this?
so "remove resource_version from subsequent calls"
or "keep track of resource_version and pass it when retrying" ?
... first seems reasonable/easy, second would be nice, but seems a bit scary
We are having similar issues where the connection is closed and the watcher crashes instead of retrying.
+1 for having some logic in kubeclient to:
- Retry when the connection is closed.
- Keep track of `resource_version` and pass it when retrying (see the sketch below).
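A hedged caller-side sketch of that second point. `watch_pods`, the `resource_version` option, and the notice fields follow kubeclient's existing watch API; the resumption loop itself and `process` are illustrative only, and it still ignores the 410 problem discussed below.

```ruby
version = nil # not set on the first call: start from the server's current state
loop do
  opts = version ? { resource_version: version } : {}
  client.watch_pods(opts).each do |notice|
    version = notice.object.metadata.resourceVersion # remember the last seen version
    process(notice) # placeholder for the caller's own handling
  end
  # Stream ended (e.g. the server closed the connection): loop around and
  # resume from the last version we saw instead of re-listing everything.
end
```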
I quite forgot this PR. Putting resumption aside, there was a good point here that .each should distinguish deliberate exit — by break / return / .finish — from forced connection closure.
I'm by now aware of 5 ways watching can exit:
1. `.finish` => exits cleanly :heavy_check_mark:
2. `break`/`return` => exits cleanly :heavy_check_mark:
3. error such as 410 Gone => does not raise any exception, instead passes a "notice" with `ERROR` type into the block, then exits cleanly :man_facepalming: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/213
4. connection closed by server => may exit cleanly with no exception :confused:
5. connection closed by server/middleboxes(?)/network problems => may raise `HTTP::ConnectionError`. https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/194
The technical reason is that we have a handle_exception helper for the RestClient code paths but forgot to do any error handling for the HTTP-gem code paths :rofl:
(4) vs (5) is clearly a mess. I'm tempted to say the user doesn't care how exactly a watch got disconnected; the bottom line is that the "infinite" loop got broken, and the user needs to resume/retry in the same way.
If backward compatibility is not a question, I'd argue (3) (4) (5) should all throw an exception.
(and not leak the HTTP gem's exception; users shouldn't need to know which HTTP libraries kubeclient uses, so better to wrap it with our Kubeclient::HttpError).
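The wrapping itself could be as small as this; a sketch only, assuming the streaming read on the HTTP-gem path is where the exception escapes, and using Kubeclient::HttpError's existing (code, message, response) arguments with no real status code available:

```ruby
buffer = '' # accumulate chunks until we have complete lines
begin
  @response.body.each { |chunk| buffer << chunk } # the HTTP-gem streaming path
rescue HTTP::ConnectionError => err
  # Re-raise as our own error class so callers never see the underlying
  # HTTP library; a dropped connection has no real HTTP status code.
  raise Kubeclient::HttpError.new(0, err.message, nil)
end
```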
It was an unplanned accident that (4) may exit cleanly; from the caller's POV the call promises an "infinite" loop, and IMO any disconnection should be an in-your-face error so people understand it's something they must handle.
The solution here of returning a boolean has the benefit of backward compatibility. There is existing code out there calling watch and handling resumption, and it's written assuming clean exit (4)... Most of it will fail in case of error (5), though?
But a boolean doesn't cover (3) well. I mean, the current behavior of passing error data into the block is ... workable, but not elegant. And handling 410 is essential to correct usage if you pass resourceVersion!
A "meta" consideration: as we've seen here, there is value in accumulating experience of how things actually break, and if we just swallow errors into a boolean, there will be no details in logs, so we (kubeclient) will not learn, and users will not learn.
I propose to break compatibility (and bump major version) and turn (3) (4) (5) into Kubeclient errors.
@grosser @qingling128 @jcantrill @mszabo @Stono @fradee what do you think? Would that be a step forward? Or too disruptive?
[And yes, I still hope in some future version we'll get Kubeclient itself resuming after disconnection. IFF initial resourceVersion was set. And it will still leave caller to handle 410, so I think we need to fix the interface first anyway.]
[I'm also working on adding some docs on watching, because README is really lacking...]
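Under that proposal, caller-side resumption might look roughly like this (assumed API: (3)/(4)/(5) all surface as Kubeclient::HttpError, with error_code populated when there is one; the resumption policy itself is just an example):

```ruby
version = nil
begin
  opts = version ? { resource_version: version } : {}
  client.watch_pods(opts).each do |notice|
    version = notice.object.metadata.resourceVersion
    process(notice)
  end
rescue Kubeclient::HttpError => e
  version = nil if e.error_code == 410 # 410 Gone: our version is too old, re-list
  retry # resume the watch, from scratch after a 410
end
```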
@cben Thanks for working on this. I think it would be a step forward.
Be aware that if you're watching streams that don't change frequently, you'll see 410 GONE quite a bit when you reconnect.
Examples are, say, CRDs that are rarely updated. When you reconnect you'll likely get a 410 GONE if you're trying to resume from the last-seen CRD, and you'll need to resume from 0 instead, dropping any ADDED events with a timestamp <= the start of your stream.
Raising on (3)/(4)/(5) sounds like an improvement over silently stopping the loop; it should be easy for client code to see/catch.
I'd still like a `resume: true` option (or on by default, though that risks loops) that takes the resourceVersion of the incoming objects and then reconnects with the latest version when disconnected.
Re https://github.com/abonas/kubeclient/pull/275#issuecomment-591928097: that does sound like a step forward to me. We did run into an issue, fluent-plugin-kubernetes_metadata_filter/pull/214, where an exception was not thrown as expected.
Confirmed: (3) {"type": "ERROR", ...} notices are returned with a 200 status :neutral_face:
Turns out it's deliberate in kubernetes per https://github.com/kubernetes/kubernetes/issues/25151:
> Either way, for web sockets if we don't fix this clients can't get this error (the browser eats the 410 during the websocket negotiation and client code can't get access). So it's at least broken for web sockets.
This is unfortunate for :raw mode — if we don't parse the input, we can't raise an exception.
- Could document this exception?
- Or we could add heuristic parsing: if the raw text contains `ERROR`, parse it just to check whether to raise an exception, otherwise discarding the parse results (a sketch follows below).
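A sketch of what that heuristic could look like; all names here are made up for illustration, and the only Kubernetes-side assumption is that an ERROR notice is a JSON line with `"type": "ERROR"` carrying a Status object:

```ruby
require 'json'

WATCH_ERROR_HINT = /"type"\s*:\s*"ERROR"/.freeze

# In :raw mode, peek at a line only when it looks like an ERROR notice,
# purely to decide whether to raise; the parse result is then thrown away.
def check_raw_line_for_error!(line)
  return unless line =~ WATCH_ERROR_HINT
  notice = JSON.parse(line)
  return unless notice['type'] == 'ERROR'
  status = notice['object'] || {}
  raise Kubeclient::HttpError.new(status['code'], status['message'], nil)
end
```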
Thank you for your work and the great discussion! Any plan for this to be merged?