With many deployments, keel can hit the event buffer size limit:
level=info msg="event channel is full, len: 128, cap: 128" context=buffer
We might want to remove the buffer altogether, or add some batching to it to absorb bursts of incoming changes.
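To illustrate what fills up here, a minimal Go sketch of a bounded event buffer (the `Event` type, `newBuffer` and `submit` are made-up names for illustration, not keel's actual code): once nothing drains the channel, the non-blocking send finds it full and only logs a warning like the one above.

```go
// Illustrative only: not keel's implementation. Shows how a bounded buffer
// ends up dropping events once the consumer falls behind.
package main

import "log"

type Event struct{ Image, Tag string }

type buffer struct {
	events chan Event
}

func newBuffer(size int) *buffer {
	return &buffer{events: make(chan Event, size)}
}

// submit tries a non-blocking send; when the channel is already full,
// the event is not queued and only a warning is logged.
func (b *buffer) submit(e Event) {
	select {
	case b.events <- e:
	default:
		log.Printf("event channel is full, len: %d, cap: %d", len(b.events), cap(b.events))
	}
}

func main() {
	b := newBuffer(2) // tiny buffer so the overflow is easy to trigger
	for i := 0; i < 4; i++ {
		// With nothing draining b.events, the 3rd and 4th submits log the warning.
		b.submit(Event{Image: "example/app", Tag: "latest"})
	}
}
```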
I see that keel stopped updating some of my deployments, while others are updated successfully. I see exactly that message in the logs; could it be the reason?
@rusenask I see the same message in my setup:
- Is there an interim solution?
- Does it help if we run multiple keel containers to mitigate this? I am currently running a single keel container.
@rusenask Same issue on my side; running multiple keel pods shouldn't solve it since I'm using the polling approach. I'm looking into making the queue size configurable, do you have any other options?
Hi, I think we could just remove the buffer altogether; it should be fine without it. Feel free to create a PR :) I have been really swamped for the last few months, so I can't do much support on keel.
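For comparison, removing the buffer would mean a plain blocking send on an unbuffered channel, so producers wait for the consumer instead of dropping events; a minimal sketch of that behaviour (again illustrative, not the actual keel code):

```go
// Sketch of the "no buffer" suggestion: with an unbuffered channel the send
// blocks until a consumer receives it, so events are never dropped; the
// producer just experiences back-pressure instead.
package main

import "fmt"

type Event struct{ Image, Tag string }

func main() {
	events := make(chan Event) // unbuffered: every send waits for a receiver

	go func() {
		// Blocks here until main receives below; nothing is silently lost.
		events <- Event{Image: "example/app", Tag: "latest"}
	}()

	e := <-events
	fmt.Println("processed", e.Image, e.Tag)
}
```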
Edit: Keel seems to be working fine for us again, so I'm going to put this down to something else in our environment. We'll keep investigating if it crops up again.
After a few months of everything working fine we just saw this issue again.
Interestingly, Keel eventually picked up on the image digest change ~37 hours later and successfully updated the deployments then. Although events are re-submitted to the channel, is it possible that some events take priority over others? Or is something else keeping the event channel from being processed sequentially?
We've bumped up to Kubernetes 1.17.5 since the first report and have 301 pods total in the cluster, but other than that not much has changed.
I am having the same issue. I have deployed keel locally (dev environment) and it works fine. I added it to one of our QA clusters (where there are many more pods running), but it never seems to get the webhook I am sending (native). I am running the 'latest' tag.
How can I help debug this? Is there no logging for the web service? I can't even tell if it's hitting the webhook, though the admin page at the same URL works, so I assume it's not that.
When you send the webhook directly through curl or any REST client, does it return an error or just a 200 status?
@rusenask 200
using curl:
...
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
* We are completely uploaded and fine
< HTTP/2 200
< date: Fri, 18 Dec 2020 18:25:24 GMT
< content-length: 0
< access-control-allow-headers: Accept, Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization
< access-control-allow-methods: POST, GET, OPTIONS, PUT, DELETE
< access-control-allow-origin: *
< access-control-expose-headers: Authorization
< access-control-request-headers: Authorization
<
...
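For reference, the equivalent request built programmatically would look roughly like this. This is a sketch assuming the native webhook endpoint path from keel's docs (/v1/webhooks/native); the service address, image name and tag below are placeholders for your own setup.

```go
// Hypothetical sketch of sending keel's native webhook from Go instead of curl.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type nativeWebhook struct {
	Name string `json:"name"` // image repository name
	Tag  string `json:"tag"`  // new tag to roll out
}

func main() {
	payload, err := json.Marshal(nativeWebhook{Name: "example/my-app", Tag: "1.2.3"})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(
		"http://keel.keel:9300/v1/webhooks/native", // placeholder in-cluster address
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// As seen above, a 200 here only means the endpoint accepted the request;
	// it does not by itself confirm that a tracked deployment was matched.
	fmt.Println("status:", resp.StatusCode)
}
```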
After a good day with DEBUG=true set, I think I finally found the reason our deployment isn't updating. It turns out the event buffer full message was a red herring for us.
We have two deployments that use the same image; one of them has a custom pollSchedule so it doesn't update while it's taking backups, and the other can update whenever.
The annotations on deployment 1:
annotations:
  keel.sh/trigger: poll
  keel.sh/policy: force
  keel.sh/match-tag: "true"
  keel.sh/pollSchedule: "*/1 19-23,0-14 * *"
vs. the annotations on deployment 2:
annotations:
  keel.sh/trigger: poll
  keel.sh/policy: force
  keel.sh/match-tag: 'true'
For some reason, this situation leads to the Keel watcher just checking the image once on boot and never again, hence why the image never updated. I also have a feeling it might have to do with the fact that our pollSchedule is standard cron syntax, not the kind that keel uses, which reads the first value as seconds (instead of minutes). Perhaps a combination of both?
Anyway, we've fixed the pollSchedule string (by adding a 0 to the front) and applied it to deployment 2 so they're both consistent, and updates seem to be working again. I realise that this situation is pretty unique, but hopefully if anyone else has a similar environment and hits this problem, it gives you something to try.
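To make the seconds-vs-minutes difference concrete, here is a small sketch using the robfig/cron parser. Keel's docs describe pollSchedule with a leading seconds field (6 fields) rather than standard 5-field crontab syntax; whether keel's own parser rejects or misreads a 5-field expression is an assumption here, but the parsing difference looks like this.

```go
// Sketch of the pollSchedule gotcha with a seconds-first cron parser.
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	secondsFirst := cron.NewParser(
		cron.Second | cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow,
	)

	// A standard 5-field expression ("every minute") is one field short here.
	if _, err := secondsFirst.Parse("* * * * *"); err != nil {
		fmt.Println("5-field expression rejected:", err)
	}

	// Prefixing a 0 (run at second 0) makes it a valid 6-field expression,
	// which is essentially the fix described above.
	sched, err := secondsFirst.Parse("0 * * * * *")
	if err != nil {
		panic(err)
	}
	now := time.Date(2020, 12, 18, 12, 0, 30, 0, time.UTC)
	fmt.Println("next run:", sched.Next(now)) // 2020-12-18 12:01:00 +0000 UTC
}
```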