With many deployments, keel can hit the event buffer size limit:
level=info msg="event channel is full, len: 128, cap: 128" context=buffer
We might want to remove the buffer altogether, or add some batching to it to absorb bursts of incoming changes.
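To illustrate what fills up here, a minimal Go sketch of a bounded event buffer (the `Event` type, `newBuffer` and `submit` are made-up names for illustration, not keel's actual code): once nothing drains the channel, the non-blocking send finds it full and only logs a warning like the one above.

```go
// Illustrative only: not keel's implementation. Shows how a bounded buffer
// ends up dropping events once the consumer falls behind.
package main

import "log"

type Event struct{ Image, Tag string }

type buffer struct {
	events chan Event
}

func newBuffer(size int) *buffer {
	return &buffer{events: make(chan Event, size)}
}

// submit tries a non-blocking send; when the channel is already full,
// the event is not queued and only a warning is logged.
func (b *buffer) submit(e Event) {
	select {
	case b.events <- e:
	default:
		log.Printf("event channel is full, len: %d, cap: %d", len(b.events), cap(b.events))
	}
}

func main() {
	b := newBuffer(2) // tiny buffer so the overflow is easy to trigger
	for i := 0; i < 4; i++ {
		// With nothing draining b.events, the 3rd and 4th submits log the warning.
		b.submit(Event{Image: "example/app", Tag: "latest"})
	}
}
```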
I see that keel stopped updating some of my deployments, while others are updated successfully. I see exactly that message in the logs; could it be the reason?
@rusenask I see the same message in my setup:
- Is there an interim solution?
- Does it help if we run multiple keel containers to mitigate this? I am currently running a single keel container.
@rusenask Same issue on my side; running multiple keel pods shouldn't solve it since I'm using the polling approach. I'm looking into making the queue size configurable, do you have any other options?
Hi, I think we could just remove the buffer altogether; it should be fine without it. Feel free to create a PR :) I have been really swamped for the last few months, so I can't do much support on keel.
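For comparison, removing the buffer would mean a plain blocking send on an unbuffered channel, so producers wait for the consumer instead of dropping events; a minimal sketch of that behaviour (again illustrative, not the actual keel code):

```go
// Sketch of the "no buffer" suggestion: with an unbuffered channel the send
// blocks until a consumer receives it, so events are never dropped; the
// producer just experiences back-pressure instead.
package main

import "fmt"

type Event struct{ Image, Tag string }

func main() {
	events := make(chan Event) // unbuffered: every send waits for a receiver

	go func() {
		// Blocks here until main receives below; nothing is silently lost.
		events <- Event{Image: "example/app", Tag: "latest"}
	}()

	e := <-events
	fmt.Println("processed", e.Image, e.Tag)
}
```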
Edit: Keel seems to be working fine for us again, so I'm going to put this down to something else in our environment. We'll keep investigating if it crops up again.
After a few months of everything working fine we just saw this issue again.
Interestingly, Keel eventually picked up on the image digest change ~37 hours later and successfully updated the deployments then. Although events are re-submitted to the channel, is it possible that some events take priority over others? Or is something else keeping the event channel from being processed sequentially?
We've bumped up to Kubernetes 1.17.5 since the first report and have 301 pods total in the cluster, but other than that not much has changed.
I am having the same issue. I have deployed keel locally (dev environment) and it works fine. I added it to one of our QA clusters (where there are many more pods running), but it never seems to get the webhook I am sending (native). I am running the 'latest' tag.
How can I help debug this? Is there no logging for the web service? I can't even tell if it's hitting the webhook, though the admin page at the same URL works, so I assume it's not that.
When you send the webhook directly through curl or any REST client, does it return an error or just a 200 status?
@rusenask 200
using curl:
...
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
* We are completely uploaded and fine
< HTTP/2 200
< date: Fri, 18 Dec 2020 18:25:24 GMT
< content-length: 0
< access-control-allow-headers: Accept, Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization
< access-control-allow-methods: POST, GET, OPTIONS, PUT, DELETE
< access-control-allow-origin: *
< access-control-expose-headers: Authorization
< access-control-request-headers: Authorization
<
...
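For reference, the equivalent request built programmatically would look roughly like this. This is a sketch assuming the native webhook endpoint path from keel's docs (/v1/webhooks/native); the service address, image name and tag below are placeholders for your own setup.

```go
// Hypothetical sketch of sending keel's native webhook from Go instead of curl.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type nativeWebhook struct {
	Name string `json:"name"` // image repository name
	Tag  string `json:"tag"`  // new tag to roll out
}

func main() {
	payload, err := json.Marshal(nativeWebhook{Name: "example/my-app", Tag: "1.2.3"})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(
		"http://keel.keel:9300/v1/webhooks/native", // placeholder in-cluster address
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// As seen above, a 200 here only means the endpoint accepted the request;
	// it does not by itself confirm that a tracked deployment was matched.
	fmt.Println("status:", resp.StatusCode)
}
```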
After a good day with DEBUG=true set, I think I finally found the reason our deployment isn't updating. It turns out the event buffer full message was a red herring for us.
We have two deployments that use the same image; one of them has a custom pollSchedule so it doesn't update while it's taking backups, and the other can update whenever.
The annotations on deployment 1:
annotations:
  keel.sh/trigger: poll
  keel.sh/policy: force
  keel.sh/match-tag: "true"
  keel.sh/pollSchedule: "*/1 19-23,0-14 * *"
vs. the annotations on deployment 2:
annotations:
  keel.sh/trigger: poll
  keel.sh/policy: force
  keel.sh/match-tag: 'true'
For some reason, this situation leads to the Keel watcher just checking the image once on boot and never again, hence why the image never updated. I also have a feeling it might have to do with the fact that our pollSchedule is standard cron syntax, not the kind that keel uses, which reads the first value as seconds (instead of minutes). Perhaps a combination of both?
Anyway, we've fixed the pollSchedule string (by adding a 0 to the front) and applied it to deployment 2 so they're both consistent, and updates seem to be working again. I realise that this situation is pretty unique, but hopefully if anyone else has a similar environment and hits this problem, it gives you something to try.
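To make the seconds-vs-minutes difference concrete, here is a small sketch using the robfig/cron parser. Keel's docs describe pollSchedule with a leading seconds field (6 fields) rather than standard 5-field crontab syntax; whether keel's own parser rejects or misreads a 5-field expression is an assumption here, but the parsing difference looks like this.

```go
// Sketch of the pollSchedule gotcha with a seconds-first cron parser.
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	secondsFirst := cron.NewParser(
		cron.Second | cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow,
	)

	// A standard 5-field expression ("every minute") is one field short here.
	if _, err := secondsFirst.Parse("* * * * *"); err != nil {
		fmt.Println("5-field expression rejected:", err)
	}

	// Prefixing a 0 (run at second 0) makes it a valid 6-field expression,
	// which is essentially the fix described above.
	sched, err := secondsFirst.Parse("0 * * * * *")
	if err != nil {
		panic(err)
	}
	now := time.Date(2020, 12, 18, 12, 0, 30, 0, time.UTC)
	fmt.Println("next run:", sched.Next(now)) // 2020-12-18 12:01:00 +0000 UTC
}
```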