Flatcar machines stuck in `Waiting...` instead of pulling new release(s)
Description
After flipping the channel pin to the latest stable, a number of machines pulled down the new release. We had paused downloads/reboots over the weekend, and when we came back a few days later to resume, a number of machines were stuck with a Current Status of `Waiting...`. The only relevant mention in the update_engine logs is `omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.` Lining this up with the Nebraska logs for the matching machine ID, the only entry I see is `update complete.error`. Is there any additional logging I can turn up to determine the actual root of this problem? I've tried things like restarting update_engine, Nebraska, etc. to see if I can get things unstuck, without any luck.
Impact
Further downloads are not occurring, and the reported current status does not accurately reflect the state of this rollout.
Environment and steps to reproduce
- Set-up: Nebraska 2.9, attempting to roll out 3975.2.1
- Task: Flipped channel pin to `3975.2.1` to begin rollout
- Action(s): Updated the pin to begin the rollout, then paused updates of machines over a span of 2+ days (resulting in machines staying in the `Downloaded` state for a period of time prior to attempting to continue the rollout)
- Error:
  - update_engine: `omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.`
  - Nebraska (for the matching machine ID): `update complete.error`
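For reference, this is roughly how I've been pulling these on an affected node (a minimal sketch assuming the standard Flatcar `update-engine` systemd unit; adjust the unit name and filter to your setup):

```sh
# Filter update_engine's journal for Omaha request/response lines
# (unit name assumed to be update-engine).
journalctl -u update-engine --since "2 days ago" | grep -i omaha

# Current state of the update engine on this node.
update_engine_client -status
```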
Expected behavior
Nebraska accurately reflecting current status and additional nodes continuing to download new release
Additional information
N/A
If it helps at all, the nodes in Waiting... seem to bunch up under On Hold in the dashboard for the channel.
Thank you for taking the time to report the issue, @tylerauerbeck!
I noticed it was reported a couple of months ago, and I wanted to check in to see if it's still relevant.
I only started learning how the update server works a few weeks ago, but I once had a similar experience while testing update policy settings and reboot strategies: the nodes got stuck in the same `Waiting...` status until I realized I had turned off the automatic reboot strategy.
The following questions may help us investigate what happened:
- Were there any nodes that successfully got updated before the "pause"?
- What do you mean by pause? Turning off reboots, or disabling updates from the Nebraska UI?
- The reboot strategy can be checked in the following files:
  - `/usr/share/flatcar/update.conf`
  - `/etc/flatcar/update.conf` (overrides the previous file's settings)
- `update_engine_client -status` is helpful to check the status of the update process on an individual node (see the sketch after this list)
- What was the update policy set in Nebraska for the group?
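Not authoritative, but as a quick illustration of what I would check on one of the stuck nodes (assuming a default Flatcar setup; paths and values may differ):

```sh
# Effective reboot strategy: /etc/flatcar/update.conf overrides
# /usr/share/flatcar/update.conf.
grep REBOOT_STRATEGY /etc/flatcar/update.conf /usr/share/flatcar/update.conf 2>/dev/null

# Ask update_engine directly; a node that has downloaded an update but is
# waiting on a reboot typically reports CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT.
update_engine_client -status
```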
Going to dig the details out of the thread we had in Slack on this so we can revisit this if necessary.
We can close this issue if your problem got solved. However, I'm still curious how you overcame the problem :D
Well, I wouldn't say we overcame it so much as we now understand what causes it, and I think where we landed in the thread is that we could provide some better logging.
Thread for reference: https://kubernetes.slack.com/archives/C03GQ8B5XNJ/p1727813084833289
I'll try to find some time to dig the particulars out of that thread so we can follow up on it.