Flatcar machines stuck in `Waiting...` instead of pulling new release(s)
Description
After flipping the channel pin to the latest stable, a number of machines pulled down the new release. We had paused downloads/reboots over the weekend, and when we came back a few days later to resume, a number of machines were stuck with a Current Status of `Waiting...`. The only relevant mention in the update_engine logs is `omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.` Lining this up with the Nebraska logs for the matching machine ID, the only entry I see is `update complete.error`. Is there any additional logging I can turn up to determine the actual root of this problem? I've tried things like restarting update_engine, Nebraska, etc. to see if I can get things unstuck, without any luck.
Impact
Further downloads are not occurring, and the reported current status does not accurately reflect the state of this rollout.
Environment and steps to reproduce
- Set-up: Nebraska 2.9, attempting to roll out 3975.2.1
- Task: Flipped channel pin to `3975.2.1` to begin rollout
- Action(s): Updated the pin to begin the rollout, then paused updates of machines over a span of 2+ days (resulting in machines staying in the `Downloaded` state for a period of time prior to attempting to continue the rollout)
- Error:
  - update_engine: `omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.`
  - Nebraska (for the matching machine ID): `update complete.error`
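For reference, this is roughly how I've been pulling these on an affected node (a minimal sketch assuming the standard Flatcar `update-engine` systemd unit; adjust the unit name and filter to your setup):

```sh
# Filter update_engine's journal for Omaha request/response lines
# (unit name assumed to be update-engine).
journalctl -u update-engine --since "2 days ago" | grep -i omaha

# Current state of the update engine on this node.
update_engine_client -status
```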
Expected behavior
Nebraska accurately reflecting current status and additional nodes continuing to download new release
Additional information
N/A
If it helps at all, the nodes in Waiting... seem to bunch up under On Hold in the dashboard for the channel.
Thank you for taking the time to report the issue, @tylerauerbeck!
I noticed it was reported a couple of months ago, and I wanted to check in to see if it's still relevant.
I only started learning how the update server works a few weeks ago, but I once had a similar experience while testing update policy settings and reboot strategies: the nodes got stuck in the same `Waiting...` status until I realized I had turned off the automatic reboot strategy.
The following questions may help us investigate what happened:
- Were there any nodes that successfully got updated before the "pause"?
- What do you mean by pause? Turning off reboots, or disabling updates from the Nebraska UI?
- The reboot strategy can be checked in the following files:
  - `/usr/share/flatcar/update.conf`
  - `/etc/flatcar/update.conf` (overrides the previous file's settings)
- `update_engine_client -status` is helpful to check the status of the update process on an individual node (see the sketch after this list)
- What was the update policy set in Nebraska for the group?
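Not authoritative, but as a quick illustration of what I would check on one of the stuck nodes (assuming a default Flatcar setup; paths and values may differ):

```sh
# Effective reboot strategy: /etc/flatcar/update.conf overrides
# /usr/share/flatcar/update.conf.
grep REBOOT_STRATEGY /etc/flatcar/update.conf /usr/share/flatcar/update.conf 2>/dev/null

# Ask update_engine directly; a node that has downloaded an update but is
# waiting on a reboot typically reports CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT.
update_engine_client -status
```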
Going to dig the details out of the thread we had in Slack on this so we can revisit this if necessary.
We can close this issue if your problem got solved. However, I'm still curious how you overcame the problem :D
Well, I wouldn't say we overcame it so much as we now understand what causes it, and I think where we landed in the thread is that we could provide some better logging.
Thread for reference: https://kubernetes.slack.com/archives/C03GQ8B5XNJ/p1727813084833289
I'll try to find some time to dig the particulars out of that thread so we can follow up on it.