
Flatcar machines stuck in `Waiting...` instead of pulling new release(s)

Open tylerauerbeck opened this issue 1 year ago • 5 comments

Description

After flipping the pin to the latest stable release, a number of machines pulled down the new version. We paused downloads/reboots over the weekend and came back a few days later to resume the rollout, and now a number of machines are stuck with a Current Status of `Waiting...`. In the update_engine logs, the only relevant message I see is `omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.` When I line this up with the Nebraska logs for the same machineID, the only entry I see is `update complete.error`. Is there any additional logging I can turn up to determine the actual root of this problem? I've tried things like restarting update_engine, Nebraska, etc. to get things unstuck, without any luck.

Impact

Further downloads are not occurring and current_status is not accurately reflecting the status of this rollout.

Environment and steps to reproduce

  1. Set-up: Nebraska 2.9 attempting to roll out 3975.2.1
  2. Task: Flipped channel pin to 3975.2.1 to begin rollout
  3. Action(s): Updated the pin to begin the rollout, then paused updating machines over a span of 2+ days (resulting in machines staying in the Downloaded state for a period of time prior to attempting to continue the rollout)
  4. Error:
  • omaha_request_action.cc:629] HTTP reported success but Omaha reports an error.
  • update complete.error

Expected behavior

Nebraska accurately reflects the current status, and additional nodes continue to download the new release.

Additional information

N/A

tylerauerbeck avatar Oct 02 '24 15:10 tylerauerbeck

If it helps at all, the nodes in Waiting... seem to bunch up under On Hold in the dashboard for the channel.

tylerauerbeck avatar Oct 02 '24 19:10 tylerauerbeck

Thank you for taking the time to report the issue, @tylerauerbeck!

I noticed it was reported a couple of months ago, and I wanted to check in to see if it's still relevant.

I just started to learn how the update server works a few weeks ago, but I once had a similar experience when I was testing the update policy settings and reboot strategies: the status of the nodes got stuck in the same `Waiting...` status until I realized that I had turned off the automatic reboot strategy.

The following questions may help us investigate what happened:

  • Were there any nodes that were successfully updated before the "pause"?
  • What do you mean by pause? Turning off reboots or disabling updates from Nebraska UI?
  • The reboot strategy can be checked in the following files:
/usr/share/flatcar/update.conf
/etc/flatcar/update.conf <--- overrides the previous file's settings
  • update_engine_client -status is helpful to check the status of the update process on an individual node
  • What was the update policy set in Nebraska for the group?
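
For the reboot strategy check above, here is a minimal shell sketch that prints the effective value of a setting when a later file overrides an earlier one. The helper name `effective_setting` is hypothetical, and the `REBOOT_STRATEGY` key name and plain `KEY=VALUE` file layout are assumptions on my side, so please adjust them to whatever the files on the node actually contain.

```shell
# Print the effective value of a KEY=VALUE setting.
# Files are given in increasing order of precedence: a later file
# overrides an earlier one. Unreadable or missing files are skipped.
# NOTE: effective_setting is a hypothetical helper, not a Flatcar tool.
effective_setting() {
  key="$1"
  shift
  value=""
  for f in "$@"; do
    if [ -r "$f" ]; then
      # Take the last assignment of the key in this file, if any.
      v=$(sed -n "s/^${key}=//p" "$f" | tail -n 1)
      if [ -n "$v" ]; then
        value="$v"
      fi
    fi
  done
  printf '%s\n' "$value"
}

# Assumed key name; prints an empty line if neither file sets it.
effective_setting REBOOT_STRATEGY \
  /usr/share/flatcar/update.conf \
  /etc/flatcar/update.conf
```

Comparing this output with what `update_engine_client -status` reports on the same node should show whether a node is simply waiting for a reboot that its strategy never allows.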

ervcz avatar Jan 16 '25 14:01 ervcz

Going to dig the details out of the thread we had in Slack on this so we can revisit this if necessary.

tylerauerbeck avatar Jul 03 '25 03:07 tylerauerbeck

We can close this thread if your problem got solved. However, I'm still curious how you overcame the problem :D

ervcz avatar Jul 03 '25 08:07 ervcz

Well, I wouldn't say we overcame it so much as we now understand what causes it, and I think where we landed in the thread is that we could provide some better logging.

Thread for reference: https://kubernetes.slack.com/archives/C03GQ8B5XNJ/p1727813084833289

I'll try to find some time to dig the particulars out of that thread so we can follow up on it.

tylerauerbeck avatar Jul 04 '25 21:07 tylerauerbeck