fleet icon indicating copy to clipboard operation
fleet copied to clipboard

Failed orbit config fetcher middleware shouldn't stop all other middlewares from running

Open roperzh opened this issue 2 years ago • 12 comments

Fleet version: 4.32

🧑‍💻  Expected behavior

If one of the middlewares fail, all other middlewares still run

💥  Actual behavior

As soon as one middleware fails, all upstream middlewares fail too

More info

Every 30 seconds, Orbit does a POST /api/fleet/orbit/config to the Fleet server and gets a configuration, this configuration includes values like:

  • New agent flags
  • Nudge configuration
  • Notifications to run commands on the host
  • Etc.

Currently, after we get the response, many different operations are "chained" together in a middleware fashion:

flowchart
   A[POST  /api/fleet/orbit/config] --> B[Write agent flags] --> C[Launch Nudge] --> D[Run profiles renew command] --> E[Run disk encryption commands]

More information was described by Lucas here: https://github.com/fleetdm/fleet/pull/12654#pullrequestreview-1516289261

Idea: There should be one config getter routine (that calls /api/fleet/orbit/config) and then you can have many types like "Nudge", "SwiftDialog", "FlagUpdater" etc. registering to process the response (independently, not in a middleware fashion).

roperzh avatar Jul 06 '23 12:07 roperzh

@roperzh is this ticket closed with #12654? If so, I'm going to move it to the release column on the bug board.

RachelElysia avatar Jul 10 '23 17:07 RachelElysia

@RachelElysia no, this ticket is a follow-up to implement an idea from the other ticket.

roperzh avatar Jul 10 '23 17:07 roperzh

@lucasmrod @juan-fdz-hawa Putting this on your radar that we want to estimate moving the config getter routine from a middleware fashion to independent response processing on Wednesday.

RachelElysia avatar Jul 10 '23 17:07 RachelElysia

@lucasmrod "this is not a released bug as @roperzh patched the issue, push back for estimation for when we want to prioritize this"

@xpkoala " #11980 was fixed and should be tagged as a released-bug, so this might be more like a feature-request"

@lucasmrod "Small refactor on orbit but testing orbit is more tricky"

Estimate: 3 pts including testing

Potential engineers to assign: @lucasmrod @juan-fdz-hawa

RachelElysia avatar Jul 12 '23 18:07 RachelElysia

Let's treat this as a bug since Fleet is not behaving the way we want it to.

lukeheath avatar Jul 12 '23 21:07 lukeheath

@roperzh Could you please describe the impact on customers? How frequent is this (guesstimate), if happens what's the outcome?

sharon-fdm avatar Aug 02 '23 16:08 sharon-fdm

@sharon-fdm @lukeheath sorry it took me so long to answer, I was OOO and I'm catching up.

Let me prefix this by saying that I don't think this should be a bug, it should be an ~engineering-initiated issue. Nothing is broken and needs to be fixed, but it would be really good to get this in place to prevent future headaches. Apologies for mislabeling this in the first place (I created the issue in a hurry)

I will update the issue description with what I'm writing below, but to answer @sharon-fdm's questions:

Every 30 seconds, Orbit does a POST /api/fleet/orbit/config to the Fleet server and gets a configuration, this configuration includes values like:

  • New agent flags
  • Nudge configuration
  • Notifications to run commands on the host
  • Etc.

Currently, after we get the response, many different operations that make use of the configuration are "chained" together in a middleware fashion:

flowchart
   A[POST  /api/fleet/orbit/config] --> B[Write agent flags] --> C[Launch Nudge] --> D[Run profiles renew command] --> E[Run disk encryption commands]

The issue is: If one of the operations in the chain fails, all the following operations are not executed. The proposed idea is: refactor the code so if any of the items fails, all other items are still executed. Impact on customers: will depend on how early in the chain is the failure, if it's early, many of the operations that orbit performs will not be executed. How frequent is this (guesstimate): let's say 1 every 3 releases? really hard to tell honestly. I think it's the wrong way to think about this issue in particular.

roperzh avatar Aug 09 '23 02:08 roperzh

@roperzh Thanks for the clarification. Because our goal is to provide resilient services and this issue blocks our goal, it's fair to call it a bug in how it was implemented.

lukeheath avatar Aug 11 '23 00:08 lukeheath

Bug has aged out. Moved back to drafting

ireedy avatar Sep 05 '23 14:09 ireedy

@roperzh This ticket is going to go to feature fest and we'll treat it as a story. If there's any impact of not fixing this, would you please update the description to describe that? Trying to understand business impact.

lukeheath avatar Sep 11 '23 19:09 lukeheath

@roperzh please bring to FF when it's ready.

zhumo avatar Sep 19 '23 16:09 zhumo

@roperzh Another impact is that some of those wrappers affect the latency of the GetConfig original call. We haven't used the same approach for all wrappers, originally it was not intended to affect the returned error (and it was affecting only minimally the latency), e.g. with the run scripts wrapper: https://github.com/fleetdm/fleet/blob/557c202b03522b470661540e253ee48e930bad3c/orbit/pkg/update/notifications.go#L369-L399

What it does is if the wrapped GetConfig succeeds (which should be basically if config got refreshed successfully), it looks for scripts to run and if it has one, launches a goroutine to do, not affecting the call. Not sure if such an approach is possible for all wrappers (especially the goroutine), but not affecting the returned error should be possible.

But we're starting to have a good number of those middleware stuff, maybe it's time we change the approach altogether (e.g. only do the real GetConfig and then dispatch based on the notifications).

mna avatar Apr 23 '24 12:04 mna

Orbit's chain breaks, halts, Fleet's process finds new paths, Efficiency exalts.

fleet-release avatar May 17 '24 14:05 fleet-release