fleet
fleet copied to clipboard
Failed orbit config fetcher middleware shouldn't stop all other middlewares from running
Fleet version: 4.32
🧑💻 Expected behavior
If one of the middlewares fail, all other middlewares still run
💥 Actual behavior
As soon as one middleware fails, all upstream middlewares fail too
More info
Every 30 seconds, Orbit does a POST /api/fleet/orbit/config to the Fleet server and gets a configuration, this configuration includes values like:
- New agent flags
- Nudge configuration
- Notifications to run commands on the host
- Etc.
Currently, after we get the response, many different operations are "chained" together in a middleware fashion:
flowchart
A[POST /api/fleet/orbit/config] --> B[Write agent flags] --> C[Launch Nudge] --> D[Run profiles renew command] --> E[Run disk encryption commands]
More information was described by Lucas here: https://github.com/fleetdm/fleet/pull/12654#pullrequestreview-1516289261
Idea: There should be one config getter routine (that calls /api/fleet/orbit/config) and then you can have many types like "Nudge", "SwiftDialog", "FlagUpdater" etc. registering to process the response (independently, not in a middleware fashion).
@roperzh is this ticket closed with #12654? If so, I'm going to move it to the release column on the bug board.
@RachelElysia no, this ticket is a follow-up to implement an idea from the other ticket.
@lucasmrod @juan-fdz-hawa Putting this on your radar that we want to estimate moving the config getter routine from a middleware fashion to independent response processing on Wednesday.
@lucasmrod "this is not a released bug as @roperzh patched the issue, push back for estimation for when we want to prioritize this"
@xpkoala " #11980 was fixed and should be tagged as a released-bug, so this might be more like a feature-request"
@lucasmrod "Small refactor on orbit but testing orbit is more tricky"
Estimate: 3 pts including testing
Potential engineers to assign: @lucasmrod @juan-fdz-hawa
Let's treat this as a bug since Fleet is not behaving the way we want it to.
@roperzh Could you please describe the impact on customers? How frequent is this (guesstimate), if happens what's the outcome?
@sharon-fdm @lukeheath sorry it took me so long to answer, I was OOO and I'm catching up.
Let me prefix this by saying that I don't think this should be a bug, it should be an ~engineering-initiated issue. Nothing is broken and needs to be fixed, but it would be really good to get this in place to prevent future headaches. Apologies for mislabeling this in the first place (I created the issue in a hurry)
I will update the issue description with what I'm writing below, but to answer @sharon-fdm's questions:
Every 30 seconds, Orbit does a POST /api/fleet/orbit/config to the Fleet server and gets a configuration, this configuration includes values like:
- New agent flags
- Nudge configuration
- Notifications to run commands on the host
- Etc.
Currently, after we get the response, many different operations that make use of the configuration are "chained" together in a middleware fashion:
flowchart
A[POST /api/fleet/orbit/config] --> B[Write agent flags] --> C[Launch Nudge] --> D[Run profiles renew command] --> E[Run disk encryption commands]
The issue is: If one of the operations in the chain fails, all the following operations are not executed. The proposed idea is: refactor the code so if any of the items fails, all other items are still executed. Impact on customers: will depend on how early in the chain is the failure, if it's early, many of the operations that orbit performs will not be executed. How frequent is this (guesstimate): let's say 1 every 3 releases? really hard to tell honestly. I think it's the wrong way to think about this issue in particular.
@roperzh Thanks for the clarification. Because our goal is to provide resilient services and this issue blocks our goal, it's fair to call it a bug in how it was implemented.
Bug has aged out. Moved back to drafting
@roperzh This ticket is going to go to feature fest and we'll treat it as a story. If there's any impact of not fixing this, would you please update the description to describe that? Trying to understand business impact.
@roperzh please bring to FF when it's ready.
@roperzh Another impact is that some of those wrappers affect the latency of the GetConfig original call. We haven't used the same approach for all wrappers, originally it was not intended to affect the returned error (and it was affecting only minimally the latency), e.g. with the run scripts wrapper: https://github.com/fleetdm/fleet/blob/557c202b03522b470661540e253ee48e930bad3c/orbit/pkg/update/notifications.go#L369-L399
What it does is if the wrapped GetConfig succeeds (which should be basically if config got refreshed successfully), it looks for scripts to run and if it has one, launches a goroutine to do, not affecting the call. Not sure if such an approach is possible for all wrappers (especially the goroutine), but not affecting the returned error should be possible.
But we're starting to have a good number of those middleware stuff, maybe it's time we change the approach altogether (e.g. only do the real GetConfig and then dispatch based on the notifications).
Orbit's chain breaks, halts, Fleet's process finds new paths, Efficiency exalts.