fleet
Policy automations: run script
Goal
User story |
---|
As a Fleet user, |
I want a host failing a policy to trigger a script run on that host |
so that I can automate host compliance w/o having to use a third-party automation tool (ex. Tines). |
Context
- Product designer: @marko-lisica
Changes
Product
- [ ] UI changes: TODO
- [ ] CLI usage changes: TODO
- [ ] REST API changes: TODO
- [ ] Permissions changes: TODO
- [ ] Outdated documentation changes: TODO
- [ ] Changes to paid features or tiers: TODO
Engineering
- [ ] Database schema migrations: TODO
- [ ] Load testing: TODO
ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".
QA
Risk assessment
- Requires load testing: TODO
- Risk level: Low / High TODO
- Risk description: TODO
Manual testing steps
- Step 1
- Step 2
- Step 3
Testing notes
Confirmation
- [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
- [ ] QA (@____): Added comment to user story confirming successful completion of QA.
I would like to execute a script automatically when a policy fails instead of triggering a webhook.
@dherder we'll get to this but I think there's an iteration or two before we build it.
Currently, the customer can consume the failing policies webhook in Tines and execute a script using the Fleet API, right?
I think the first iteration will be sending a webhook per host that includes all of that host's failing policies. I think this simplifies the Tines story. The Tines story becomes this:
- Receive new webhook that includes a specific host's failing policies
- Loop through policies and take remediation action specific to each failing policy (via script or some other tool)
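The two steps above could be sketched roughly as follows. This is only an illustration: the payload shape (`host`, `failing_policies`, `id`) and the `REMEDIATIONS` mapping are assumptions made up for the example, not Fleet's actual webhook format or API.

```python
from typing import Callable, Dict, List


def run_encryption_script(host_id: int) -> str:
    # Placeholder for "run a script via some tool"; returns a description only.
    return f"queued encryption script for host {host_id}"


def run_firewall_script(host_id: int) -> str:
    return f"queued firewall script for host {host_id}"


# Hypothetical mapping from policy ID to the remediation for that policy.
REMEDIATIONS: Dict[int, Callable[[int], str]] = {
    1: run_encryption_script,
    2: run_firewall_script,
}


def handle_host_webhook(payload: dict) -> List[str]:
    """Handle one per-host webhook listing that host's failing policies."""
    host_id = payload["host"]["id"]
    results = []
    # Loop through the failing policies and remediate each one separately.
    for policy in payload["failing_policies"]:
        action = REMEDIATIONS.get(policy["id"])
        if action is None:
            results.append(f"no remediation for policy {policy['id']}")
        else:
            results.append(action(host_id))
    return results
```

Because each webhook covers a single host, the handler needs no grouping step before it can act.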
@noahtalerman would also be good to get a Fleet desktop notification on failed policies similar to https://github.com/fleetdm/fleet/issues/16264
> would also be good to get a Fleet desktop notification on failed policies

@dherder the current plan is to solve the problem of notifying the end user by getting in their calendar: #17230
@noahtalerman I see the calendar remediation as a separate issue. It works great when you want an end user to do a thing like update an app or perform an OS update. Where it doesn't work so great is if you want the remediation to be "execute a root level script", where if the user is a standard user, they just simply wouldn't be able to do it.
> Where it doesn't work so great is if you want the remediation to be "execute a root level script", where if the user is a standard user, they just simply wouldn't be able to do it.

@dherder I think the first iteration of "Fleet in your calendar" will address this.
The high level flow of the feature:
- IT admin chooses which policies trigger calendar events
- Calendar event is created when end user fails at least one of these policies
- Webhook is fired when the calendar event starts
- Automation tool (ex. Tines) receives the webhook and runs auto-remediation (ex. script)
Check out the user story for more details on the flow: https://github.com/fleetdm/fleet/issues/17230
What do you think?
Also, we didn't have room for this "Auto remediation of policy failure" story in the current design sprint (4.48).
@noahtalerman it still does not solve the problem of 3rd party solution integration that is a blocker for some of our current customers but especially prospective customers.
The expectation is that if Fleet has the script server-side & Fleet has a policy to check for a client state or attribute, that it would also have a way of executing the script on a policy failure without 3rd party integration required.
Couldn't Fleet just send the policy failure webhook to its own API endpoint for executing a script? Is there a technical concern like load on server due to script execution? Thanks.
cc @dherder @willmayhone88 @spokanemac @ksatter @pacamaster
@noahtalerman I presented the option of remediation through 3rd party automation tools today (IT buying scenario) and the feedback was that it would be a blocker to move forward with Fleet.
> Couldn't Fleet just send the policy failure webhook to its own API endpoint for executing a script? Is there a technical concern like load on server due to script execution? Thanks.

@nonpunctual no technical concern that I know of. It's just a matter of priorities/timing. Let's chat about it at feature fest!
csa:20240530
I'd like to see something like this with a drop-down next to each policy.
Hey @dherder I updated this issue to user story format and moved your original issue description below for safekeeping. cc @marko-lisica
Problem
When a policy fails, Fleet can currently consume a webhook and send a response about the failures of the policy. Fleet can also provide guidance for the end user when a policy fails via Fleet Desktop.
Since we now have script execution capabilities, as an IT admin, I would like to execute a script automatically when a policy fails instead of triggering a webhook.
Potential solutions
In the automations dialog, have an extra option to "Run script".
Hey @zayhanlon and @dherder, we're dropping this one. The plan is to bring this one to the design sprint after the next. For more context see this doc.
Hey @randy-fleet chatted w/ @lukeheath and we decided to pull this one into the design sprint.
Like we discussed here, it's a requirement for #19372.
Design-wise I think we can borrow most of the UI/UX from the #19551 story.
If it's helpful, please feel free to throw some time on my calendar for Tues to chat.
Hey @dherder and @zayhanlon, heads up this didn't make the 3 week drafting timeline. We left it on the drafting board.
@lukeheath I think we want to bring this one through expedited drafting so that we can start working on it in the upcoming engineering sprint.
Hey @randy-fleet, I chatted w/ @sharon-fdm and @lukeheath and we decided to send this one to "Ready for spec" so we can unblock the #g-endpoint-ops team.
I unassigned you and moved your screens to the ready page in Figma.
The plan is for endpoint ops to estimate and start building w/ these screens + understanding that we want this feature to work like "Policy automations: install software" (#19551). Except now we're triggering script runs.
Randy, if you have any concerns w/ this plan, please let me know :)
@noahtalerman @marko-lisica a few questions that came up when this was estimated:
- policy dropdown only shows compatible scripts, or all?
- what does the global activity look like?
- should we validate a script is selected if the checkbox is checked before allowing save?
- is there an API design PR?
- OK if script runs only the first time the policy fails, and only if it changes from passing to failing?
Hey @rachaelshaw, thanks for fielding these during estimation!
> policy dropdown only shows compatible scripts, or all?

What do we mean by "compatible"? My guess is we're talking about platforms. If a policy's platform is macOS, do we only show scripts for macOS?
If that's right, I think in this pass let's always show all scripts. I think this is consistent w/ showing all software for the install software automation.
> what does the global activity look like?

If I'm understanding correctly, this is about generating activities when policy automations are updated. I think let's push that to a separate story to move quickly here.
@marko-lisica I added this to the "Update global activity feed" story (#21681) so we don't forget to get to it.
> should we validate a script is selected if the checkbox is checked before allowing save?

What do we do for the install software policy automation? I think let's start w/ being consistent.
> is there an API design PR?

No API design PR. The plan is to get the engineer's help on API design. I moved this checkbox along w/ the other TODOs to the engineering section.
> OK if script runs only the first time the policy fails, and only if it changes from passing to failing?

What do we do for the install software policy automation? I think let's start w/ being consistent.
cc @lucasmrod @sharon-fdm
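For reference, the "only when it changes from passing to failing" behavior asked about above could look something like the sketch below. It is one possible reading, not the confirmed behavior: the function name is hypothetical, and treating a host with no previous result as eligible is an assumption.

```python
from typing import Optional


def should_trigger_script(previous_passing: Optional[bool],
                          current_passing: bool) -> bool:
    """Return True only when a host newly fails a policy.

    previous_passing is None when the host has no earlier result for the
    policy (assumed here to count as a first failure).
    """
    if current_passing:
        # A passing result never triggers the script.
        return False
    # Trigger on first-ever failure or on a passing -> failing flip,
    # but not when the host was already failing.
    return previous_passing is None or previous_passing is True
```

Under this reading, a host that keeps failing on every check-in triggers the script once rather than on every report.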
Given the ability to edit scripts via GitOps (the UI doesn't support this), should policies be reset when scripts are edited?
By way of comparison, policies are reset when "install this if the policy fails" is either added or removed, or if the software title referenced changes, but not if an installer itself is edited. The lack of change on installer edit might just be a miss on the original implementation though, since software installer edit and software installs on policy automation were in development at the same time.
So maybe we don't reset policies if the script gets edited via GitOps for now, then revise behavior for both installers and scripts at the same time as a fast follow?
@noahtalerman @lukeheath @marko-lisica I think this is a great call-out @iansltx - how does everyone feel about putting this in the admin's hands with a banner / popup type thing, e.g.
"You are editing a script which is associated with a Fleet Policy. Editing the script may change the behavior of the Policy, causing it to not generate a failure event. Based on the script changes, would you like to reset the Policy?"
Right now we don't allow editing scripts via the UI so there's no place to put that banner. The only place that allows script edits is GitOps, so if we're putting a banner anywhere it would be in the guide about this.
@iansltx thanks for calling this out! And thanks for the @ mention @nonpunctual. I wouldn't have seen this otherwise.
> maybe we don't reset policies if the script gets edited via GitOps for now, then revise behavior for both installers and scripts at the same time as a fast follow?

I think it's worth coming up w/ a solution now.
@marko-lisica can you please bring this through expedited drafting? See Brock's proposed solution here.
During design review, let's discuss your proposed solution and whether it makes sense to address it in this sprint v. in a fast follow (later iteration).
Well, @iansltx ~~this might be a good time to do~~ gentle reminder: https://github.com/fleetdm/fleet/issues/19925 :) @noahtalerman
@noahtalerman We don't support this for global policies, right?
Scripts are team-scoped, as are software install policy automations, so I'm assuming we can limit script executions to team policies, which allows us to use a script ID (and filter scripts for automation to the specific team) rather than doing something more indirect.
This means that inherited policies will not support script automations, but that's consistent with software install policy automations (just checked).
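A rough sketch of the team-scoping check described above, assuming hypothetical `Policy` and `Script` records where `team_id` is `None` for a global ("All teams") policy. The field names and error strings are illustrative, not Fleet's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    id: int
    team_id: Optional[int]  # None = global policy (assumption for this sketch)


@dataclass
class Script:
    id: int
    team_id: int


def validate_script_automation(policy: Policy, script: Script) -> Optional[str]:
    """Return an error message, or None if the script may be attached."""
    if policy.team_id is None:
        # Global (and therefore inherited) policies get no script automation.
        return "script automations are not supported for global policies"
    if script.team_id != policy.team_id:
        # Scripts are team-scoped, so the script must live on the policy's team.
        return "script must belong to the same team as the policy"
    return None
```

Referencing the script by ID works here because both records are known to belong to the same team once validation passes.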
Related: https://github.com/fleetdm/fleet/issues/17993 https://github.com/fleetdm/fleet/issues/19925
@iansltx it makes sense to keep it consistent w/ the install software automation.
In the UI I think let's show a similar disabled state w/ tooltip when managing automations for "All teams":
Please let me know if that^ is missing in Figma. I can add it quickly!
@noahtalerman I think it needs to be a dev note in Figma, as the Figma designs only show the team-specific UX.
@noahtalerman Based on the comments above, I just added the copy changes that we discussed during the review to a scratchpad here. I also created a PR to update the automatic install with policy automation guide here.
Could you take a look? What do you think?
@nonpunctual @iansltx For more context: we decided that it makes sense to reset the policy and keep it consistent with software install automation. For ex., if a user is testing a script and it's not working, once they upload a new one and tie it to a policy it will reset the count and it will run again on the failing hosts.
As the script can be edited only via GitOps, we don't have a way to show a warning. Once we get to FR to enable editing in the UI, we'll want to show the error message @nonpunctual proposed above.
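The reset behavior described above could be sketched like this. It assumes, for illustration only, that a replaced script shows up under a different script ID and that per-host results live in a simple in-memory map; `PolicyState` and its methods are hypothetical names, not Fleet code.

```python
from typing import Dict, Optional


class PolicyState:
    """Tracks which script a policy runs and each host's last result."""

    def __init__(self) -> None:
        self.script_id: Optional[int] = None
        self.passing_by_host: Dict[int, bool] = {}

    def record_result(self, host_id: int, passing: bool) -> None:
        self.passing_by_host[host_id] = passing

    def set_script(self, script_id: Optional[int]) -> bool:
        """Attach, swap, or detach the script; return True if a reset happened."""
        if script_id == self.script_id:
            return False  # Same script: keep existing results.
        self.script_id = script_id
        # Clearing stored results means hosts that still fail are treated as
        # new failures, so the new script runs on them.
        self.passing_by_host.clear()
        return True
```

With results cleared, a host that was already failing is seen as failing for the first time on its next check-in, which is what lets the replacement script run.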
@noahtalerman I think this makes sense. I didn't understand that until the comments about only managing scripts through GitOps. I have linked other issues related to script UI things & I think it will be great when we get to implement them. :)