fleet Policy automations: run script

Goal

User story
As a Fleet user,
I want a host failing a policy to trigger a script run on that host
so that I can automate host compliance w/o having to use a third-party automation tool (ex. Tines).

Context

Product designer: @marko-lisica

Changes

Product

[ ] UI changes: TODO
[ ] CLI usage changes: TODO
[ ] REST API changes: TODO
[ ] Permissions changes: TODO
[ ] Outdated documentation changes: TODO
[ ] Changes to paid features or tiers: TODO

Engineering

[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

Feb 23 '24 18:02 dherder

I would like to execute a script automatically when a policy fails instead of triggering a webhook.

@dherder we'll get to this but I think there's an iteration or two before we build it.

Currently, the customer can consume the failing policies webhook in Tines and execute a script using the Fleet API, right?

I think the first iteration will be sending a webhook per host that includes all of that host's failing policies. I think this simplifies the Tines story. The Tines story becomes this:

Receive new webhook that includes a specific host's failing policies
Loop through policies and take remediation action specific to each failing policy (via script or some other tool)
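The two steps above can be sketched as a small handler. This is only an illustration: the per-host payload shape (`host`, `failing_policies`), the policy names, and the script names are all hypothetical, since the per-host webhook is not specified in this thread. The remediation is a stub that returns a description instead of calling any real tool.

```python
from typing import Callable, Dict, List

# A remediation takes a host ID and returns a description of what it did.
Remediation = Callable[[int], str]


def run_script(script_name: str) -> Remediation:
    """Build a stub remediation that 'runs' script_name on a host."""
    def action(host_id: int) -> str:
        return f"run {script_name} on host {host_id}"
    return action


# Hypothetical mapping from policy name to its remediation action.
REMEDIATIONS: Dict[str, Remediation] = {
    "Disk encryption enabled": run_script("enable-disk-encryption.sh"),
    "Firewall enabled": run_script("enable-firewall.sh"),
}


def handle_host_webhook(payload: dict) -> List[str]:
    """Loop through one host's failing policies and remediate each known one."""
    host_id = payload["host"]["id"]
    actions = []
    for policy in payload["failing_policies"]:
        remediation = REMEDIATIONS.get(policy["name"])
        if remediation is None:
            continue  # no remediation configured for this policy
        actions.append(remediation(host_id))
    return actions
```

Policies without a configured remediation are skipped rather than treated as errors, so new policies can be added before their remediation exists.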

Feb 27 '24 14:02 noahtalerman

@noahtalerman would also be good to get a Fleet desktop notification on failed policies similar to https://github.com/fleetdm/fleet/issues/16264

Feb 29 '24 22:02 dherder

would also be good to get a Fleet desktop notification on failed policies

@dherder the current plan is to solve the problem of notifying the end user by getting in their calendar: #17230

Mar 01 '24 15:03 noahtalerman

@noahtalerman I see the calendar remediation as a separate issue. It works great when you want an end user to do a thing like update an app or perform an OS update. Where it doesn't work so great is if you want the remediation to be "execute a root level script", where if the user is a standard user, they just simply wouldn't be able to do it.

Mar 07 '24 19:03 dherder

Where it doesn't work so great is if you want the remediation to be "execute a root level script", where if the user is a standard user, they just simply wouldn't be able to do it.

@dherder I think the first iteration of "Fleet in your calendar" will address this.

The high level flow of the feature:

IT admin chooses which policies trigger calendar events
Calendar event is created when end user fails at least one of these policies
Webhook is fired when the calendar event starts
Automation tool (ex. Tines) receives the webhook and runs auto-remediation (ex. script)

Check out the user story for more details on the flow: https://github.com/fleetdm/fleet/issues/17230

What do you think?

Also, we didn't have room for this "Auto remediation of policy failure" story in the current design sprint (4.48).

Mar 11 '24 20:03 noahtalerman

@noahtalerman it still does not solve the problem of 3rd party solution integration, which is a blocker for some of our current customers and especially for prospective customers.

The expectation is that if Fleet has the script server-side & Fleet has a policy to check for a client state or attribute, that it would also have a way of executing the script on a policy failure without 3rd party integration required.

Couldn't Fleet just send the policy failure webhook to its own API endpoint for executing a script? Is there a technical concern like load on server due to script execution? Thanks.
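To make the question concrete, here is a rough sketch of what "webhook in, script run out" could look like from the outside. It assumes the failing policies webhook carries a `hosts` list with an `id` per host, and that scripts are run via `POST /api/v1/fleet/scripts/run` with `host_id` and `script_id`; please check both against the current REST API docs. The base URL, token, and script ID are placeholders, and the code only builds the requests without sending them.

```python
import json
import urllib.request
from typing import List


def build_run_script_request(
    base_url: str, api_token: str, host_id: int, script_id: int
) -> urllib.request.Request:
    """Build (but do not send) a run-script request for one host."""
    body = json.dumps({"host_id": host_id, "script_id": script_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/fleet/scripts/run",  # assumed endpoint path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def requests_for_failing_policy(
    webhook: dict, script_id: int, base_url: str, api_token: str
) -> List[urllib.request.Request]:
    """Create one run-script request per failing host in the webhook payload."""
    return [
        build_run_script_request(base_url, api_token, host["id"], script_id)
        for host in webhook.get("hosts", [])
    ]
```

Sending each request with `urllib.request.urlopen` is left out so the sketch has no side effects.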

cc @dherder @willmayhone88 @spokanemac @ksatter @pacamaster

May 02 '24 21:05 nonpunctual

@noahtalerman I presented the option of remediation through 3rd party automation tools today (IT buying scenario) and the feedback was that it would be a blocker to move forward with Fleet.

May 02 '24 21:05 dherder

Couldn't Fleet just send the policy failure webhook to its own API endpoint for executing a script? Is there a technical concern like load on server due to script execution? Thanks.

@nonpunctual no technical concern that I know of. It's just a matter of priorities/timing. Let's chat about it at feature fest!

May 07 '24 20:05 noahtalerman

csa:20240530

May 10 '24 15:05 nonpunctual

I'd like to see something like this with a drop-down next to each policy.

May 30 '24 20:05 spokanemac

Hey @dherder I updated this issue to user story format and moved your original issue description below for safekeeping. cc @marko-lisica

Problem

When a policy fails, Fleet can currently consume a webhook and send a response about the failures of the policy. Fleet can also provide guidance for the end user when a policy fails via Fleet Desktop.

Since we now have script execution capabilities, as an IT admin, I would like to execute a script automatically when a policy fails instead of triggering a webhook.

Potential solutions

In the automations dialog, have an extra option to "Run script".

Jun 25 '24 14:06 noahtalerman

Hey @zayhanlon and @dherder, we're dropping this one. The plan is to bring this one to the design sprint after the next. For more context see this doc.

Jul 11 '24 15:07 marko-lisica

Hey @randy-fleet chatted w/ @lukeheath and we decided to pull this one into the design sprint.

Like we discussed here, it's a requirement for #19372.

Design-wise I think we can borrow most of the UI/UX from the #19551 story.

If it's helpful, please feel free to throw some time on my calendar for Tues to chat.

Aug 30 '24 19:08 noahtalerman

Hey @dherder and @zayhanlon, heads up this didn't make the 3 week drafting timeline. We left it on the drafting board.

@lukeheath I think we want to bring this one through expedited drafting so that we can start working on it in the upcoming engineering sprint.

Sep 12 '24 14:09 noahtalerman

Hey @randy-fleet, I chatted w/ @sharon-fdm and @lukeheath and we decided to send this one to "Ready for spec" so we can unblock the #g-endpoint-ops team.

I unassigned you and moved your screens to the ready page in Figma.

The plan is for endpoint ops to estimate and start building w/ these screens + understanding that we want this feature to work like "Policy automations: install software" (#19551). Except now we're triggering script runs.

Randy, if you have any concerns w/ this plan, please let me know :)

Sep 13 '24 19:09 noahtalerman

@noahtalerman @marko-lisica a few questions that came up when this was estimated:

policy dropdown only shows compatible scripts, or all?
what does the global activity look like?
should we validate a script is selected if the checkbox is checked before allowing save?
is there an API design PR?
OK if script runs only the first time the policy fails, and only if it changes from passing to failing?
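For the last question, the proposed "only on a passing to failing change" behavior could be sketched like this. It is not a description of the shipped behavior. One assumption is labeled in the code: a host with no earlier result (`None`) that reports failing is treated as newly failing, which this thread does not confirm.

```python
from typing import Optional


def should_trigger_script(previous_passing: Optional[bool], current_passing: bool) -> bool:
    """Decide whether a new policy result should queue a script run.

    previous_passing is None when the host has no earlier result for the policy.
    """
    if current_passing:
        return False  # passing results never trigger a run
    # Failing now: trigger only if the host was passing before, or (assumption)
    # if this is the host's first result for the policy.
    return previous_passing is None or previous_passing is True
```

A host that keeps failing on later checks does not trigger the script again under this rule.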

Sep 16 '24 18:09 rachaelshaw

Hey @rachaelshaw, thanks for fielding these during estimation!

policy dropdown only shows compatible scripts, or all?

What do we mean by "compatible"? My guess is we're talking about platforms. If the policy's platforms are macOS, do we only show scripts for macOS?

If that's right, I think in this pass let's always show all scripts. I think this is consistent w/ showing all software for the install software automation.

what does the global activity look like?

If I'm understanding correctly, this is about generating activities when policy automations are updated. I think let's push that to a separate story to move quickly here.

@marko-lisica I added this to the "Update global activity feed" story (#21681) so we don't forget to get to it.

Screenshot 2024-09-16 at 2 17 25 PM

should we validate a script is selected if the checkbox is checked before allowing save?

What do we do for the install software policy automation? I think let's start w/ being consistent.

is there an API design PR?

No API design PR. The plan is to get the engineer's help on API design. I moved this checkbox along w/ the other TODOs to the engineering section.

OK if script runs only the first time the policy fails, and only if it changes from passing to failing?

What do we do for the install software policy automation? I think let's start w/ being consistent.

cc @lucasmrod @sharon-fdm

Sep 16 '24 18:09 noahtalerman

Given the ability to edit scripts via GitOps (the UI doesn't support this), should policies be reset when scripts are edited?

By way of comparison, policies are reset when "install this if the policy fails" is either added or removed, or if the software title referenced changes, but not if an installer itself is edited. The lack of change on installer edit might just be a miss on the original implementation though, since software installer edit and software installs on policy automation were in development at the same time.
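The comparison above, written out as a sketch. The `PolicyAutomation` type and its fields are hypothetical stand-ins, not Fleet's actual data model; the point is only that the reset decision looks at which software title is referenced, not at the installer's contents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyAutomation:
    # None means no "install this if the policy fails" automation is set.
    software_title_id: Optional[int] = None
    # Hypothetical stand-in for the installer's contents or version.
    installer_version: Optional[str] = None


def should_reset_policy(old: PolicyAutomation, new: PolicyAutomation) -> bool:
    """Reset when the automation is added, removed, or points at another title.

    An edit to the installer itself (same title) does not reset the policy.
    """
    return old.software_title_id != new.software_title_id
```

Applying the same rule to scripts would mean comparing a script reference instead of a software title.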

So maybe we don't reset policies if the script gets edited via GitOps for now, then revise behavior for both installers and scripts at the same time as a fast follow?

Sep 25 '24 02:09 iansltx

@noahtalerman @lukeheath @marko-lisica I think this is a great call-out @iansltx - how does everyone feel about putting this in the admin's hands with a banner / popup type thing, e.g.

"You are editing a script which is associated to a Fleet Policy. Editing the script may change the behavior of the Policy causing it to not generate a failure event. Based on the script changes would you like to reset the Policy?"

Sep 25 '24 13:09 nonpunctual

Right now we don't allow editing scripts via the UI so there's no place to put that banner. The only place that allows script edits is GitOps, so if we're putting a banner anywhere it would be in the guide about this.

Sep 25 '24 13:09 iansltx

@iansltx thanks for calling this out! And thanks for the @ mention @nonpunctual. I wouldn't have seen this otherwise.

maybe we don't reset policies if the script gets edited via GitOps for now, then revise behavior for both installers and scripts at the same time as a fast follow?

I think it's worth coming up w/ a solution now.

@marko-lisica can you please bring this through expedited drafting? See Brock's proposed solution here.

During design review, let's discuss your proposed solution and whether it makes sense to address it in this sprint v. in a fast follow (later iteration).

Sep 25 '24 15:09 noahtalerman

Well, @iansltx ~~this might be a good time to do~~ gentle reminder: https://github.com/fleetdm/fleet/issues/19925 :) @noahtalerman

Sep 25 '24 15:09 nonpunctual

@noahtalerman We don't support this for global policies, right?

Scripts are team-scoped, as are software install policy automations, so I'm assuming we can limit script executions to team policies, which allows us to use a script ID (and filter scripts for automation to the specific team) rather than doing something more indirect.

This means that inherited policies will not support script automations, but that's consistent with software install policy automations (just checked).
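A sketch of the validation this implies. It assumes a global ("All teams") policy is represented by a team ID of `None`; the function name, exception type, and error messages are made up for illustration.

```python
from typing import Optional


class AutomationError(ValueError):
    """Raised when a script automation cannot be attached to a policy."""


def validate_script_automation(
    policy_team_id: Optional[int], script_team_id: Optional[int]
) -> None:
    """Allow a script automation only on a team policy, with a script from that team."""
    if policy_team_id is None:
        # Assumption: None marks a global policy, which cannot run scripts.
        raise AutomationError("script automations are not supported on global policies")
    if script_team_id != policy_team_id:
        raise AutomationError("script must belong to the same team as the policy")
```

Because inherited policies are global policies, they fail the first check and get no script automation.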

Sep 25 '24 15:09 iansltx

Related: https://github.com/fleetdm/fleet/issues/17993 https://github.com/fleetdm/fleet/issues/19925

Sep 25 '24 15:09 nonpunctual

@iansltx it makes sense to keep it consistent w/ the install software automation.

In the UI I think let's show a similar disabled state w/ tooltip when managing automations for "All teams":

Screenshot 2024-09-25 at 12 16 33 PM

Please let me know if that^ is missing in Figma. I can add it quickly!

Sep 25 '24 16:09 noahtalerman

@noahtalerman I think it needs to be a dev note in Figma, as the Figma designs only show the team-specific UX.

Sep 25 '24 16:09 iansltx

@iansltx I added the tooltips and dev notes here in Figma:

Screenshot 2024-09-25 at 1 54 19 PM

Sep 25 '24 17:09 noahtalerman

@noahtalerman Based on the comments above, I just added the copy changes that we discussed during the review to a scratchpad here. I also created a PR to update the automatic install with policy automation guide here.

Could you take a look? What do you think?

Sep 26 '24 19:09 marko-lisica

@nonpunctual @iansltx For more context: we decided that it makes sense to reset the policy and keep it consistent with software install automation. For ex., if a user is testing a script and it's not working, once they upload a new one and tie it to a policy it will reset the count and it will run again on the failing hosts.

As the script can be edited only via GitOps, we don't have a way to show a warning. Once we get to FR to enable editing in the UI, we'll want to show the error message @nonpunctual proposed above.

Sep 26 '24 19:09 marko-lisica

@noahtalerman I think this makes sense. I didn't understand that until the comments about only managing scripts through GitOps. I have linked other issues related to script UI things & I think it will be great when we get to implement them. :)

Sep 26 '24 19:09 nonpunctual