
Improve queued activity/policy reporting responsiveness of agent after installs/script runs

Open · iansltx opened this issue · 7 comments

Goal

User story
As an IT admin,
I want to quickly see the impact of script runs and software installs on their associated policies
so that I can confirm that automated remediations are working without waiting an hour.

Key result

A tighter feedback loop for policies: when an action is taken that would make a policy pass, the policy passes at that point rather than up to an hour later.

Context

  • Product Designer: @iansltx

When switching to the unified queue, the Fleet agent may be slower to process queued activities because, to maintain ordering, activities are handed to the agent one at a time.

Even without the unified queue, a policy designed to remediate an issue via a script run or software install won't have its status updated until (by default) an hour after the failing query run that triggered the script run or install.

More context from @getvictor:

This makes sense, and we talked about batching the same types of commands. For example, if there are 3 Apple MDM commands in a row, we will transfer them as a block to the nano queues.

Similarly, if we have 3 scripts in a row, we can allow fleetd to get all of them as a batch, and let fleetd handle the order.

More context from @gillespi314:

As I mentioned in the [Eng Together 2025-01-29] meeting chat, we’ll probably need to give some careful thought on the agent side to the mechanism for triggering calls to the /config endpoint (i.e. the Orbit check-in endpoint) from inside various config receivers (such as the ones in /orbit/pkg/update/notifications.go). As it currently stands, these API calls are controlled centrally by a top-level OrbitClient.
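
To make that concrete, here's a minimal sketch (assuming a channel-based trigger; the names below are invented, not fleetd's actual API) of how a config receiver could request an out-of-band /config call while the top-level OrbitClient keeps ownership of the actual request:

```go
// Hypothetical sketch only: CheckinTrigger and runLoop don't exist in fleetd;
// they illustrate one way receivers could request an out-of-band /config call.
package main

import (
	"fmt"
	"time"
)

// CheckinTrigger lets a config receiver ask for an immediate check-in without
// owning the HTTP client itself; the top-level loop still makes the call.
type CheckinTrigger struct {
	ch chan struct{}
}

func NewCheckinTrigger() *CheckinTrigger {
	// Buffer of 1 so repeated requests coalesce into a single extra check-in.
	return &CheckinTrigger{ch: make(chan struct{}, 1)}
}

// Request is called from a receiver, e.g. right after posting script results.
func (t *CheckinTrigger) Request() {
	select {
	case t.ch <- struct{}{}:
	default: // a check-in is already pending; drop the duplicate request
	}
}

// runLoop stands in for the top-level loop that currently owns /config calls.
func runLoop(t *CheckinTrigger, interval time.Duration, checkin func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			checkin() // regular interval-based check-in
		case <-t.ch:
			checkin() // accelerated check-in requested by a receiver
		}
	}
}

func main() {
	trigger := NewCheckinTrigger()
	go runLoop(trigger, 30*time.Second, func() { fmt.Println("GET /config") })
	trigger.Request() // simulate a receiver finishing a script run
	time.Sleep(100 * time.Millisecond)
}
```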

Changes

Product

  • [ ] UI changes: None. Automated policy pass/fail statuses will be fresher, but no frontend changes are needed.

  • [ ] REST API changes: No changes to public endpoints.

  • [ ] Fleet's agent (fleetd) changes: When an install or script run posts its results, check whether the response to that request contains a query to run. If a query is provided, have osquery run it and report back on the distributed endpoint for just that query. Additionally, check in with the primary Orbit status endpoint to see whether there are more installs to perform or scripts to run, rather than waiting for the normal check-in interval. (A sketch of this flow follows this checklist.)

As an alternative to the above (if it's difficult for Orbit to tell osquery to run a query), we could revise our policy machinery to mark a policy query as due (in policyQueriesForHost) when we get a successful result from the associated policy automation. This would require no fleetd changes, but it wouldn't process script/install queues as quickly and wouldn't update policies quite as quickly. (A server-side sketch of this option follows the Engineering checklist below.)

We could also combine both approaches for the best fleetd compatibility and performance; it should be straightforward to keep these changes compatible across version skew (e.g., a new Fleet server with an old fleetd, or vice versa).

  • [ ] First draft of test plan added
  • [ ] Other reference documentation changes: Might want to update the policy implementation details article
  • [ ] Once shipped, requester has been notified
  • [ ] Once shipped, dogfooding issue has been filed
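
A minimal sketch of the first (fleetd-side) option, assuming a hypothetical pending_policy_query field in the script/install result response and invented interfaces for the Orbit client and the local osquery runner; treat it as an illustration of the flow, not the actual implementation:

```go
// Illustrative sketch of the first (fleetd-side) option. Every type, method,
// and field name here is a hypothetical stand-in, not actual fleetd code.
package remediation

import "log"

// ScriptResultResponse is what the server might return when fleetd posts a
// script (or install) result; PendingPolicyQuery is the hypothetical addition.
type ScriptResultResponse struct {
	PendingPolicyQuery *PolicyQuery `json:"pending_policy_query,omitempty"`
}

// PolicyQuery identifies the policy to re-evaluate and the osquery SQL to run.
type PolicyQuery struct {
	ID  uint   `json:"id"`
	SQL string `json:"query"`
}

// QueryRunner abstracts "ask the local osquery instance to run one query".
type QueryRunner interface {
	Run(sql string) (rows []map[string]string, err error)
}

// Client abstracts the calls fleetd makes back to the Fleet server.
type Client interface {
	PostScriptResult(executionID, output string) (ScriptResultResponse, error)
	PostPolicyResult(policyID uint, passed bool) error
	RequestImmediateCheckin() // e.g., the trigger mechanism sketched earlier
}

// ReportScriptResult posts a script result, then short-circuits the normal
// wait: re-run the associated policy query (if any) and check in right away.
func ReportScriptResult(c Client, q QueryRunner, executionID, output string) {
	resp, err := c.PostScriptResult(executionID, output)
	if err != nil {
		log.Printf("posting script result: %v", err)
		return
	}
	// If the server told us which policy query this script was remediating,
	// run just that query now instead of waiting for the next policy interval.
	if pq := resp.PendingPolicyQuery; pq != nil {
		if rows, err := q.Run(pq.SQL); err == nil {
			// A Fleet policy passes when its query returns at least one row.
			if err := c.PostPolicyResult(pq.ID, len(rows) > 0); err != nil {
				log.Printf("posting policy result: %v", err)
			}
		}
	}
	// Pull any remaining queued scripts/installs immediately rather than
	// waiting for the regular Orbit check-in interval.
	c.RequestImmediateCheckin()
}
```

The same flow would apply to install results; only the result-posting call differs.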

Engineering

  • [ ] Test plan is finalized
  • [ ] Contributor API changes: For the more performant (first) implementation option, add query metadata (default empty) to the Orbit install and script result endpoint responses, along the lines of the hypothetical pending_policy_query field sketched above. For the alternative implementation, no API changes.
  • [ ] Feature guide changes: No changes.
  • [ ] Load testing: Add a large number of hosts to a team that has a policy that fails before its remediation script has executed (e.g., a script that places a file) and succeeds afterward. We'll need to tweak osquery-perf to simulate this behavior. Load impact from the accelerated query interaction should be negligible.
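
For comparison, a sketch of the alternative (server-only) option described in the Product section above, assuming an invented in-memory store consulted from policyQueriesForHost; a real implementation would likely persist this state in Fleet's datastore rather than in memory:

```go
// Sketch of the alternative (server-only) option; the store and signatures
// are invented for illustration, though policyQueriesForHost is the real
// decision point mentioned in this issue.
package policydue

import "sync"

// dueStore remembers which policies should be re-evaluated early, per host.
type dueStore struct {
	mu  sync.Mutex
	due map[uint]map[uint]bool // hostID -> policyID -> due
}

func newDueStore() *dueStore {
	return &dueStore{due: make(map[uint]map[uint]bool)}
}

// MarkDue is called when a policy automation (script/install) reports success.
func (s *dueStore) MarkDue(hostID, policyID uint) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.due[hostID] == nil {
		s.due[hostID] = make(map[uint]bool)
	}
	s.due[hostID][policyID] = true
}

// ShouldRun would be consulted inside policyQueriesForHost: a policy query
// runs if its normal freshness interval has elapsed OR it was marked due.
func (s *dueStore) ShouldRun(hostID, policyID uint, intervalElapsed bool) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.due[hostID][policyID] {
		delete(s.due[hostID], policyID) // one-shot: clear the flag once served
		return true
	}
	return intervalElapsed
}
```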

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: Yes
  • Risk level: Low

Test plan

  1. Add four policies: one with a query that fails before the attached automation script runs and passes afterward; one with a query that fails before the attached software install completes and passes afterward; one with a query that fails both before and after its script runs; and one whose attached script fails. (An example policy/script pair is sketched after this test plan.)
  2. Enroll a host. Two scripts and an install should queue.
  3. Once each successful action completes, confirm that the associated policy status has been updated (two policies flip to passing, one remains failing) without the need for a refetch. The policy whose script failed should not have updated.
  4. If the more performant implementation was used, HTTP traffic to the Fleet server should include an Orbit check-in immediately after each script or installer result request is sent.
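
For reference, one hypothetical policy/script pair that would satisfy step 1; the marker path and queries are arbitrary choices shown here as Go constants, not anything prescribed by this issue:

```go
// Hypothetical test fixtures for the test plan; paths and SQL are arbitrary.
package main

import "fmt"

const (
	// Policy query: fails while the marker file is absent, passes once the
	// remediation script has created it (osquery's file table returns a row
	// only when the path exists).
	policyQuery = `SELECT 1 FROM file WHERE path = '/tmp/fleet_remediation_marker';`

	// Remediation script attached to the policy's script automation.
	remediationScript = `#!/bin/sh
touch /tmp/fleet_remediation_marker`

	// A policy that fails both before and after the script runs (third policy
	// in step 1) can simply reference a path the script never creates.
	alwaysFailingQuery = `SELECT 1 FROM file WHERE path = '/tmp/never_created';`
)

func main() {
	fmt.Println(policyQuery, remediationScript, alwaysFailingQuery)
}
```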

Testing notes

Confirmation

  1. [ ] Engineer: Added comment to user story confirming successful completion of test plan.
  2. [ ] QA: Added comment to user story confirming successful completion of test plan.

iansltx · Feb 02 '25

Needs to be updated to new process: https://fleetdm.com/handbook/engineering#create-an-engineering-initiated-story.

lukeheath · Mar 31 '25

@lukeheath How's this look?

iansltx · Apr 08 '25

@iansltx Format looks good! Does the fleetctl trigger command allow you to trigger the policy runs manually? I'm wondering if that is a workable interim solution.

lukeheath · Apr 08 '25

@lukeheath it does not, as policy queries are queued based on freshness per-host, rather than globally. Good for load, but this means there's no "push policy queries everywhere" cron. We do have a cron for calculating policy stats but that's independent of the queries themselves.

iansltx · Apr 08 '25

@iansltx If a contributor or a user needs fresher data, can they use the /refetch endpoint? That does re-evaluate policies on the next host check-in.

lukeheath · Apr 09 '25

@lukeheath Yep! The only caveat there is that a full refetch is heavier than evaluating a single policy query.

FWIW IMO this issue is currently a nice-to-have rather than a burning need based on what I've seen of competitors' solutions and customer feedback. This might move up the priority queue as auto-install/patching, particularly in the context of vulnerability remediation, gets more actively used, as IT admins would expect any automatic remediations to also automatically flip the affected policies to green.

iansltx · Apr 09 '25

@iansltx Got it. Agreed, my feeling is this is a nice-to-have while we have other need-to-haves to prioritize. Good ticket to keep open as it's a definite spot for improvement in the product that doesn't require product changes.

lukeheath · Apr 09 '25

I won't be asking customer-cisneros for a snippet on this, but this came up as a large pain point while testing a relatively complicated policy where we had multiple possible states to test. Refetching the host manually works, but takes much longer than re-evaluating a single policy would, and results in unnecessary resource consumption both on the host and in Fleet.

ksatter · May 30 '25