Maintenance windows (Fleet in your calendar)

Goal

User story 1
As an IT admin,
I want Fleet to create an event in my end users' calendars if they're failing policies
so that I don't need to nudge them at inconvenient times when they're failing policies.

User story 2
As a security engineer,
I want Fleet to create an event in my end users' calendars if they're failing policies
so that I don't have to allowlist the CEO and worry that they'll never update.

Context

Product designer: @rachaelshaw

Changes

Product

[x] UI changes: Figma
[x] REST API changes: Draft PR
[ ] Permissions changes: If the permissions for choosing which policies trigger calendar events is the same as choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.
[ ] Outdated documentation changes:
- [ ] REST API docs: See draft PR
- [ ] Update policy automations docs with a new "Calendar events" section. Keep this section as short as possible and link to the article.
- [ ] Scan the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).
[x] Website redirects:
- [x] fleetdm.com/learn-more-about/creating-service-accounts
- [x] fleetdm.com/learn-more-about/calendar-events: Article on fleetdm.com
[x] Changes to paid features or tiers: Calendar integration is available in Fleet Premium

Engineering

Technical discussion is summarized in this document.
[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

In addition to what's below, see section 10 of the eng doc

Risk assessment

This feature adds:

DB migration and tables.
A job to go over all hosts and:
- schedule events (as calendar meeting slots).
- monitor the slots of all hosts for changes.
Risk level: High
Risk description: The main risk is performance. Other risks are logic bugs and potential interference with other jobs through performance degradation and DB access contention.

Manual testing steps

Requires load testing: Yes. Need to validate:
- No harm done to other jobs and DB access.
- The feature works properly with many hosts.
  - Max 20,000 hosts with calendar events at the same time and webhooks firing for all. (@noahtalerman to update this number if needed)

New things we will need to check/do:

Have a lot of agents report failing policies so we can schedule remediation slots for them. Agents will need to switch from failing to passing.
Create a Google Calendar environment with many users (thousands?).
A way to check that meeting slots were actually set and that, when moved, the changes are handled. (@xpkoala to add)

Configuring load test with real calendar

Enable plus addressing (so [email protected] is treated like [email protected]) by setting the undocumented env variable FLEET_GOOGLE_CALENDAR_PLUS_ADDRESSING=1
Create a team policy that will always fail with osquery-perf hosts containing query: select 0
Enroll osquery-perf hosts to that team
Decide on real people who will donate their calendars for load testing
Update the emails on the hosts using a script calling PUT fleet/hosts/:id/device_mapping and using plus addressing to ensure emails are unique
Enable calendar integration globally by using JSON from 1Password (Fleet in your calendar service account)
Enable team calendar integration for the failing policy
Were all events created in a reasonable time?
Is the cron job running every 5 minutes, or does it need much longer to finish?
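
The device mapping step above could be scripted roughly as follows. This is a minimal sketch, not Fleet's own tooling: the `{"email": ...}` request body, the bearer-token header, the `host<ID>` tag format, and the example.com addresses are assumptions for illustration. The requests are only built here, not sent; pass each one to `urllib.request.urlopen` to send it.

```python
import json
import urllib.request


def plus_address(base_email: str, tag: str) -> str:
    """Insert +tag before the @ so each host gets a unique address."""
    local, sep, domain = base_email.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {base_email!r}")
    return f"{local}+{tag}@{domain}"


def device_mapping_request(server_url, token, host_id, email):
    """Build (but do not send) the PUT request for one host.

    The JSON body shape is an assumption; check the REST API docs.
    """
    url = f"{server_url.rstrip('/')}/api/v1/fleet/hosts/{host_id}/device_mapping"
    body = json.dumps({"email": email}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def build_requests(server_url, token, host_ids, donor_emails):
    """Spread hosts round-robin over the donated calendars."""
    requests = []
    for i, host_id in enumerate(host_ids):
        donor = donor_emails[i % len(donor_emails)]
        email = plus_address(donor, f"host{host_id}")  # hypothetical tag format
        requests.append(device_mapping_request(server_url, token, host_id, email))
    return requests
```

Round-robin assignment keeps the number of events per donated calendar roughly even, which may matter if the calendar API rate-limits per user.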

Configuring load test with mock calendar

Create a team policy that will always fail with osquery-perf hosts containing query: select 0
Enroll osquery-perf hosts to that team
Update the emails on the hosts using a script calling PUT fleet/hosts/:id/device_mapping
Start up and configure mock calendar server. See /tools/calendar/README.md
After events are created, move them to the current time to test webhooks firing. Did all webhooks fire in a reasonable time?
- Note: The calendar cron job only checks calendar events every 30 minutes. May need to update MySQL calendar_events updated_at time to force a sooner check.

Modifying calendar event

The user should be able to modify the calendar event. Some situations to test:

Move event to the past -> Fleet should create a new event
Make event all-day -> Fleet should create a new event
Make the event 0 minutes long -> Fleet should fire webhook within the first 5 minutes of event starting
Add a guest to the event and decline yourself -> Fleet still treats this event as valid
Change timezone of the event -> Fleet still treats this event as valid, and webhook should fire at the right time
Move the event to a different calendar -> Fleet should create a new event (Fleet only has access to user's primary calendar)
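
The expected outcomes above can be written down as a small decision function, which may help when turning the list into test cases. This is a sketch of the QA expectations only, not Fleet's implementation; `Event`, `Action`, and `decide` are hypothetical names, and the guest/decline case is not modeled because it does not change the outcome.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

CRON_INTERVAL = timedelta(minutes=5)  # cadence of the calendar job, per this issue


class Action(Enum):
    KEEP = "keep existing event"
    RECREATE = "create a new event"
    FIRE_WEBHOOK = "fire webhook"


@dataclass
class Event:
    start: datetime  # timezone-aware
    end: datetime    # timezone-aware
    all_day: bool = False
    on_primary_calendar: bool = True


def decide(event: Event, now: datetime) -> Action:
    """Map a (possibly user-modified) event to the expected Fleet behavior."""
    # All-day events and events moved off the primary calendar are replaced.
    if event.all_day or not event.on_primary_calendar:
        return Action.RECREATE
    # Not started yet: the event is still valid.
    if event.start > now:
        return Action.KEEP
    # Zero-length event: counts as started only during the first cron interval.
    if event.start == event.end:
        if now - event.start < CRON_INTERVAL:
            return Action.FIRE_WEBHOOK
        return Action.RECREATE
    # In progress fires the webhook; already over means it was moved to the past.
    return Action.FIRE_WEBHOOK if now < event.end else Action.RECREATE
```

Because the datetimes are timezone-aware, changing an event's timezone does not change the comparison. The comments further down also discuss treating a 0-minute event as deleted; that variant would return `Action.RECREATE` from the zero-length branch.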

Cleanup

If the global setting is removed, all calendar events are removed from the MySQL DB.
If the team setting is disabled, all calendar events for that team are deleted.
calendar_events rows that have not been updated in 48 hours are deleted (updated_at column).
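
A minimal sketch of the three cleanup rules above, useful for reasoning about which rows a test should expect to disappear. The row shape (`id`, `team_id`, `updated_at` keys) and the function name are assumptions; the actual table may not carry a team ID directly.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)


def events_to_delete(events, global_enabled, enabled_team_ids, now):
    """Return the IDs of calendar event rows the cleanup rules would remove.

    events: list of dicts with hypothetical keys id, team_id, updated_at.
    """
    # Rule 1: global integration removed, so everything goes.
    if not global_enabled:
        return [e["id"] for e in events]
    cutoff = now - STALE_AFTER
    return [
        e["id"]
        for e in events
        # Rule 2: team integration disabled. Rule 3: not updated in 48 hours.
        if e["team_id"] not in enabled_team_ids or e["updated_at"] < cutoff
    ]
```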

Interesting corner cases

User (email) has 2 hosts on separate teams that are failing policies -- only 1 event for 1 host should be created, and 1 webhook fired.
Host email changes to another user -- the cleanup job should delete the existing event. A new one is created if one doesn't exist already.
Host transferred to another team -- the cleanup job should delete the existing event.
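
The first corner case (one event per end user, even across teams) amounts to deduplicating failing hosts by email. A sketch under assumptions: the issue does not say which host wins, so picking the lowest host ID here is an arbitrary, hypothetical tie-break, as is the case-insensitive email comparison.

```python
def plan_events(failing_hosts):
    """Pick one host per email from (host_id, team_id, email) tuples.

    Returns {email: (host_id, team_id)} with one entry per end user.
    """
    chosen = {}
    # Sorting makes the choice deterministic: the lowest host ID wins (assumption).
    for host_id, team_id, email in sorted(failing_hosts):
        chosen.setdefault(email.lower(), (host_id, team_id))
    return chosen
```

For example, two failing hosts on teams 1 and 2 that share one email produce a single entry, so a single event and a single webhook.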

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

Feb 28 '24 14:02 noahtalerman

is there a point at which this will add .ics files sent to users as part of the feature? The vast majority of enterprises use Microsoft email services.

Mar 06 '24 15:03 nonpunctual

is there a point at which this will add .ics files sent to users as part of the feature?

@nonpunctual yes. Outlook will likely come after Google Calendar.

Mar 06 '24 15:03 noahtalerman

Hey @sharon-fdm I moved your Figma comments below.

Please ask questions and make comments in the GitHub issue here so they're all in one place and easy to find :)

Screenshot 2024-03-06 at 10 35 02 AM

Yes. The plan is to create one meeting for each end user. Even if they have more than one host.

Screenshot 2024-03-06 at 10 36 12 AM

Correct.

In Fleet, a host can have many emails (end users) associated with it.

Fleet will filter this list of emails by emails w/ the matching domain configured by the IT admin (see where the IT admin will configure the domain in Figma here)

cc @getvictor
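
The domain filter described above could look roughly like this. It is a sketch only: the entry shape follows the device_mapping response quoted later in this thread, while the function name and the case-insensitive match are assumptions.

```python
def calendar_emails(device_mapping, domain):
    """Keep only the emails whose domain matches the configured one."""
    # Prefixing "@" avoids matching lookalike domains such as notexample.com.
    wanted = "@" + domain.lower().lstrip("@")
    return [
        entry["email"]
        for entry in device_mapping
        if entry["email"].lower().endswith(wanted)
    ]
```

If more than one email survives the filter, a further rule is still needed to pick one.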

Mar 06 '24 15:03 noahtalerman

cc @rachaelshaw ^^ (forgot to @ mention you)

Mar 06 '24 15:03 noahtalerman

Thanks @noahtalerman.

Mar 06 '24 15:03 sharon-fdm

Screenshot 2024-03-06 at 10 41 38 AM

@getvictor you made me realize we could simplify the current plan:

Screenshot 2024-03-06 at 10 41 51 AM

Instead of only scheduling calendar events on Monday (after first enabling), if the end user doesn't have a calendar event, we always schedule one on the upcoming Friday.

Here's how that would look:

Screenshot 2024-03-06 at 10 48 29 AM

This makes the experience more consistent for the end user and IT admin. The expectation becomes, once the end user starts failing one or more policies (no matter what day it is), the calendar event is going to show up on Friday.

@rachaelshaw and Victor, what do y'all think?

Mar 06 '24 15:03 noahtalerman

Screenshot 2024-03-06 at 10 50 43 AM

@getvictor this makes sense to me.

If we removed the calendar event after the event has started, I think I would be confused as an end user. Did the IT team do its thing?

Mar 06 '24 15:03 noahtalerman

Sorry to barge in here... maybe I am misunderstanding the intent.

My opinion is that Friday is not a great day to use as a default. Lots of orgs:

don't do a lot on Fridays
don't have people come in on Fridays
have specific policies against doing important things on Fridays
- this includes NOT doing stuff to people's computers
- why? because if something goes wrong it means someone will have to work on the weekend
Microsoft Patch Tuesday is a thing for a reason...

Ideally, the feature should allow admins to pick their default / starting day.

Thanks.

Mar 06 '24 15:03 nonpunctual

I did not go looking for this article... It's # 1 on Hacker News: https://deploybot.com/blog/no-deployments-on-fridays-a-good-practice-for-software-development-teams

Mar 07 '24 14:03 nonpunctual

@getvictor added copy for the error message if someone tries to configure >1 service account: https://www.figma.com/file/p81nWodxL04YD7iNyr9xVa/%2317230-Fleet-in-your-calendar?type=design&node-id=550%3A12396&mode=design&t=q38NTaiMoLDwN89S-1

Mar 12 '24 17:03 rachaelshaw

Hey @sharon-fdm, should we move this to the release board and assign an engineer?

Looks like this user story is estimated and being worked on (all subtasks are on the release board).

cc @rachaelshaw

Mar 12 '24 20:03 noahtalerman

@noahtalerman Yes. We forgot to move the top story. THX! Done.

Mar 12 '24 20:03 sharon-fdm

Hey @mostlikelee I pulled your Figma comment below.

Please add questions/comments in the GitHub issue here so they're all in one place and easy to find :)

Screenshot 2024-03-14 at 8 51 23 AM

Yes, I think that's the plan.

My understanding is that the host will be offline (no webhook) => cleanup job runs and removes the end user's calendar event => "failing policies check/calendar event" job runs and creates a new event for the next month (3rd Tues).

@getvictor please correct me if I'm wrong.

Mar 14 '24 12:03 noahtalerman

@noahtalerman @mostlikelee @lucasmrod

The job runs every 5 minutes, which means it will run 5-6 times during the actual meeting time, and we will check if the host is online every time. (Note: We should only check Google Calendar once, and not 5-6 times, keeping with our 30-minute cadence.)

Then, when the job runs the 7th time, the event is in the past now, but it is still the 3rd Tue (assuming the event started at 9 am). So we should schedule it for the 3rd Tue a month from now. If we schedule it for the next day, we risk spamming the calendar and annoying users.

As opposed to the deletion flow, where the user deliberately deletes the event, in which case we will reschedule for the next business day (3rd Wed, Thu, Fri, Mon, etc.)
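
To make the two rescheduling rules concrete, here is a sketch of the date arithmetic using only the standard library. The function names are hypothetical, and holidays are ignored when computing the next business day.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday(): Monday is 0


def third_tuesday(year: int, month: int) -> date:
    """Return the third Tuesday of the given month."""
    first = date(year, month, 1)
    offset = (TUESDAY - first.weekday()) % 7  # days until the first Tuesday
    return first + timedelta(days=offset + 14)


def next_third_tuesday(after: date) -> date:
    """First third-Tuesday strictly after `after` (missed-event flow)."""
    candidate = third_tuesday(after.year, after.month)
    if candidate > after:
        return candidate
    if after.month == 12:
        return third_tuesday(after.year + 1, 1)
    return third_tuesday(after.year, after.month + 1)


def next_business_day(day: date) -> date:
    """Next Monday-to-Friday day after `day` (deleted-event flow)."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # skip Saturday and Sunday
        nxt += timedelta(days=1)
    return nxt
```

For an event missed on Tuesday, March 19, 2024, `next_third_tuesday` gives April 16, 2024, while a deleted event would move to `next_business_day`, March 20.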

Mar 14 '24 13:03 getvictor

Calendar Email Address: I believe we'll attempt to get the user email addr from host_emails, but it seems a host could have multiple emails associated. Which will we choose?

from the API docs for GET /api/v1/fleet/hosts/1/device_mapping :

{
  "host_id": 1,
  "device_mapping": [
    {
      "email": "[email protected]",
      "source": "identity_provider"
    },
    {
      "email": "[email protected]",
      "source": "google_chrome_profiles"
    },
    {
      "email": "[email protected]",
      "source": "custom"
    }
  ]
}

Mar 14 '24 14:03 mostlikelee

Is refetching the host before triggering the webhook a must?

Refetching is an expensive (DB write) operation (if done at the same time by thousands of hosts).
Policy results are refreshed every 1 hour.

How much of a common case is a user fixing its policies 1 hour before the calendar event? Is it a problem if a webhook is triggered on a healthy host (because the policies were fixed e.g. 30m before the calendar event)?

@getvictor

Mar 14 '24 16:03 lucasmrod

Is refetching the host before triggering the webhook a must?

Refetching is an expensive (DB write) operation (if done at the same time by thousands of hosts).

Policy results are refreshed every 1 hour.

How much of a common case is a user fixing its policies 1 hour before the calendar event? Is it a problem if a webhook is triggered on a healthy host (because the policies were fixed e.g. 30m before the calendar event)?

@getvictor

We believe it will be common for users to fix their policy issues right before the event starts. So the user must also know to refresh their host on the My Device page for the DB to be up to date.

If we fire webhook on a healthy host, then IT may restart their computer for no reason.

Another option is to adjust the policy refresh time ahead of time, so that it is done right before the event.

And then we can also look when policy results were refreshed, and not refetch if it was done within last 5-10 minutes.

It is true that if we have 10K+ hosts with 9am calendar events, then we may be overloaded. Perhaps we should spread out the calendar events?

cc: @noahtalerman
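
The "don't refetch if results are fresh" idea could be sketched as below. The 10-minute window is the upper end of the 5-10 minute range mentioned above; the function name and the None-means-never-reported convention are assumptions.

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(minutes=10)  # upper end of the 5-10 minute suggestion


def needs_refetch(policy_updated_at, now, fresh_window=FRESH_WINDOW):
    """Refetch only when policy results are missing or older than the window."""
    if policy_updated_at is None:  # host has never reported policy results
        return True
    return now - policy_updated_at > fresh_window
```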

Mar 14 '24 16:03 getvictor

Plus refetching means (if we implement to refetch only policies):

Set a refetch flag for policies, then wait 10s (or whatever the distributed_interval may be, some customers have this be a higher value, e.g. 60s) for the host to have pushed the new results of the policy queries. So for 10k hosts at 10s distributed interval this job might take 27 hours? (Or if we do a refetch in batch it might be an expensive operation on the DB writer).
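
The 27-hour figure follows from refetching hosts one at a time: 10,000 hosts × 10 s = 100,000 s, which is about 27.8 hours. A small sketch of that arithmetic, with a hypothetical `batch_size` parameter to show how batching changes the estimate (at the cost of heavier DB writes per batch):

```python
import math


def refetch_hours(hosts: int, distributed_interval_s: float, batch_size: int = 1) -> float:
    """Estimated wall-clock hours if each batch waits one distributed interval."""
    batches = math.ceil(hosts / batch_size)
    return batches * distributed_interval_s / 3600
```

With a 60 s distributed interval, the one-at-a-time estimate grows to roughly 167 hours.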

Mar 14 '24 19:03 lucasmrod

@noahtalerman If user modifies the event to an all-day event, we will treat it as deleted -- it is a bit of a hassle trying to support this since we need the user's timezone, and the user can change the timezone of their calendar without actually modifying the event.

Mar 15 '24 16:03 getvictor

@lucasmrod If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time? cc: @noahtalerman

Mar 15 '24 17:03 getvictor

If user modifies the event to an all-day event, we will treat it as deleted

@getvictor if I'm understanding, in this case, we'll schedule a new event on top of the all-day event?

Like this: Screenshot 2024-03-15 at 4 01 44 PM

I think that's ok.

Mar 15 '24 20:03 noahtalerman

If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time?

@getvictor we could also treat it as deleted. What's easier?

Mar 15 '24 20:03 noahtalerman

If user modifies the event to an all-day event, we will treat it as deleted

@getvictor if I'm understanding, in this case, we'll schedule a new event on top of the all-day event?

Like this:

I think that's ok.

We'll schedule the event for the next day from the originally scheduled time.

If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time?

@getvictor we could also treat it as deleted. What's easier?

Treating it as deleted is easier. But on my calendar, a 0-minute event looks the same as a 30-minute event. So, it would seem confusing why another event was created if one is already on the schedule.

Mar 15 '24 20:03 getvictor

@noahtalerman @lucasmrod If the event has started, should we refetch it from Google calendar to make sure it wasn't moved/deleted at the last minute? Currently, we only refetch every 30 minutes.

Mar 18 '24 13:03 getvictor

Treating it as deleted is easier.

@getvictor I think we can treat it as deleted for now. As long as we're not closing the door on changing this behavior later.

If the event has started, should we refetch it from Google calendar to make sure it wasn't moved/deleted at the last minute?

@getvictor I think yes. Imagine Mike has an "oh shoot" moment 15 mins before the board meeting starts so he moves the event.

Mar 19 '24 14:03 noahtalerman

@noahtalerman @lucasmrod What if a person has multiple hosts on different teams? It seems like they should have 2 calendar events (but they can be simultaneous) because different webhooks need to be fired. Or do you prefer 1 event, and we fire 2+ webhooks in the backend? I think we can leave handling this for the next version, but we might want to do some DB changes/assumptions now.

Mar 21 '24 14:03 getvictor

What if a person has multiple hosts on different teams?

@getvictor I think we'll want one calendar event per user. One 30 minute downtime event to autoremediate all my devices.

Mar 21 '24 20:03 noahtalerman

From customer conversation:

Apr 05 '24 21:04 nonpunctual

[ ] Update policy automations docs

[ ] Permissions changes: TODO - add "Manage access" page changes to draft PR

@noahtalerman, reminder to update the docs w/ link to videos to set up Fleet in your calendar: https://www.loom.com/share/9fbdff2998be4877b95ec6702c6c062c?sid=6602e703-aa5a-450a-b092-b5d28eb6e311

Apr 11 '24 18:04 noahtalerman

[ ] Update policy automations docs

TODO: Worth doing a scan of the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).

TODO: What would happen if you enable calendar automations for a "No team" policy? Add something to GitOps that adds a global policy w/ automations enabled.

[ ] Permissions changes: TODO - add "Manage access" page changes to draft PR

TODO: If the permissions for choosing which policies trigger calendar events is the same as choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.

@rachaelshaw when you get the chance, can you please take these on? Thanks!

Apr 19 '24 19:04 noahtalerman
