Maintenance windows (Fleet in your calendar)
Goal
User story 1 |
---|
As an IT admin, |
I want Fleet to create an event in my end users' calendars if they're failing policies |
so that I don't need to nudge them at inconvenient times when they're failing policies. |
User story 2 |
---|
As a security engineer, |
I want Fleet to create an event in my end users' calendars if they're failing policies |
so that I don't have to allowlist the CEO and worry that they'll never update. |
Context
- Product designer: @rachaelshaw
Changes
Product
- [x] UI changes: Figma
- [x] REST API changes: Draft PR
- [ ] Permissions changes: If the permissions for choosing which policies trigger calendar events are the same as for choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.
- [ ] Outdated documentation changes:
- [ ] REST API docs: See draft PR
- [ ] Update policy automations docs with a new "Calendar events" section. Keep this section as short as possible and link to the article.
- [ ] Scan the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).
- [x] Website redirects:
- [x] fleetdm.com/learn-more-about/creating-service-accounts
- [x] fleetdm.com/learn-more-about/calendar-events: Article on fleetdm.com
- [x] Changes to paid features or tiers: Calendar integration is available in Fleet Premium
Engineering
- Technical discussion is summarized in this document.
- [ ] Database schema migrations: TODO
- [ ] Load testing: TODO
ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".
QA
In addition to what's below, see section 10 of the eng doc
Risk assessment
This feature adds:
- DB migration and tables.
- A job to go over all hosts and:
- Schedule events (as calendar meeting slots).
- Monitor the slots of all hosts for changes.
- Risk level: High
- Risk description: The main risk is performance. Other risks are logical bugs and potential interference with other jobs through performance and DB access.
Manual testing steps
- Requires load testing: Yes. Need to validate:
- No harm done to other jobs and DB access.
- The feature works properly with many hosts.
- Max 20,000 hosts with calendar events at the same time and webhooks firing for all. (@noahtalerman to update this number if needed)
New things we will need to check/do:
- Have a lot of agents report failing policies so we can schedule remediation slots for them. Agents will need to switch from failing to passing.
- Create a Google Calendar environment with many users (thousands?)
- A way to check that meeting slots were actually set and that moved slots are handled. ( @xpkoala to add )
Configuring load test with real calendar
- Enabling plus addressing (so [email protected] is treated like [email protected]) by setting the undocumented env variable
FLEET_GOOGLE_CALENDAR_PLUS_ADDRESSING=1
- Create a team policy that will always fail with osquery-perf hosts containing query:
select 0
- Enroll osquery-perf hosts to that team
- Decide on real people who will donate their calendars for load testing
- Update the emails on the hosts using a script calling
PUT fleet/hosts/:id/device_mapping
and using plus addressing to ensure emails are unique
- Enable calendar integration globally by using JSON from 1Password (Fleet in your calendar service account)
- Enable team calendar integration for the failing policy
- Were all events created in a reasonable time?
- Is cron job running every 5 minutes, or does it need much longer to finish?
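The host email update step above could be scripted roughly as follows. This is a minimal sketch, not the script used for the load test: the `/api/v1` prefix is taken from the GET example later in this thread, the `{"email": ...}` request body is an assumption, and `donor@example.com` is a placeholder. It only builds the requests; sending them with an API token is left to the caller.

```python
def plus_addressed_emails(base_email, host_ids):
    """Map each host ID to a unique plus-addressed variant of base_email."""
    local, _, domain = base_email.partition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {base_email!r}")
    return {host_id: f"{local}+{host_id}@{domain}" for host_id in host_ids}


def device_mapping_requests(base_email, host_ids):
    """Build (method, path, JSON body) tuples for each host.

    The body shape {"email": ...} is an assumption for illustration.
    """
    emails = plus_addressed_emails(base_email, host_ids)
    return [
        ("PUT", f"/api/v1/fleet/hosts/{host_id}/device_mapping", {"email": email})
        for host_id, email in emails.items()
    ]


if __name__ == "__main__":
    for method, path, body in device_mapping_requests("donor@example.com", [1, 2, 3]):
        print(method, path, body)
```

Using the host ID as the plus suffix keeps the addresses unique per host while every event still lands in the one donated calendar.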
Configuring load test with mock calendar
- Create a team policy that will always fail with osquery-perf hosts containing query:
select 0
- Enroll osquery-perf hosts to that team
- Update the emails on the hosts using a script calling
PUT fleet/hosts/:id/device_mapping
- Start up and configure mock calendar server. See /tools/calendar/README.md
- After events are created, move them to the current time to test webhooks firing. Did all webhooks fire in a reasonable time?
- Note: The calendar cron job only checks calendar events every 30 minutes. May need to update MySQL calendar_events updated_at time to force a sooner check.
Modifying calendar event
The user should be able to modify the calendar event. Some situations to test:
- Move event to the past -> Fleet should create a new event
- Make event all-day -> Fleet should create a new event
- Make the event 0 minutes long -> Fleet should fire webhook within the first 5 minutes of event starting
- Add a guest to the event and decline yourself -> Fleet still treats this event as valid
- Change timezone of the event -> Fleet still treats this event as valid, and webhook should fire at the right time
- Move the event to a different calendar -> Fleet should create a new event (Fleet only has access to user's primary calendar)
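The expected outcomes above can be written down as a small decision function, which may help when turning them into test cases. This is a sketch under assumptions: the `Event` fields, the action names, and the check order are hypothetical, and the 5-minute value is the cron interval mentioned elsewhere in this issue. Timezone changes need no special case as long as timestamps are timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CRON_INTERVAL = timedelta(minutes=5)  # cadence of the calendar job


@dataclass
class Event:
    start: datetime  # timezone-aware
    end: datetime  # timezone-aware
    all_day: bool = False
    on_primary_calendar: bool = True


def expected_action(event, now):
    """Return the action the test steps expect for an event as seen at `now`."""
    if not event.on_primary_calendar or event.all_day:
        return "create_new_event"
    # A 0-minute event still gets one cron interval in which to fire.
    window_end = max(event.end, event.start + CRON_INTERVAL)
    if now >= window_end:
        return "create_new_event"  # event is in the past
    if now >= event.start:
        return "fire_webhook"
    return "keep"
```

Adding a guest and declining does not appear in the function because it leaves the event valid and its times unchanged.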
Cleanup
- if global setting is removed, all calendar events from MySQL DB are removed
- if team setting is disabled, all calendar events for that team are deleted
- calendar_events that have not been updated in 48 hours are deleted (updated_at column)
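The 48-hour rule can be checked against exported rows with a helper like the one below. It is an illustrative sketch only: the rows are assumed to be dicts carrying the `updated_at` column plus a hypothetical `id` key, and the real cleanup job may be implemented differently.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)


def stale_event_ids(rows, now):
    """Return IDs of calendar_events rows not updated within the last 48 hours."""
    return [row["id"] for row in rows if now - row["updated_at"] > STALE_AFTER]


if __name__ == "__main__":
    now = datetime(2024, 3, 20, 12, 0, tzinfo=timezone.utc)
    rows = [
        {"id": 1, "updated_at": now - timedelta(hours=49)},
        {"id": 2, "updated_at": now - timedelta(hours=1)},
    ]
    print(stale_event_ids(rows, now))
```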
Interesting corner cases
- User (email) has 2 hosts on separate teams that are failing policies -- only 1 event for 1 host should be created, and 1 webhook fired.
- Host email changes to another user -- the cleanup job should delete the existing event. A new one is created if one doesn't exist already.
- Host transferred to another team -- the cleanup job should delete the existing event.
Testing notes
Confirmation
- [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
- [ ] QA (@____): Added comment to user story confirming successful completion of QA.
is there a point at which this will add .ics files sent to users as part of the feature? The vast majority of enterprises use Microsoft email services.
is there a point at which this will add .ics files sent to users as part of the feature?
@nonpunctual yes. Outlook will likely come after Google Calendar.
Hey @sharon-fdm I moved your Figma comments below.
Please ask questions and make comments in the GitHub issue here so they're all in one place and easy to find :)
Yes. The plan is to create one meeting for each end user. Even if they have more than one host.
Correct.
In Fleet, a host can have many emails (end users) associated with it.
Fleet will filter this list of emails by emails w/ the matching domain configured by the IT admin (see where the IT admin will configure the domain in Figma here)
cc @getvictor
cc @rachaelshaw ^^ (forgot to @ mention you)
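For reference, the domain filter described above could look something like this. It is a sketch, not Fleet's implementation: it assumes entries shaped like the `device_mapping` API response (`email` and `source` keys), and it returns every match in order because the thread does not say which email wins when several match.

```python
def calendar_candidates(device_mapping, domain):
    """Return the host's emails whose domain matches the configured one."""
    suffix = "@" + domain.lower()
    matches = []
    for entry in device_mapping:
        email = entry["email"].strip().lower()
        if email.endswith(suffix) and email not in matches:
            matches.append(email)
    return matches


if __name__ == "__main__":
    mapping = [
        {"email": "user@example.com", "source": "identity_provider"},
        {"email": "user@personal.example.org", "source": "google_chrome_profiles"},
    ]
    print(calendar_candidates(mapping, "example.com"))
```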
Thanks @noahtalerman.
@getvictor you made me realize we could simplify the current plan:
Instead of only scheduling calendar events on Monday (after first enabling), if the end user doesn't have a calendar event, we always schedule one on the upcoming Friday.
Here's how that would look:
This makes the experience more consistent for the end user and IT admin. The expectation becomes, once the end user starts failing one or more policies (no matter what day it is), the calendar event is going to show up on Friday.
@rachaelshaw and Victor, what do y'all think?
@getvictor this makes sense to me.
If we removed the calendar event after the event has started, then I think I would be confused as an end user. Did the IT team do its thing?
Sorry to barge in here... maybe I am misunderstanding the intent.
My opinion is that Friday is not a great day to use as a default. Lots of orgs:
- don't do a lot on Fridays
- don't have people come in on Fridays
- have specific policies against doing important things on Fridays
- this includes NOT doing stuff to people's computers
- why? because if something goes wrong it means someone will have to work on the weekend
- Microsoft Patch Tuesday is a thing for a reason...
Ideally, the feature should allow admins to pick their default / starting day.
Thanks.
I did not go looking for this article... It's # 1 on Hacker News: https://deploybot.com/blog/no-deployments-on-fridays-a-good-practice-for-software-development-teams
@getvictor added copy for the error message if someone tries to configure >1 service account: https://www.figma.com/file/p81nWodxL04YD7iNyr9xVa/%2317230-Fleet-in-your-calendar?type=design&node-id=550%3A12396&mode=design&t=q38NTaiMoLDwN89S-1
Hey @sharon-fdm, should we move this to the release board and assign an engineer?
Looks like this user story is estimated and being worked on (all subtasks are on the release board).
cc @rachaelshaw
@noahtalerman Yes. We forgot to move the top story. THX! Done.
Hey @mostlikelee I pulled your Figma comment below.
Please add questions/comments in the GitHub issue here so they're all in one place and easy to find :)
Yes, I think that's the plan.
My understanding is that the host will be offline (no webhook) => cleanup job runs and removes the end user's calendar event => "failing policies check/calendar event" job runs and creates a new event for the next month (3rd Tues).
@getvictor please correct me if I'm wrong.
@noahtalerman @mostlikelee @lucasmrod
The job runs every 5 minutes, which means it will run 5-6 times during the actual meeting time, and we will check if the host is online every time. (Note: We should only check Google Calendar once, and not 5-6 times, keeping with our 30-minute cadence.)
Then, when the job runs the 7th time, the event is in the past now, but it is still the 3rd Tue (assuming the event started at 9 am). So we should schedule it for the 3rd Tue a month from now. If we schedule it for the next day, we risk spamming the calendar and annoying users.
This differs from the deletion flow, where the user deliberately deletes the event. In that case we will reschedule for the next business day (3rd Wed, Thu, Fri, Mon, etc.)
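To make the two rescheduling rules concrete, here is a small date sketch. The function names are hypothetical, weekends are the only non-business days considered (holidays are ignored), and picking a time slot within the day is out of scope.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday(): Monday is 0


def third_tuesday(year, month):
    """Return the 3rd Tuesday of the given month."""
    first = date(year, month, 1)
    days_to_first_tuesday = (TUESDAY - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 14)


def next_third_tuesday(today):
    """Return the first 3rd Tuesday strictly after `today`."""
    candidate = third_tuesday(today.year, today.month)
    if candidate > today:
        return candidate
    if today.month == 12:
        return third_tuesday(today.year + 1, 1)
    return third_tuesday(today.year, today.month + 1)


def next_business_day(day):
    """Return the next Monday-to-Friday date after `day`."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # skip Saturday and Sunday
        nxt += timedelta(days=1)
    return nxt
```

With these, an event that ran on Tuesday 2024-03-19 is rescheduled to 2024-04-16, while an event deleted that same day is rescheduled to Wednesday 2024-03-20.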
Calendar Email Address:
I believe we'll attempt to get the user's email address from host_emails,
but a host can have multiple emails associated with it. Which one will we choose?
From the API docs for GET /api/v1/fleet/hosts/1/device_mapping:
{
"host_id": 1,
"device_mapping": [
{
"email": "[email protected]",
"source": "identity_provider"
},
{
"email": "[email protected]",
"source": "google_chrome_profiles"
},
{
"email": "[email protected]",
"source": "custom"
}
]
}
Is refetching the host before triggering the webhook a must?
- Refetching is an expensive (DB write) operation (if done at the same time by thousands of hosts).
- Policy results are refreshed every 1 hour.
How much of a common case is a user fixing its policies 1 hour before the calendar event? Is it a problem if a webhook is triggered on a healthy host (because the policies were fixed e.g. 30m before the calendar event)?
@getvictor
We believe it will be common for users to fix their policy issues right before the event starts. So the user must also know to refresh their host on the My Device page for the DB to be up to date.
If we fire webhook on a healthy host, then IT may restart their computer for no reason.
Another option is to adjust the policy refresh time ahead of time, so that it is done right before the event.
And then we can also look when policy results were refreshed, and not refetch if it was done within last 5-10 minutes.
It is true that if we have 10K+ hosts with 9am calendar events, then we may be overloaded. Perhaps we should spread out the calendar events?
cc: @noahtalerman
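The "skip the refetch if results are fresh" idea could be sketched like this. The 10-minute window is the upper end of the 5-10 minutes suggested above, and the `policy_updated_at` field name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(minutes=10)


def hosts_needing_refetch(hosts, now, fresh_window=FRESH_WINDOW):
    """Return IDs of hosts whose policy results are older than fresh_window."""
    return [
        host["id"]
        for host in hosts
        if now - host["policy_updated_at"] > fresh_window
    ]


if __name__ == "__main__":
    now = datetime(2024, 3, 19, 9, 0, tzinfo=timezone.utc)
    hosts = [
        {"id": 1, "policy_updated_at": now - timedelta(minutes=3)},
        {"id": 2, "policy_updated_at": now - timedelta(minutes=45)},
    ]
    print(hosts_needing_refetch(hosts, now))
```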
Plus refetching means (if we implement to refetch only policies):
Set a refetch flag for policies, then wait 10s (or whatever the distributed_interval may be, some customers have this be a higher value, e.g. 60s) for the host to have pushed the new results of the policy queries. So for 10k hosts at 10s distributed interval this job might take 27 hours? (Or if we do a refetch in batch it might be an expensive operation on the DB writer).
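The 27-hour figure is the worst case where hosts are refetched strictly one after another, each waiting a full distributed interval. A quick check of that arithmetic (the function name is just for illustration):

```python
def sequential_refetch_hours(host_count, distributed_interval_s):
    """Hours needed if each host waits one full interval, one host at a time."""
    return host_count * distributed_interval_s / 3600


if __name__ == "__main__":
    print(round(sequential_refetch_hours(10_000, 10), 1))  # 10s interval
    print(round(sequential_refetch_hours(10_000, 60), 1))  # 60s interval
```

10,000 hosts at 10s each is 100,000 seconds, a little under 28 hours. At a 60s interval the same sequential job would take roughly six times as long, which is why batching or spreading out the events matters.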
@noahtalerman If user modifies the event to an all-day event, we will treat it as deleted -- it is a bit of a hassle trying to support this since we need the user's timezone, and the user can change the timezone of their calendar without actually modifying the event.
@lucasmrod If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time? cc: @noahtalerman
If user modifies the event to an all-day event, we will treat it as deleted
@getvictor if I'm understanding, in this case, we'll schedule a new event on top of the all-day event?
Like this:
I think that's ok.
If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time?
@getvictor we could also treat it as deleted. What's easier?
If user modifies the event to an all-day event, we will treat it as deleted
@getvictor if I'm understanding, in this case, we'll schedule a new event on top of the all-day event?
Like this:
I think that's ok.
We'll schedule the event for the next day from the originally scheduled time.
If the calendar event length is 0 minutes (same start/end time), can you consider it started if the start time is within the 5-minute cron time?
@getvictor we could also treat it as deleted. What's easier?
Treating it as deleted is easier. But on my calendar, a 0-minute event looks the same as a 30-minute event. So, it would seem confusing why another event was created if one is already on the schedule.
@noahtalerman @lucasmrod If the event has started, should we refetch it from Google calendar to make sure it wasn't moved/deleted at the last minute? Currently, we only refetch every 30 minutes.
Treating it as deleted is easier.
@getvictor I think we can treat it as deleted for now. As long as we're not closing the door on changing this behavior later.
If the event has started, should we refetch it from Google calendar to make sure it wasn't moved/deleted at the last minute?
@getvictor I think yes. Imagine Mike has an "oh shoot" moment 15 mins before the board meeting starts so he moves the event.
@noahtalerman @lucasmrod What if a person has multiple hosts on different teams? It seems like they should have 2 calendar events (but they can be simultaneous) because different webhooks need to be fired. Or do you prefer 1 event, and we fire 2+ webhooks in the backend? I think we can leave handling this for the next version, but we might want to do some DB changes/assumptions now.
What if a person has multiple hosts on different teams?
@getvictor I think we'll want one calendar event per user. One 30 minute downtime event to autoremediate all my devices.
From customer conversation:
- [ ] Update policy automations docs
- [ ] Permissions changes: TODO - add "Manage access" page changes to draft PR
@noahtalerman, reminder to update the docs w/ link to videos to set up Fleet in your calendar: https://www.loom.com/share/9fbdff2998be4877b95ec6702c6c062c?sid=6602e703-aa5a-450a-b092-b5d28eb6e311
- [ ] Update policy automations docs
TODO: Worth doing a scan of the policy automations docs to see if there's now outdated language (ex. reference to outdated UI elements).
TODO: What would happen if you enable calendar automations for a "No team" policy? Add something to GitOps that adds a global policy w/ automations enabled.
- [ ] Permissions changes: TODO - add "Manage access" page changes to draft PR
TODO: If the permissions for choosing which policies trigger calendar events are the same as for choosing which policies fire tickets/create webhooks, then let's use the same line in the Manage access table. If the permissions are different, break out a new line.
@rachaelshaw when you get the chance, can you please take these on? Thanks!