Goal

User story
As a Fleet administrator who works w/ other stakeholders who use Fleet,
I want to understand how taking the Fleet server down during upgrades affects my stakeholders' downstream use cases
so that I can follow Fleet's best practice: take the Fleet server down during upgrades.

Changes

Product

[x] Google Doc that has instructions on the following:
1. How to configure the fleetd agent so that there's no data loss when taking the Fleet server down.
- The customer is concerned about losing result logs because they don't know if their hosts will hit the buffer limit. The customer gave us the average amount of data that's collected for each host over 30 mins. This is the time we estimate that the Fleet server will be down.
1. How to configure the Fleet server so that it doesn't fall over when the server is brought back online.
- If the customer's configuration works, then we just say we load tested and confirmed. If the configuration doesn't work, add instructions on how to configure the Fleet server so that it performs well.
- The customer is concerned about the Fleet server's performance when all agents start to check in and send logs at once (thundering herd).
1. How to suppress fleetd agent logs/errors when the Fleet server is down.
- We already built an experimental feature to suppress logs for this customer (FLEETD_SILENCE_ENROLL_ERROR). If we need to make changes, try to use this existing variable.
- The customer is concerned that stakeholders whose workflows ingest errors/logs from the host will get a lot of noise.
[x] fleetd changes: Update the existing FLEETD_SILENCE_ENROLL_ERROR environment variable to suppress fleetd (Orbit and osquery) logs/errors when the Fleet server is down.
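
A minimal sketch of the buffer-limit check described in item 1 above, assuming the customer's per-host average is expressed as result log lines per 30 minutes and that osquery's `buffered_log_max` flag is the limit that matters. The function names, the 50% headroom, and all numbers are hypothetical and do not come from the customer's data or the recommendations doc.

```python
import math


def buffered_logs_during_downtime(logs_per_window, window_minutes, downtime_minutes):
    """Estimate how many log lines one host accumulates while the server is down.

    Scales the observed average (logs_per_window over window_minutes)
    linearly to the expected downtime.
    """
    return math.ceil(logs_per_window * downtime_minutes / window_minutes)


def fits_in_buffer(expected_logs, buffered_log_max, headroom=0.5):
    """Return True if the expected backlog stays under a fraction of the buffer.

    A buffered_log_max of 0 is treated as unlimited. The headroom factor is
    an assumption of this sketch, meant to absorb hosts that are noisier
    than the average.
    """
    if buffered_log_max == 0:
        return True
    return expected_logs <= buffered_log_max * headroom


# Hypothetical input: 12,000 lines per host per 30 minutes, 45 minutes of downtime.
expected = buffered_logs_during_downtime(
    logs_per_window=12_000, window_minutes=30, downtime_minutes=45
)
print(expected)  # 18000
print(fits_in_buffer(expected, buffered_log_max=1_000_000))  # True
```

Because the input is an average, the noisiest hosts are the ones most likely to hit the limit, so the headroom factor should be set from the worst host rather than the mean.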

Engineering

[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

Requestor(s): _________________________

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

Jan 29 '24 19:01 lucasmrod

TODO: improve this ticket.

Jan 29 '24 19:01 sharon-fdm

What we want: 1 - make sure the server can start after 30 minutes of downtime 2 - make sure there is no data loss on the agent

Jan 29 '24 19:01 sharon-fdm

Original issue description is here:

Goal

User story
As a Fleet user I would like to determine the impact of Fleet downtime on my deployment and the results of my scheduled queries

1 - make sure the server can start after 30 minutes of downtime. How does Fleet perform? (thundering herd) 2 - make sure there is no data loss on the agent. Is the agent able to buffer all the data? 3 - for hosts that try to enroll when the Fleet server is down, how often does the fleetd agent log errors? (status logs) Every minute? Every hour? This way, the customer can set expectations for partner teams on how often they'll see alerts.
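
A rough sketch of question 1 (thundering herd), assuming each host drains its backlog in batches of at most `logger_tls_max_lines` every `logger_tls_period` seconds (both are osquery flags) and that hosts are spread evenly across the period. The function names and all numbers are hypothetical. This is back-of-the-envelope arithmetic, not a substitute for the load test.

```python
import math


def drain_seconds(backlog_lines, max_lines_per_flush, flush_period_seconds):
    """Estimate how long one host needs to flush its backlog after the server returns."""
    flushes = math.ceil(backlog_lines / max_lines_per_flush)
    return flushes * flush_period_seconds


def log_requests_per_second(host_count, flush_period_seconds):
    """Estimate steady log request rate while every host is still draining.

    Assumes hosts are evenly spread across the flush period. A real restart
    can be burstier if many hosts retry at the same moment.
    """
    return host_count / flush_period_seconds


# Hypothetical input: 18,000 buffered lines per host, 1,024 lines per flush,
# a 10 second flush period, and 50,000 hosts.
print(drain_seconds(18_000, 1_024, 10))       # 180
print(log_requests_per_second(50_000, 10))    # 5000.0
```

The two outputs give a starting point for sizing a load test: how long the elevated traffic lasts per host, and roughly how many log requests per second the server sees while hosts drain.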

Feb 05 '24 17:02 noahtalerman

I'm working on the recommendations here (WIP): https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit

Feb 06 '24 16:02 lucasmrod

Recommendations document for configuring osquery and Fleet to prevent data loss: https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit

Feb 16 '24 14:02 lucasmrod

Deliverables

Fleet's downtime load test

https://docs.google.com/document/d/1J3iF_9ayZtUcrSVqffDeRtgUMgJmJbhfS8H_qRq4ggM/edit

Recommendations for osquery configuration

https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit#heading=h.lx4iybtnpj6s

fleetd

The fleetd changes to reduce log noise during Fleet's downtime have been released with orbit 1.22.0.

Mar 04 '24 16:03 lucasmrod

@lucasmrod I read and approve both docs above. Please make sure these documents reach customer success ( @ksatter and/or @Patagonia121 ). You can close the ticket once the customer gets the docs (no need for QA).

Mar 05 '24 21:03 sharon-fdm

Results were shared with the customer. Closing.

Mar 06 '24 16:03 sharon-fdm

Server down, no fear, Fleet upgrades flow like a stream, Silence, peace restored.

Mar 06 '24 16:03 fleet-release

@sharon-fdm This should go to the "Ready for Release" column so it can be closed out through our normal ritual. Re-opening.

Mar 11 '24 16:03 lukeheath

As per Sharon's comment here, the results were shared w/ the customer.

Public Google doc is at the top of the issue description.

Mar 14 '24 18:03 noahtalerman

Server's downtime pause, No data loss, silence calls, Fleet upgrades with cause.

Mar 14 '24 18:03 fleet-release


Best practice Fleet upgrades: taking the server down
