Best practice Fleet upgrades: taking the server down
Goal
User story |
---|
As a Fleet administrator who works w/ other stakeholders who use Fleet, |
I want to understand how taking the Fleet server down during upgrades affects my stakeholders' downstream use cases |
so that I can follow Fleet's best practice: take the Fleet server down during upgrades. |
Changes
Product
- [x] Google Doc that has instructions on the following:
  - How to configure the fleetd agent so that there's no data loss when taking the Fleet server down.
    - The customer is concerned about losing result logs because they don't know if their hosts will hit the buffer limit. The customer gave us the average amount of data collected for each host over 30 minutes, which is how long we estimate the Fleet server will be down.
  - How to configure the Fleet server so that it doesn't fall over when it's brought back online.
    - If the customer's configuration works, then we just say we load tested and confirmed. If the configuration doesn't work, add instructions on how to configure the Fleet server so that it performs well.
    - The customer is concerned about the Fleet server's performance when all agents start to check in and send logs (thundering herd).
  - How to suppress fleetd agent logs/errors when the Fleet server is down.
    - We already built an experimental feature to suppress logs for this customer (`FLEETD_SILENCE_ENROLL_ERROR`). If we need to make changes, try to use this existing variable.
    - The customer is concerned that stakeholders who have workflows that ingest errors/logs from the host will get a lot of noise.
- [x] fleetd changes: Update the existing `FLEETD_SILENCE_ENROLL_ERROR` environment variable to suppress fleetd (Orbit and osquery) logs/errors when the Fleet server is down.
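The buffer-limit concern above comes down to simple arithmetic: scale the customer's per-host average to the expected downtime and compare it with the agent's buffer limit. The sketch below is only an illustration. It assumes the limit is counted in log lines (as osquery's `buffered_log_max` flag is documented to do, with 0 meaning unlimited), and the function names and sample numbers are hypothetical, not taken from the customer's data.

```python
import math


def estimated_buffered_logs(logs_per_window: float,
                            window_minutes: float,
                            downtime_minutes: float) -> int:
    """Scale an observed per-host log count to the expected downtime.

    logs_per_window: average result-log lines one host produces in the
    measurement window (hypothetical input, e.g. the 30-minute average).
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return math.ceil(logs_per_window * downtime_minutes / window_minutes)


def fits_in_buffer(estimated_logs: int, buffer_limit: int) -> bool:
    """Return True if the estimate stays within the buffer limit.

    Assumption: a limit of 0 means "unlimited", mirroring how osquery
    documents buffered_log_max.
    """
    return buffer_limit == 0 or estimated_logs <= buffer_limit


# Hypothetical example: 12,000 lines per host per 30 minutes,
# with 45 minutes of downtime as a safety margin.
estimate = estimated_buffered_logs(12_000, 30, 45)
print(estimate, fits_in_buffer(estimate, 1_000_000))
```

Running the estimate with a downtime longer than the planned 30 minutes gives some headroom in case the upgrade overruns.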
Engineering
- [ ] Database schema migrations: TODO
- [ ] Load testing: TODO
ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".
Context
- Requestor(s): _________________________
QA
Risk assessment
- Requires load testing: TODO
- Risk level: Low / High TODO
- Risk description: TODO
Manual testing steps
- Step 1
- Step 2
- Step 3
Testing notes
Confirmation
- [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
- [ ] QA (@____): Added comment to user story confirming successful completion of QA.
TODO: improve this ticket.
What we want:
1. Make sure the server can start after 30 minutes of downtime.
2. Make sure there is no data loss on the agent.
Original issue description is here:
Goal
User story |
---|
As a Fleet user I would like to determine the impact of Fleet downtime on my deployment and the results of my scheduled queries |
1. Make sure the server can start after 30 minutes of downtime. How does Fleet perform? (thundering herd)
2. Make sure there is no data loss on the agent. Is the agent able to buffer all the data?
3. For hosts that try to enroll when the Fleet server is down, how often does the fleetd agent log errors (status logs)? Every minute? Every hour? This way, the customer can set expectations for partner teams on how often they'll see alerts.
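For the first two questions, a rough back-of-the-envelope model can help frame what the load test should look for: how long a host needs to flush its backlog, and how many log requests per second the server sees once agents reconnect. The sketch below assumes each host sends at most one batch of a fixed maximum size per flush period (roughly what osquery's `logger_tls_max_lines` and `logger_tls_period` flags control) and that hosts are spread evenly across the period. Both assumptions, the function names, and the numbers are illustrative only, not measured results.

```python
import math


def drain_cycles(backlog_lines: int, max_lines_per_flush: int) -> int:
    """Number of flushes one host needs to send its whole backlog."""
    if max_lines_per_flush <= 0:
        raise ValueError("max_lines_per_flush must be positive")
    return math.ceil(backlog_lines / max_lines_per_flush)


def drain_seconds(backlog_lines: int,
                  max_lines_per_flush: int,
                  flush_period_s: float) -> float:
    """Time for one host to drain its backlog, one batch per period."""
    return drain_cycles(backlog_lines, max_lines_per_flush) * flush_period_s


def steady_requests_per_second(host_count: int, flush_period_s: float) -> float:
    """Log requests per second if hosts are evenly spread over the period."""
    if flush_period_s <= 0:
        raise ValueError("flush_period_s must be positive")
    return host_count / flush_period_s


# Hypothetical example: 18,000 buffered lines per host, 1,024 lines
# per flush, a 10-second flush period, and 50,000 hosts.
print(drain_seconds(18_000, 1024, 10))
print(steady_requests_per_second(50_000, 10))
```

In a real thundering-herd scenario the hosts are not evenly spread right after the server returns, so the actual peak can be higher than this steady-state figure. That gap is what the load test is meant to measure.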
I'm working on the recommendations here (WIP): https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit
Recommendations document for configuring osquery and Fleet to prevent data loss: https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit
Deliverables
Fleet's downtime load test
https://docs.google.com/document/d/1J3iF_9ayZtUcrSVqffDeRtgUMgJmJbhfS8H_qRq4ggM/edit
Recommendations for osquery configuration
https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit#heading=h.lx4iybtnpj6s
fleetd
The fleetd changes to reduce log noise during Fleet's downtime have been released with orbit 1.22.0.
@lucasmrod I read and approve both docs above. Please make sure these documents reach customer success ( @ksatter and/or @Patagonia121 ). You can close the ticket once the customer gets the docs (no need for QA).
Results were shared with the customer. Closing.
Server down, no fear, Fleet upgrades flow like a stream, Silence, peace restored.
@sharon-fdm This should go to the "Ready for Release" column so it can be closed out through our normal ritual. Re-opening.
As per Sharon's comment here, the results were shared w/ the customer.
Public Google doc is at the top of the issue description.
Server's downtime pause, No data loss, silence calls, Fleet upgrades with cause.