Best practice Fleet upgrades: taking the server down

Open lucasmrod opened this issue 1 year ago • 6 comments

Goal

User story
As a Fleet administrator who works with other stakeholders who use Fleet,
I want to understand how taking the Fleet server down during upgrades affects my stakeholders' downstream use cases
so that I can follow Fleet's best practice: take the Fleet server down during upgrades.

Changes

Product

  • [x] Google Doc that has instructions on the following:
    1. How to configure the fleetd agent so that there's no data loss when taking the Fleet server down.
    • The customer is concerned about losing result logs because they don't know whether their hosts will hit the buffer limit. The customer gave us the average amount of data collected for each host over 30 minutes, which is how long we estimate the Fleet server will be down.
    2. How to configure the Fleet server so that it doesn't fall over when the server is brought back online.
    • If the customer's configuration works, then we just say that we load tested and confirmed it. If the configuration doesn't work, add instructions on how to configure the Fleet server so that it performs well.
    • The customer is concerned about the Fleet server's performance when all agents start to check in and send logs at once (thundering herd).
    3. How to suppress fleetd agent logs/errors when the Fleet server is down.
    • We already built an experimental feature to suppress logs for this customer (FLEETD_SILENCE_ENROLL_ERROR). If we need to make changes, try to use this existing variable.
    • The customer is concerned that stakeholders whose workflows ingest errors/logs from the hosts will get a lot of noise.
  • [x] fleetd changes: Update the existing FLEETD_SILENCE_ENROLL_ERROR environment variable to suppress fleetd (Orbit and osquery) logs/errors when the Fleet server is down.
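
The buffer-limit concern in item 1 comes down to simple arithmetic, sketched below in Python. This is a rough illustration, not the method used in the Google Doc. It assumes the customer's per-host figure can be expressed as a count of result log lines, and that the agent's limit is a maximum number of buffered log lines (osquery exposes this as the `buffered_log_max` flag; treating 0 as "unlimited" is an assumption to verify against the osquery docs). The function names and all numbers are hypothetical.

```python
def estimated_buffered_logs(logs_per_host_per_30_min: float,
                            downtime_minutes: float) -> float:
    """Scale the customer's 30-minute per-host average to the expected downtime."""
    return logs_per_host_per_30_min * downtime_minutes / 30


def fits_in_buffer(buffered_logs: float, buffered_log_max: int) -> bool:
    """Return True if the backlog stays within the agent's buffer limit.

    Assumption: a limit of 0 means the buffer is unbounded.
    """
    return buffered_log_max == 0 or buffered_logs <= buffered_log_max


if __name__ == "__main__":
    # Hypothetical numbers: 600 result log lines per host per 30 minutes,
    # and a 45-minute outage to leave headroom over the 30-minute estimate.
    backlog = estimated_buffered_logs(600, 45)
    print(backlog, fits_in_buffer(backlog, buffered_log_max=1000))
```

If the check fails for the busiest hosts rather than the average host, the limit needs to be raised, so it is worth running the same arithmetic with a worst-case per-host figure as well.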

Engineering

  • [ ] Database schema migrations: TODO
  • [ ] Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

  • Requestor(s): _________________________

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

lucasmrod avatar Jan 29 '24 19:01 lucasmrod

TODO: improve this ticket.

sharon-fdm avatar Jan 29 '24 19:01 sharon-fdm

What we want: (1) make sure the server can start after 30 minutes of downtime; (2) make sure there is no data loss on the agent.

sharon-fdm avatar Jan 29 '24 19:01 sharon-fdm

Original issue description is here:

Goal

User story
As a Fleet user, I would like to determine the impact of Fleet downtime on my deployment and the results of my scheduled queries.

  1. Make sure the server can start after 30 minutes of downtime. How does Fleet perform (thundering herd)?
  2. Make sure there is no data loss on the agent. Is the agent able to buffer all the data?
  3. For hosts that try to enroll when the Fleet server is down, how often does the fleetd agent log errors (status logs)? Every minute? Every hour? This way, the customer can set expectations for partner teams on how often they'll see alerts.
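
For a rough feel of the thundering-herd question, here is a back-of-the-envelope Python sketch of how long one host might take to drain its backlog and how many log requests per second the server might see while every host is draining. It is not the load test itself. It assumes each host sends at most a fixed number of buffered lines per flush on a fixed period (osquery's `logger_tls_max_lines` and `logger_tls_period` flags control this), and it ignores new logs produced while draining. The function names and all numbers are hypothetical.

```python
import math


def flushes_needed(backlog_lines: int, max_lines_per_flush: int) -> int:
    """Number of log flushes a host needs to send its whole backlog."""
    return math.ceil(backlog_lines / max_lines_per_flush)


def drain_seconds(backlog_lines: int, max_lines_per_flush: int,
                  flush_period_seconds: int) -> int:
    """Approximate time for one host to drain its backlog, one flush per period."""
    return flushes_needed(backlog_lines, max_lines_per_flush) * flush_period_seconds


def log_requests_per_second(host_count: int, flush_period_seconds: int) -> float:
    """Fleet-wide log request rate while all hosts flush once per period."""
    return host_count / flush_period_seconds


if __name__ == "__main__":
    # Hypothetical deployment: 10,000 hosts, 5,000 buffered lines each,
    # up to 1,024 lines per flush, one flush every 10 seconds.
    print(drain_seconds(5000, 1024, 10))
    print(log_requests_per_second(10_000, 10))
```

The request rate in this model depends only on host count and flush period, so a longer outage mainly lengthens the drain time rather than raising the peak. A real load test is still needed because enrollment, config, and distributed-query requests arrive at the same time.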

noahtalerman avatar Feb 05 '24 17:02 noahtalerman

I'm working on the recommendations here (WIP): https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit

lucasmrod avatar Feb 06 '24 16:02 lucasmrod

Recommendations document for configuring osquery and Fleet to prevent data loss: https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit

lucasmrod avatar Feb 16 '24 14:02 lucasmrod

Deliverables

Fleet's downtime load test

https://docs.google.com/document/d/1J3iF_9ayZtUcrSVqffDeRtgUMgJmJbhfS8H_qRq4ggM/edit

Recommendations for osquery configuration

https://docs.google.com/document/d/1gDbV34VKohU6nzFBYJrtfPfcAWiZK-Smk-1JWCuY3Zs/edit#heading=h.lx4iybtnpj6s

fleetd

The fleetd changes to reduce log noise during Fleet's downtime have been released with orbit 1.22.0.

lucasmrod avatar Mar 04 '24 16:03 lucasmrod

@lucasmrod I read and approve both docs above. Please make sure these documents reach customer success ( @ksatter and/or @Patagonia121 ). You can close the ticket once the customer gets the docs (no need for QA).

sharon-fdm avatar Mar 05 '24 21:03 sharon-fdm

Results were shared with the customer. Closing.

sharon-fdm avatar Mar 06 '24 16:03 sharon-fdm

Server down, no fear, Fleet upgrades flow like a stream, Silence, peace restored.

fleet-release avatar Mar 06 '24 16:03 fleet-release

@sharon-fdm This should go to the "Ready for Release" column so it can be closed out through our normal ritual. Re-opening.

lukeheath avatar Mar 11 '24 16:03 lukeheath

As per Sharon's comment here, the results were shared w/ the customer.

Public Google doc is at the top of the issue description.

noahtalerman avatar Mar 14 '24 18:03 noahtalerman

Server's downtime pause, No data loss, silence calls, Fleet upgrades with cause.

fleet-release avatar Mar 14 '24 18:03 fleet-release