Add fleetd tables to query fleetd logs

Open dantecatalfamo opened this issue 1 year ago • 7 comments

Goal

User story
As a Fleet contributor debugging fleetd issues,
I want to ask a customer to run a query to collect fleetd (Orbit, Fleet Desktop, or osquery) logs
so that I can get logs w/o asking the customer to do file carves or contact the end user to send them log files.

Context

  • Requestor(s): @dantecatalfamo
  • Product designer: @noahtalerman

Changes

Engineering

  • [ ] Database schema migrations: TODO
  • [ ] Load testing: TODO
  • [ ] Write documentation.
    • Could be contributor docs, but link to it from the agent docs for folks who have their own update server.

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

dantecatalfamo avatar Apr 12 '24 15:04 dantecatalfamo

Thanks for tracking this @dantecatalfamo!

Looks like this one requires a product change (new fleetd table), so let's bring it through feature fest.

cc @lukeheath

noahtalerman avatar Apr 18 '24 14:04 noahtalerman

Hey @dantecatalfamo, now that this story is in the current design sprint, I updated this issue to use the user story format.

Let me know if you have any feedback on the user story.

I moved your original issue description here for safe keeping:

Problem

Currently, when there is a bug with fleetd, we rely on file carving to retrieve the information from clients, which is both slow and not always possible, depending on the platform and launch options.

Potential solutions

I propose an internal logging ring buffer that we write to at the same time as the log file and stdout. This ring buffer could be exposed using a virtual table so that it could be retrieved easily using osquery, with the possibility to filter the lines using WHERE LIKE clauses to limit the number of results returned.

This internal ring buffer could be limited in size to 10,000 entries to cap its memory usage.

This would be a huge help in debugging clients in the wild, where access to logs is not guaranteed, and often cumbersome even when possible. This solution also better preserves the privacy of clients, as it doesn't rely on reading arbitrary files from the operating system.

A limitation of this approach is that it relies on the logs being stored in memory, so only the logs produced during the current session will be available, limiting its usefulness for bugs where the entire client crashes. Most bugs are not in this category, however.

Instead of writing logs to an additional location, could we add a table that can query the existing log files? I'm thinking of something akin to the file_lines table, specifically for traversing fleetd log files.

That way, we avoid storing the logs in two separate places so there's one best practice location to look when debugging fleetd.

noahtalerman avatar Apr 21 '24 19:04 noahtalerman

Hi @noahtalerman,

Yes, there are some reasons I don't want to query the existing log files.

  1. Some hosts don't write their own logs. Linux hosts will often write logs to stdout and let the system logger aggregate them
  2. Writing them to a DB will let us use a proper timestamp column, allowing us to more easily do things like check for logs within a period, for example the last 2 hours
  3. Some log messages are multiline. If a message is 4 lines long and part of it matches, we want the entire log message and not just the line that contains the match
  4. Because log messages can be multiline, not all lines will begin with a timestamp. We won't be able to reliably parse the timestamp at the beginning of the line

Writing to a SQLite database would be a persistent alternative to a ring buffer as well

dantecatalfamo avatar Apr 22 '24 16:04 dantecatalfamo

@dantecatalfamo We're bringing this one back to feature fest. We didn't get to it in the current design sprint.

marko-lisica avatar May 09 '24 15:05 marko-lisica

On second thought, I think this story meets the definition of an engineering initiated story.

@lukeheath heads up, I'm removing ~feature fest and adding engineering-initiated.

noahtalerman avatar May 09 '24 18:05 noahtalerman

FYI @dantecatalfamo ^^

noahtalerman avatar May 09 '24 18:05 noahtalerman

@dantecatalfamo Thanks for filing this! This seems like an easy customer service win (@Patagonia121 @nonpunctual) so I'm prioritizing for estimation and to be considered for the next sprint.

lukeheath avatar May 09 '24 21:05 lukeheath

Hey team! Please add your planning poker estimate with Zenhub @gillespi314 @ghernandez345 @roperzh @mna @jahzielv

georgekarrv avatar May 29 '24 16:05 georgekarrv

I validated the table was working as expected. Planned for release with Fleetd 1.26.0.

xpkoala avatar Jun 12 '24 16:06 xpkoala

Debug, no hassle, Queries collect fleetd logs, Work flows like a castle.

fleet-release avatar Jun 12 '24 22:06 fleet-release

Debugging's made light With fleetd tables in hand, Cloud logs in plain sight.

fleet-release avatar Jun 14 '24 00:06 fleet-release

Sorry that was a GitHub iPad app booboo. Did not mean to reopen

nonpunctual avatar Jun 14 '24 00:06 nonpunctual