Provide Munki issues for macOS hosts
Goal
As a Munki administrator, I want the ability to see the most common Munki issues so that I can prioritize resolving the issues that impact the most hosts.
I also want to be able to see which macOS hosts have these issues so that I know which hosts need issues resolved.
Figma
https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=7647%3A273670
Related
- #6430 (epic)
- #6931 (frontend)
Tasks
Roles:
- This card is visible to all user roles.
- This card is only visible if "macOS" platform is selected.
1
- [ ] Pull more data from the macadmins osquery extension's `munki_info` table.
  - We want to gather and store "errors" and "warnings" in the Fleet database.
  - The following osquery queries are used to obtain error and warning information.
  - Each issue `name` should be a unique row in the database with an associated `id`. We will want to retrieve all hosts that have a given Munki issue `name`/`id`, so a pivot table will likely be necessary.
Errors:

```sql
SELECT errors FROM munki_info WHERE errors != '';
```

Warnings:

```sql
SELECT warnings FROM munki_info WHERE warnings != '';
```

- These queries each return one result even if there are multiple errors or warnings.
- This result is a string in which a semicolon separates each error or warning.
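As a sketch of the pivot-table idea above (table and column names here are hypothetical, not Fleet's actual schema), the semicolon-joined osquery result could be split and stored like this, with one unique row per issue `name`/`type` and a pivot table linking hosts to issues:

```python
import sqlite3

# Hypothetical schema sketch; names are illustrative only, not Fleet's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE munki_issues (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,            -- 'error' or 'warning'
    UNIQUE (name, type)
);
CREATE TABLE host_munki_issues (   -- pivot table: host <-> issue
    host_id  INTEGER NOT NULL,
    issue_id INTEGER NOT NULL,
    PRIMARY KEY (host_id, issue_id)
);
""")

def store_issues(host_id, raw, issue_type):
    """Split the semicolon-joined osquery result and link each issue to the host."""
    for name in filter(None, (m.strip() for m in raw.split(";"))):
        conn.execute(
            "INSERT OR IGNORE INTO munki_issues (name, type) VALUES (?, ?)",
            (name, issue_type),
        )
        (issue_id,) = conn.execute(
            "SELECT id FROM munki_issues WHERE name = ? AND type = ?",
            (name, issue_type),
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO host_munki_issues (host_id, issue_id) VALUES (?, ?)",
            (host_id, issue_id),
        )

store_issues(1, "Could not retrieve managed install primary manifest;Some other error", "error")
pivot_rows = conn.execute("SELECT COUNT(*) FROM host_munki_issues").fetchone()[0]
```

Counting hosts per issue for the aggregated endpoint then becomes a simple `GROUP BY issue_id` over the pivot table.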
2
- [ ] Add new `munki_issues` array to the `GET /api/v1/fleet/macadmins` response with aggregated counts.
  - Continue to support filtering results by `team_id`. Example: `GET /api/v1/fleet/macadmins?team_id=3`
Example response
```json
{
  "macadmins": {
    "counts_updated_at": "2022-08-01T05:09:44Z",
    "munki_issues": [
      {
        "id": 1,
        "name": "Could not retrieve managed install primary manifest",
        "type": "error",
        "hosts_count": 2851
      },
      {
        "id": 2,
        "name": "Could not process item Figma for optional install. No pkginfo found in catalogs: release",
        "type": "warning",
        "hosts_count": 1983
      },
      ...
    ],
    "munki_versions": [
      {
        "version": "5.5",
        "hosts_count": 8360
      },
      {
        "version": "5.4",
        "hosts_count": 1700
      },
      {
        "version": "5.3",
        "hosts_count": 400
      },
      {
        "version": "5.2.3",
        "hosts_count": 112
      },
      {
        "version": "5.2.2",
        "hosts_count": 50
      }
    ],
    "mobile_device_management_enrollment_status": {
      "enrolled_manual_hosts_count": 124,
      "enrolled_automated_hosts_count": 124,
      "unenrolled_hosts_count": 112
    }
  }
}
```
3
- [ ] Add new `munki_issues` array to the `GET /api/v1/fleet/hosts/{id}/macadmins` response with details specific to that host.
Example response
```json
{
  "macadmins": {
    "munki": {
      "version": "1.2.3"
    },
    "munki_issues": [
      {
        "id": 1,
        "name": "Could not retrieve managed install primary manifest",
        "type": "error",
        "created_at": "2022-08-01T05:09:44Z"
      },
      {
        "id": 2,
        "name": "Could not process item Figma for optional install. No pkginfo found in catalogs: release",
        "type": "warning",
        "created_at": "2022-08-01T05:09:44Z"
      },
      ...
    ],
    "mobile_device_management": {
      "enrollment_status": "Enrolled (automated)",
      "server_url": "http://some.url/mdm"
    }
  }
}
```
4
- [ ] Add new `munki_issue_id` query filter to the `GET /hosts` endpoint.
  - This will allow us to display a list of hosts affected by a specific Munki issue id.
IC: Determine best data structure for this
@lukeheath TODO: Update timestamps to UTC and correct docs
@lukeheath
> Add new `munki_issue_id` query filter to the `GET /hosts` endpoint.
I presume that, as for `mdm_id` in #6732, when this filter is provided we will add a `munki_issue` top-level key for the corresponding issue id, right? The response would look something like:
```json
{
  "hosts": [
    // ...
  ],
  "munki_issue": {
    "id": 1,
    "name": "Could not retrieve managed install primary manifest",
    "type": "error",
    "hosts_count": 2851
  }
}
```
@mna Yes, thanks for calling that out. I've updated the specs to reflect adding the `munki_issue` object to the hosts response.
@lukeheath @noahtalerman Couple non-blocking questions about this ticket:
- Currently, when we receive results from the `munki_info` query and `version == ""` (empty string), we delete (well, "soft-delete") the host's Munki version information. I'm not sure if it's actually possible, but in the off chance that we receive `version == ""` along with some errors or warnings, should we a) delete any issues associated with that host, ignoring those errors/warnings, or b) store those errors/warnings for that host, as if they were unrelated to the version part of the Munki info?
- When we do receive a non-empty version and some errors/warnings, do we just add those errors/warnings, or do they replace the previous set (i.e. delete any existing ones, then insert the newly received ones)?
- Kind of the same as the previous bullet, but specifically for when we receive a non-empty version and no errors/warnings: do we just clear any errors/warnings associated with that host, or do we keep the existing ones untouched?
> in the off chance that we receive `version == ""` along with some errors or warnings, should we a) delete any issues associated with that host, ignoring those errors/warnings, or b) store those errors/warnings for that host, as if they were unrelated to the version part of the Munki info?
@mna hmm, good question. I prefer option (b). This way, if there's an unknown issue with receiving the version, the errors/warnings will still display.
My guess is we delete the host's Munki version information because we assume that Munki is not installed if version is an empty string. Martin, do you know if this is the case?
> When we do receive a non-empty version and some errors/warnings, do we just add those errors/warnings, or do they replace the previous set (i.e. delete any existing ones, then insert the newly received ones)?
Errors/warnings should replace the previous set.
My understanding is that if a Munki error/warning is resolved, Munki no longer reports the error/warning. We'd like Fleet to provide this report of the current Munki errors/warnings.
This way, a user can see that a Munki error/warning was removed for X hosts. This will help the user confirm that they resolved the Munki error/warning.
> when we receive a non-empty version and no errors/warnings, do we just clear any errors/warnings associated with that host, or do we keep the existing ones untouched?
If there are no errors/warnings, we clear any errors/warnings associated with the host.
Same "help the user confirm they resolved..." reasoning.
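The agreed-on semantics above can be sketched as a small pure function (a hypothetical helper, not Fleet code): the newly reported set simply replaces the stored one, so resolved issues disappear, and an empty report clears everything.

```python
def reconcile_host_issues(stored, reported):
    """Replace a host's stored Munki issues with the newly reported set.

    Each issue is a (message, type) pair. Returns (current, resolved):
    the reported set replaces the old one entirely, and `resolved` is
    what the user can use to confirm an issue was fixed.
    """
    current = set(reported)
    resolved = set(stored) - current
    return current, resolved

# A previously stored warning that is no longer reported goes away:
stored = {("manifest error", "error"), ("old warning", "warning")}
reported = {("manifest error", "error")}
current, resolved = reconcile_host_issues(stored, reported)
```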
@noahtalerman
My guess is we delete the host's Munki version information because we assume that Munki is not installed if version is an empty string. Martin, do you know if this is the case?
Yes that's my understanding.
> if a Munki error/warning is resolved, Munki no longer reports the error/warning. We'd like Fleet to provide this report of the current Munki errors/warnings. If there's no errors/warnings, we clear up any error/warnings associated with the host.
:+1: That's what I expected but just wanted to make sure. Makes sense.
I have some concerns about performance, especially around ensuring the errors/warnings messages have been created and loading their ids when receiving a host's munki info with multiple errors/warnings. I'll add a load testing step to the ticket.
@lukeheath @noahtalerman Couple more non-blocking questions:
- If we did receive the same message once as an "error" and once as a "warning", do we want to consider it a single Munki issue (same ID, stored with whatever issue type was reported first) or different ones?
- Do we have an idea of how long the messages can be? I couldn't find this information in the macadmins `munki_info` table implementation. I can take a deeper look into the actual Munki implementation if we're unsure; at the moment I created the table with a limit of 255, which seemed reasonable based on example messages in the ticket/Figma. Related to that: let's say that the messages can be arbitrarily long; the consideration would then become how much of the message we want to keep (it probably makes sense to truncate it at 255, or whatever size we deem enough, if we ever get something bigger).
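If we do cap the stored message (255 is the limit discussed above, not a confirmed final value), a small helper could truncate with a visible marker so a cut message is distinguishable from a naturally short one:

```python
def truncate_message(msg, limit=255):
    """Cap a Munki issue message at `limit` characters, marking the cut with "..."."""
    if len(msg) <= limit:
        return msg
    return msg[: limit - 3] + "..."
```

Truncating before storage also keeps the unique index over the message column within a predictable size.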
> If we did receive the same message once as an "error" and once as a "warning", do we want to consider it a single munki issue
I prefer to consider 1 error and 1 warning with the same message as 2 different Munki issues in Fleet (different IDs). This is because we'd like to prioritize data accuracy: Fleet reports what is reported by Munki.
This is an interesting case. Are you running into this while developing the feature?
> Do we have an idea of how long the messages can be?
I'm not sure. I think taking a deeper look into the Munki implementation would be very helpful. Martin, when you get the chance, can you check this out? If it's helpful, I'm happy to hop on a call to dig into this.
I would prefer to never truncate the message. This way, we can iterate on the Fleet UI to support longer messages if these are common. Are there performance concerns with not truncating?
@mna, I forgot to @ mention you in the above message^
@noahtalerman Thanks for the clarifications!
> I prefer to consider 1 error and 1 warning with same message as 2 different Munki issues
:+1: sounds good.
> Are you running into this while developing the feature?
No, this is a theoretical case (that we still have to consider), as I have not seen "real" Munki data yet. This will likely be a challenge too, since I'm on Linux and this is Mac-only; I will likely need some assistance closer to the end of the ticket to get real data for further testing (`fleetctl preview` enrolls Linux hosts only, too).
> I think taking a deeper look into the Munki implementation would be very helpful
Sure thing, I'll take a look.
> I would prefer to never truncate the message. This way, we can iterate on the Fleet UI to support longer messages if these are common. Are there performance concerns with not truncating?
Yes, there are performance concerns in terms of both payload size and database performance, as the message needs a unique index (well, unique per message + issue type). There are size constraints on unique indexes, and even without those constraints, there are performance issues when the indexed data is too big. One option to work around this is to store and index a hash of the string instead of the string itself, which adds complexity to the code logic but is something to consider if we have to. That being said, I'm not sure how valuable a big paragraph of text is when looking at the hosts count and the list of hosts with that message, and how likely is it that two such long strings would differ only in a few characters towards the end?
But this is definitely a performance-sensitive feature - each host can potentially have a large number of errors+warnings, and the data itself is relatively big (long-ish strings).
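The hash-index workaround mentioned above could look like the following sketch (the helper name and the choice of SHA-256 are assumptions for illustration): index a fixed-length digest of the (message, type) pair instead of the arbitrarily long string itself.

```python
import hashlib

def issue_index_key(message, issue_type):
    """Fixed-length key for a unique index over arbitrarily long messages.

    Hashing (type, message) together keeps the same message distinct
    across issue types, matching the "unique per message + issue type"
    constraint discussed above.
    """
    h = hashlib.sha256()
    h.update(issue_type.encode("utf-8"))
    h.update(b"\x00")  # separator so field boundaries can't collide
    h.update(message.encode("utf-8"))
    return h.hexdigest()  # always 64 hex characters

key = issue_index_key("Could not retrieve managed install primary manifest", "error")
```

The full message would still be stored in a plain (unindexed) column for display.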
@noahtalerman Took a quick peek at the Munki source code. Looks like everything that calls display_warning or display_error ends up in the Warnings and Errors fields: https://github.com/munki/munki/blob/main/code/client/munkilib/display.py#L206-L237
And this gets called from a number of places in the source code, but this is one example that isn't very encouraging with regards to message size: https://github.com/munki/munki/blob/main/code/client/munkilib/appleupdates/su_tool.py#L279
I'm not fluent in Python, but basically any output that comes out of calling softwareupdate is appended to the prefix that Munki adds. (Another thing to keep in mind is that there could be semicolons in the messages; since all errors get bundled into the osquery `munki_info` table joined together by semicolons in a single big string, there's no way we can guarantee the messages are not broken when we split them back. But this is probably a minor issue compared to the size/performance one.)
If we really want to store large strings, then we'll need the hash approach for the unique index, but we'd still need to cap the string value, even if at a very large maximum (2KB or 4KB is probably a reasonable cap; we don't want to store a 1MB message). And storing excessively large values in a column often requires special processing by the DB engine, which is less efficient.
@lukeheath @noahtalerman In any case, for now I'll implement it with a 255-char limit and load-test the feature. I'm worried that even with smaller messages we could run into issues at scale and may need some fine-tuning/caching/etc. Hopefully that's not the case, but there's definitely a risk: this potentially adds significantly more data to ingest from hosts. On that front, do we have a rough idea of how many issues Munki may report with this mechanism (maybe based on how many apps are installed), to run realistic load tests? No rush on that; this won't be ready to load test until some time next week.
Otherwise, as a guess, I'd think 1 message (warning or error) for 50% of the software installed per host is a kind of "realistic worst case" (maybe for a certain percentage of hosts, 50-80%).
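For sizing the load test, the scenario that was later run (100K hosts, 50% with Munki, 50% of those reporting 10 issues each) works out to the following back-of-envelope row counts; this is pure arithmetic from the numbers in this thread, not a measured result:

```python
hosts = 100_000
with_munki = hosts * 50 // 100             # 50% report having Munki installed
reporting_issues = with_munki * 50 // 100  # 50% of those report issues
issues_per_host = 10
pivot_rows = reporting_issues * issues_per_host  # host-to-issue link rows
```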
@mna thank you for digging into the Munki code.
> I'll implement it with a 255 chars limit and load-test the feature
Sounds good 👍
> do we have a rough idea of how many issues munki may report with this mechanism
I'm not sure. I'll ask the customer that is requesting this feature and follow up in this issue with the answer.
> keep in mind is that there could be semicolons in the messages, meaning that when it gets bundled into the osquery munki_info table with all errors joined together by a semicolon in a single big string, there's no way we can guarantee the messages are not broken when we split them back - but this is probably a minor issue compared to the size/performance one
Thank you for checking this. This is great to be aware of.
I agree that this is a minor issue because I don't think semicolons are common characters in Munki errors/warnings.
It makes sense to come back to this later if I'm wrong and semicolons turn out to be common in Munki warnings/errors.
Ran this in the load testing environment (the current implementation caps messages at 255 chars, with the unique index on name + issue_type, where name is the full message):
- first with 60K hosts where ~50% report having munki installed and 50% of those report 10 munki issues (warnings and errors mixed randomly)
- then with 100K hosts, again with 50% having munki and 50% of those reporting 10 issues
Everything looks good regarding the DB metrics and the general behaviour of the website, here are screenshots from the 100K hosts run (note that no munki issues-related SQL statements show up in the Top SQL):

We can see here in the developer tools window that this host did report some munki issues:

RDS mysql metrics:

Redis metrics, generally irrelevant in this scenario but included for completeness' sake:

Here I curl'd the aggregate stats endpoint (as the frontend is not implemented yet):

And finally, a test running a live query; it ran fine and reported results quickly:
