
Pagination and sorting by failed endpoints for large dashboards

Open dchidell opened this issue 3 years ago • 8 comments

I'm probably pushing this a bit far - but why not.

I'm trying to monitor ~3500 services, which group together into ~450 groups. At this scale the UI and API are unusable, and requests take ~2.5 minutes to complete.

It'd be great to have something simple to alleviate the burden here. I think the actual checks are working fine given Go's concurrency, and I have disabled the locking mechanism, but the UI is holding it back. This was tested with the sqlite backend (I'm currently testing the memory backend, but with so many services it takes a while before the first check of each one has run).

Maybe these are some options to consider:

  • Pagination
  • A setting to display only services which are down (or down within X period of time)
  • Less comprehensive landing page (i.e. less data to render) which only displays the state currently with no history until you click that service
  • Group based landing page which shows only the state of the group as an aggregated entity, instead of digging into individual services

Feel free to tell me that this is way beyond the scale gatus was ever designed for, or ever planned to support.

dchidell avatar Aug 10 '21 23:08 dchidell

3500?! 🤣

That's far beyond what I thought anybody would ever use Gatus for, and I can see why the UI would be unusable. Even the API alone must return an insanely large response. In fact, the pagination built around the API only takes into consideration the number of results to return per service, not the number of services to return.

Even if we just looked at the results per group, 450 is quite a lot. I don't know how similar your tests are to each other, but at that scale, perhaps splitting the workload between several Gatus instances would make more sense.

That being said, I do think it would be nice to have the ability to expose only the groups rather than the groups and all of their underlying services. In fact, #149 mentions something similar-ish.

What I have in mind would be something like this: [mockup image]

where it would be possible to click on the service group and list all the individual services underneath it, but through a separate API call rather than using a single call to retrieve everything.
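To sketch the idea (the endpoint paths and payloads below are purely hypothetical, not an existing API), the dashboard could first fetch only group summaries:

GET /api/v1/groups/statuses

[
  { "group": "group1", "healthy": true },
  { "group": "group2", "healthy": false }
]

and clicking on group2 would then trigger a second call, e.g. GET /api/v1/groups/group2/services/statuses, to load only that group's services.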

There are obviously a couple of things to think about as well:

  1. What about users who have <10 service groups, where it makes sense to retrieve everything in one go?
  2. Would it be a good idea to have a page dedicated to a single service group, similar to how there are pages for each individual service (example)?
  3. Does it make sense to assume that a lot of users will have this many services? While it's not uncommon to have hundreds of tests, when it gets to the thousands, looking at a dashboard with so many entries feels a little bit too overwhelming, in which case it would make more sense to rely on alerting to get notifications rather than look at the dashboard.

TwiN avatar Aug 11 '21 00:08 TwiN

After leaving it overnight to catch up, when using the memory storage type, the UI responds within a few seconds (rather than minutes). For the application I have, memory storage is probably fine, but the UI is still too cluttered.

All of the tests are identical, each is a single HTTP endpoint, the use-case is a scaled application distributed across multiple servers with multiple health check endpoints on each server reverse-proxied to backend services. A simple 200 response is sufficient.

I definitely think exposing only groups is an excellent way forward here, and the UI layout you've drafted looks like a great way to represent that, provided, as you said, it's a separate API call.

In response to your other points:

  1. This could be a global config setting determining whether the top-tier view is group-based or service-based.
  2. Yes, I think so. It might even be desirable to not bother collecting the detailed metrics for each service and instead collect information only for the group as a whole (with a corresponding setting). This would probably end up pretty close to #149 (as the group would represent an entity and a service would represent a check within that entity, it's just a different hierarchy structure to accomplish the same thing).
  3. Looking at the dashboard is definitely overwhelming, but the fact that Gatus is able to poll this many services reliably (with surprisingly little load on the server running it, using a 1-minute interval) goes to show that it's a really capable tool; only UI changes are needed to allow it to scale up massively.

dchidell avatar Aug 11 '21 10:08 dchidell

I've given this some more thought, and typically the only interesting services are those experiencing problems.

What about adding a configuration option which conditionally controls whether a service is displayed in the UI, e.g. only show a service in the UI if its uptime over a 24-hour period is < 90%.

This could be a service-level configuration in the same format as the existing health check rules. For example (see "display_conditions" below):

metrics: true         # Whether to expose metrics at /metrics
services:
  - name: twinnation  # Name of your service, can be anything
    url: "https://twinnation.org/health"
    interval: 30s     # Duration to wait between every status check (default: 60s)
    conditions:
      - "[STATUS] == 200"         # Status must be 200
      - "[BODY].status == UP"     # The json path "$.status" must be equal to UP
      - "[RESPONSE_TIME] < 300"   # Response time must be under 300ms
    display_conditions:
      - "[STATUS] == UNHEALTHY"
  - name: example
    url: "https://example.org/"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
    display_conditions:
      - "[UPTIME].24h < 0.9"

dchidell avatar Aug 26 '21 15:08 dchidell

@dchidell What do you mean by this?:

I've given this some more thought, and typically the only interesting services are those experiencing problems.

But anyways, I was thinking of simply adding support for paging. That, coupled with maybe some kind of optional sorting capability at the API level to enable things like surfacing failing services first, should be sufficient.
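To make that concrete (the query parameters below are hypothetical, not a committed API), a paged and sorted request could look something like this:

GET /api/v1/statuses?page=2&pageSize=20&sortBy=health

where sortBy=health would surface failing services first.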

I don't know when I'll work on that though. I was hoping maybe I could manage it before v3.0.0 -- but we'll see.

TwiN avatar Aug 28 '21 03:08 TwiN

Quick update on the above ^

I released v3.0.0 today, and while paging hasn't been implemented yet, I did make the breaking change necessary to allow paging.

Before, /api/v1/statuses returned a map of service statuses:

{
  "group1_service1": {},
  "group1_service2": {},
  "group2_service1": {}
}

but now, /api/v1/services/statuses, the replacement of the aforementioned endpoint, returns an array of service statuses:

[
  {
    "key": "group1_service1"
  },
  {
    "key": "group1_service2"
  },
  {
    "key": "group2_service1"
  }
]

This was the breaking change that was needed for paging.
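As a rough illustration of why that matters (this is a sketch, not the actual Gatus implementation), an ordered slice can be carved into deterministic pages, which a map cannot:

package main

import "fmt"

// ServiceStatus is a stand-in for the service status payload, trimmed down to its key (hypothetical).
type ServiceStatus struct {
    Key string
}

// paginate returns the 1-indexed page of statuses; it relies on statuses being an ordered slice.
func paginate(statuses []ServiceStatus, page, pageSize int) []ServiceStatus {
    start := (page - 1) * pageSize
    if start < 0 || start >= len(statuses) {
        return nil
    }
    end := start + pageSize
    if end > len(statuses) {
        end = len(statuses)
    }
    return statuses[start:end]
}

func main() {
    statuses := []ServiceStatus{{Key: "group1_service1"}, {Key: "group1_service2"}, {Key: "group2_service1"}}
    fmt.Println(paginate(statuses, 1, 2)) // [{group1_service1} {group1_service2}]
}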

I'm thinking of employing a slightly more unusual approach to paging, but before I get into my idea, let me talk about the problem.

The problem

Let's say we implement paging; we want 20 services per page.

In total, you have 38 services split into 5 groups:

  • group1 has 4 services
  • group2 has 7 services
  • group3 has 10 services
  • group4 has 15 services
  • group5 has 2 services

So now what do you do for your first page? The first 20 services would include all services from group1 and group2, but you'd only have 9 out of 10 of the services from group3.

The solution

I think that the most transparent solution would be to implement paging that revolves around the number of groups that can be shown in one page.

I'm thinking that 10 groups per page would be a good default value, and you may be thinking "that's way too much", and that's on purpose.

Most people have fewer than 10 groups on their dashboard, which means it won't impact them, but for people like you who have thousands of services, it means you can deliberately use groups for the sake of pagination.

Caveats

This will most likely make #35 harder to implement. I initially wanted to implement #35 by doing the filtering purely on the front end, but if we add pagination to the main dashboard, the filtering would need to be done on the back end.

TwiN avatar Sep 06 '21 20:09 TwiN

@dchidell I finally decided to look into improving the performance yesterday (I couldn't reply until now due to work). I'm not sure if you still have ~3500 endpoints, but if you do, I have a couple of questions for you.

Do all of your endpoints have the same interval, or do they differ? If they do differ, could you elaborate? For instance, 50% of your endpoints with an interval of 10m and 50% with an interval of 1h.

The best solution would likely be to fine-tune the SQL queries, but I'm considering adding a smart layer of cache on top of the persistence layer. It wouldn't really help for infrequently visited dashboards, but for frequently visited dashboards, it would be a significant improvement.
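For illustration, a minimal sketch of what that cache layer could look like (the Store interface and method names below are assumptions, not Gatus' actual persistence API):

package store

import (
    "sync"
    "time"
)

// Store is a stand-in for the persistence layer (hypothetical interface).
type Store interface {
    GetAllServiceStatuses() ([]byte, error)
}

// CachedStore serves dashboard reads from memory for up to TTL, so frequently
// visited dashboards avoid hitting SQLite/Postgres on every page load.
type CachedStore struct {
    Store
    TTL      time.Duration
    mutex    sync.Mutex
    cached   []byte
    cachedAt time.Time
}

func (s *CachedStore) GetAllServiceStatuses() ([]byte, error) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    if s.cached != nil && time.Since(s.cachedAt) < s.TTL {
        return s.cached, nil // cache hit: skip the database entirely
    }
    data, err := s.Store.GetAllServiceStatuses()
    if err != nil {
        return nil, err
    }
    s.cached, s.cachedAt = data, time.Now()
    return data, nil
}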

TwiN avatar Aug 02 '22 22:08 TwiN

We have more now, we're constantly growing! But the order of magnitude is still the same.

The endpoints all use the same interval, and all would be HTTP. Essentially we have ~1000 machines, each running ~5 different Docker-based web services. The 5 web services are all very similar to each other and are the same across the machines. The interval would ideally be as low as possible, to detect failures as soon as they happen; ideally 1 minute. Provided Gatus can make use of multiple cores/threads, it can quite happily run on a dedicated machine with over 100 cores.

dchidell avatar Aug 03 '22 11:08 dchidell

@dchidell I've just pushed the image twinproduction/gatus:experimental, built from #314. It's not under the latest tag yet because I'd like it if somebody properly tested it first.
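For anyone who wants to help test it, the usual Docker invocation applies (the config mount path below assumes the default location the official image expects; adjust to your setup):

docker pull twinproduction/gatus:experimental
docker run -p 8080:8080 -v /path/to/your/config.yaml:/config/config.yaml twinproduction/gatus:experimental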

The first load of the dashboard will be a bit slow, but subsequent loads should be significantly faster.

TwiN avatar Aug 12 '22 01:08 TwiN