
Internal metrics collection

Open jimaek opened this issue 2 years ago • 7 comments

For tasks like https://github.com/jsdelivr/globalping/issues/37 and many others we will need to know more about the probes.

Here is a list of data we would probably need to collect to build the next features (a rough sketch of such a record follows the list):

  • Accepted tests, in-progress tests, finished tests, failed tests. Accepted/in-progress are probably just real-time values, while the rest are time series. Maybe even totals and per-type counts as well?
  • CPU load and CPU cores available
  • Uptime
  • Other?
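
To make the list above concrete, here is a rough TypeScript sketch of what a per-probe stats record could look like; the field names are illustrative only, not an agreed format:

// Hypothetical shape of a per-probe stats record.
interface ProbeStats {
    probeId: string;
    // Real-time values
    cpuLoad: number;          // e.g. 1-minute load average
    cpuCores: number;
    testsInProgress: number;
    // Time-series counters (per reporting interval)
    testsAccepted: number;
    testsFinished: number;
    testsFailed: number;
    uptimeSeconds: number;
    collectedAt: Date;
}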

We should probably store them for up to 7 days in a time-series DB. But note that whatever DB we choose will need to scale, as in the future we will also support scheduled tests, like pinging the same target every minute and building a chart of performance over time per region.

jimaek avatar Aug 13 '22 15:08 jimaek

How often would we collect CPU/mem stats? And how often would we push them to the API?

Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?

patrykcieszkowski avatar Aug 15 '22 15:08 patrykcieszkowski

How often would we collect CPU/mem stats? And how often would we push them to the API?

Ideally every few seconds, e.g. every 10s. The more accurate the data, the more we can later do with routing tests between probes. No local buffering; collect and push immediately.
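
A minimal probe-side sketch of such a collection loop, assuming socket.io is the transport; the 'probe:stats' event name and the URL are placeholders, not the actual protocol:

// Collect CPU stats every 10s and push immediately, no local buffering.
import * as os from 'node:os';
import { io } from 'socket.io-client';

const socket = io('https://api.globalping.example'); // placeholder URL

setInterval(() => {
    socket.emit('probe:stats', {
        cpuLoad: os.loadavg()[0],    // 1-minute load average
        cpuCores: os.cpus().length,
        freeMemBytes: os.freemem(),
        collectedAt: Date.now(),
    });
}, 10_000);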

Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?

Probably how long the probe was connected, unless there is a better metric. We can technically measure this both from the probe and from the API, so in this case it's probably better to collect this data on the API level?

If the probe disconnected and reconnected in a "ready" state after 5 seconds, that means the probe had a downtime of 5s.
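
An API-side sketch of measuring downtime from connection events, assuming a socket.io server; the in-memory map and the probeId lookup are for illustration only:

import { Server, Socket } from 'socket.io';

const ioServer = new Server();
const lastDisconnect = new Map<string, number>(); // probeId -> disconnect timestamp

ioServer.on('connection', (socket: Socket) => {
    const probeId = socket.handshake.query.probeId as string; // hypothetical identifier

    const disconnectedAt = lastDisconnect.get(probeId);
    if (disconnectedAt) {
        // e.g. a reconnect after 5 seconds counts as 5s of downtime
        const downtimeSeconds = (Date.now() - disconnectedAt) / 1000;
        console.log(`probe ${probeId} was down for ${downtimeSeconds}s`);
    }

    socket.on('disconnect', () => {
        lastDisconnect.set(probeId, Date.now());
    });
});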

jimaek avatar Aug 15 '22 15:08 jimaek

@MartinKolarik any idea how to reasonably store this data? Ideally we would keep it all under a single record, but since we need older data to expire, we can't do that. I'm not convinced storing each measurement separately is a good idea either. Maybe we could group them by 24-hour periods, per key?

gp:probe:stats:15-08-22
{
    "cpu": [ { "date": "123", ... }, ... ],
    "mem": [ { "date": "123", ... }, ... ]
}

I'm not sure how to record uptime/downtime.
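
For the 24-hour grouping idea above, a minimal sketch using ioredis: one list per probe per day that expires after 7 days. The key layout and field names are illustrative only:

import Redis from 'ioredis';

const redis = new Redis();

async function recordStat (probeId: string, stat: { cpu: number; mem: number }) {
    const day = new Date().toISOString().slice(0, 10); // e.g. "2022-08-15"
    const key = `gp:probe:${probeId}:stats:${day}`;

    await redis.rpush(key, JSON.stringify({ ...stat, date: Date.now() }));
    await redis.expire(key, 7 * 24 * 60 * 60); // keep for 7 days
}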

patrykcieszkowski avatar Aug 15 '22 16:08 patrykcieszkowski

We could in theory begin with only real-time data sent as part of websocket pings, so that the API would always have accurate info on CPU load.

But in any case, if we use a time-series DB, the exact format will depend on its rules.

jimaek avatar Aug 15 '22 16:08 jimaek

Real-time-only data can go into Redis in whatever format... Historical data would depend on the selected storage, which likely won't be Redis.
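
For the real-time-only part, a sketch of keeping the latest values per probe in a Redis hash with a short TTL, so stale probes expire on their own; field names are illustrative:

import Redis from 'ioredis';

const redis = new Redis();

async function updateRealtimeStats (probeId: string, cpuLoad: number, cpuCores: number, testsInProgress: number) {
    const key = `gp:probe:${probeId}:realtime`;

    await redis.hset(key, { cpuLoad, cpuCores, testsInProgress, updatedAt: Date.now() });
    await redis.expire(key, 60); // drop the entry if the probe stops reporting
}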

MartinKolarik avatar Aug 15 '22 16:08 MartinKolarik

So to me it seems for now we need these real-time values:

  • CPU load
  • Available CPU cores
  • In-progress tests

Then we need to select a time-series DB out of the 100 that exist now and start storing:

  • Uptime per probe, probably pushed by the API
  • Accepted tests / successful tests / failed tests

Since the real-time part only needs Redis, we can implement that part first. @MartinKolarik what do you think?
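
For the test counters, a sketch of how the real-time "in progress" value could be maintained with plain Redis counters; the lifecycle hooks and key names are hypothetical:

import Redis from 'ioredis';

const redis = new Redis();

async function onTestAccepted (probeId: string) {
    await redis.incr(`gp:probe:${probeId}:tests:inProgress`);
    await redis.incr(`gp:probe:${probeId}:tests:accepted`);
}

async function onTestFinished (probeId: string, failed: boolean) {
    await redis.decr(`gp:probe:${probeId}:tests:inProgress`);
    await redis.incr(`gp:probe:${probeId}:tests:${failed ? 'failed' : 'finished'}`);
}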

jimaek avatar Aug 15 '22 16:08 jimaek

Yes, I agree. Let's implement the first part now using only Redis, as that's fairly straightforward. The other part I'd postpone until after #176.

MartinKolarik avatar Aug 15 '22 16:08 MartinKolarik