Internal metrics collection
For tasks like https://github.com/jsdelivr/globalping/issues/37, and many others, we will need to know more about the probes.
Here is a list of data we would probably need to collect to build the next features:
- Accepted tests, in-progress tests, finished tests, failed tests. Accepted/in-progress are probably just real-time values, while the rest are time series. Maybe even totals and per-type counts as well?
- CPU load and CPU cores available
- Uptime
- Other?
We should probably store them for up to 7 days in a time-series DB. But note that whatever DB we choose will need to scale, as in the future we will also support scheduled tests, like pinging the same target every minute and building a chart of performance over time per region.
How often would we collect CPU/mem stats? And how often would we push them to the API?
Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?
> How often would we collect CPU/mem stats? And how often would we push them to the API?
Ideally every few seconds, e.g. every 10s. The more accurate the data, the more we can later do with routing tests between probes. No local buffering; collect and push immediately.
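A minimal sketch of what that could look like on the probe side, assuming a socket.io connection to the API; the `probe:stats` event name, the endpoint URL, and the payload shape are made up for illustration:

```ts
// Probe-side collection sketch: sample every 10s and push immediately,
// with no local buffering, per the suggestion above.
import * as os from 'node:os';
import { io } from 'socket.io-client';

const socket = io('https://api.globalping.io'); // hypothetical endpoint
const INTERVAL_MS = 10_000;

setInterval(() => {
	// os.loadavg() returns 1/5/15-minute load averages; take the 1-minute value.
	const [ load1 ] = os.loadavg();

	socket.emit('probe:stats', {
		date: Date.now(),
		cpu: { load: load1, cores: os.cpus().length },
		mem: { free: os.freemem(), total: os.totalmem() },
	});
}, INTERVAL_MS);
```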
> Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?
Probably how long the probe was connected, unless there is a better metric. We can technically measure this both from the probe and from the API, so in this case it's probably better to collect this data on the API level?
If the probe disconnected and then connected with a "ready" state after 5 seconds, that means the probe had a downtime of 5s.
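A sketch of how the API could derive this from connection events, assuming socket.io on the server; the `probeId` auth field and the bookkeeping are illustrative, not part of the existing protocol:

```ts
// API-side uptime tracking sketch: record disconnect timestamps and
// compute downtime when the same probe reconnects.
import { Server } from 'socket.io';

const io = new Server(3000); // hypothetical port
const lastDisconnect = new Map<string, number>(); // probeId -> timestamp (ms)

io.on('connection', (socket) => {
	const probeId = socket.handshake.auth?.probeId; // hypothetical identifier
	const disconnectedAt = lastDisconnect.get(probeId);

	if (disconnectedAt) {
		// e.g. a disconnect followed by a reconnect 5s later => 5s of downtime
		const downtimeMs = Date.now() - disconnectedAt;
		console.log(`probe ${probeId} was down for ${downtimeMs} ms`);
	}

	socket.on('disconnect', () => {
		lastDisconnect.set(probeId, Date.now());
	});
});
```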
@MartinKolarik any idea how to reasonably store this data? Ideally we would keep it all under a single record, but since we need older data to expire, we can't do that. I'm not convinced storing each measurement separately is a good idea either. Maybe we could group them by 24-hour periods, per key?
```
gp:probe:stats:15-08-22
{
	"cpu": [ { "date": "123", ... }, ... ],
	"mem": [ { "date": "123", ... }, ... ]
}
```
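A sketch of that grouping with ioredis; it splits the nested example above into per-day `:cpu`/`:mem` lists so samples can be appended without rewriting the whole record, and lets each day's key expire on its own after the 7-day retention window. The `storeSample()` helper and key layout are assumptions:

```ts
// Per-day grouping sketch: one key per probe per day, expiring after 7 days.
import Redis from 'ioredis';

const redis = new Redis();
const SEVEN_DAYS_S = 7 * 24 * 60 * 60;

async function storeSample (probeId: string, cpu: object, mem: object) {
	const day = new Date().toISOString().slice(0, 10); // e.g. "2022-08-15"
	const key = `gp:probe:${probeId}:stats:${day}`;

	// Append each sample; older days age out via their own TTLs.
	await redis.rpush(`${key}:cpu`, JSON.stringify(cpu));
	await redis.rpush(`${key}:mem`, JSON.stringify(mem));
	await redis.expire(`${key}:cpu`, SEVEN_DAYS_S);
	await redis.expire(`${key}:mem`, SEVEN_DAYS_S);
}
```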
I'm not sure how to record uptime/downtime.
We could in theory begin with only real-time data as part of WebSocket pings, so that the API would always have accurate info on CPU load.
But in any case, if we use a time-series DB, the exact format will depend on its rules.
Real-time-only data can go into Redis in whatever format... Historical data would depend on the selected storage, which likely won't be Redis.
So it seems to me that for now we need these real-time values:
- CPU load
- Available CPU cores
- In-progress tests
Then we need to select a time-series DB out of the 100 that exist now and start storing:
- Uptime per probe, probably pushed by the API
- Accepted/successful/failed tests
Since the real-time part only needs Redis, we can implement that part first. @MartinKolarik what do you think?
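For the Redis-backed real-time part, a minimal sketch, assuming one hash per probe that is overwritten on every stats push; the key name, field names, and the 30s TTL are assumptions for illustration:

```ts
// Real-time values sketch: latest CPU load, core count, and in-progress
// test count per probe, with a short TTL so stale probes drop out.
import Redis from 'ioredis';

const redis = new Redis();

async function updateRealtimeStats (probeId: string, stats: {
	cpuLoad: number;
	cpuCores: number;
	inProgressTests: number;
}) {
	const key = `gp:probe:${probeId}:realtime`;

	await redis.hset(key, {
		cpuLoad: stats.cpuLoad,
		cpuCores: stats.cpuCores,
		inProgressTests: stats.inProgressTests,
	});

	// Expire shortly after the next expected push (10s interval above).
	await redis.expire(key, 30);
}
```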
Yes, I agree; implement the first part now using only Redis, as that's fairly straightforward. The other part I'd postpone until after #176.