Extend Frequency range of checks from 120 sec max to hours and longer periods
What would you like to be added: Extend the frequency range of checks from the current 120-second maximum to hours and longer periods.
Why is this needed: Sometimes we need less frequent checks, such as once every 10 minutes, once per hour, or even once per day, but the Synthetic Monitoring application interface only allows the frequency to be set between 10 and 120 seconds.
Hi @MurzNN, could you please provide some concrete examples of these use cases so that we can figure out how to implement something like this?
Right now the checks run at a relatively high frequency (compared to this request) because the metrics need to be kept alive. If we reduce the check frequency to something like once per hour, the metrics would go stale. One possible way to fix this is to run the actual check once per hour but push the metrics every 2 minutes (using the most recently available value). This requires some changes in the way the agent works (grafana/synthetic-monitoring-agent).
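To make the idea above concrete, here is a minimal sketch of "run the check rarely, republish the last value often". It is not how the agent is implemented; `CachedCheck`, `run_check`, `tick`, and `simulate` are hypothetical names used only for illustration, and the clock is simulated so the example runs instantly.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CachedCheck:
    """Runs an expensive check rarely but republishes its last result often."""

    run_check: Callable[[], float]  # hypothetical probe, e.g. returns a duration in seconds
    check_interval: float           # seconds between real executions of the check
    last_value: Optional[float] = None
    last_run: Optional[float] = None
    runs: int = 0

    def tick(self, now: float) -> Optional[float]:
        """Called once per publish interval; returns the value to push."""
        due = self.last_run is None or now - self.last_run >= self.check_interval
        if due:
            # Only here does the expensive check actually execute.
            self.last_value = self.run_check()
            self.last_run = now
            self.runs += 1
        # Otherwise the most recently available value is pushed again.
        return self.last_value


def simulate(check: CachedCheck, publish_interval: float,
             duration: float) -> List[Tuple[float, Optional[float]]]:
    """Drive the check with a simulated clock and collect the pushed samples."""
    samples = []
    now = 0.0
    while now < duration:
        samples.append((now, check.tick(now)))
        now += publish_interval
    return samples


# Check once per hour, publish every 2 minutes, simulate 2 hours.
check = CachedCheck(run_check=lambda: 1.5, check_interval=3600)
samples = simulate(check, publish_interval=120, duration=2 * 3600)
print(len(samples), "samples pushed,", check.runs, "real check runs")
```

In this sketch the series keeps receiving a sample every 2 minutes, so it would not go stale, while the expensive check itself only runs once per hour.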
I need to verify that some expensive tasks work correctly (and measure their execution time). These tasks put a heavy load on the server, so running them every 1-2 minutes would create too much load.
In New Relic, for example, we can set the interval anywhere from 1 minute to 1 day.

@mem I see the docs here about what is defined to be an "active series" which I think is what you're referring to when you say "stale". It seems like it would be straightforward to implement up to a 29m window without trying to deal with the question of something being an "active series" (or 20m for a cleaner 3 checks/hour). Can you confirm if I am following this correctly?
I might not have read the page thoroughly enough, but one use case for decreasing the frequency of a check would be to reduce the DPM of these checks as a cost optimization measure, right? Going from the default Prometheus 15s to 1m is a 4x reduction, and going from 1m or 2m up to 20m is a further 10-20x reduction for the things that aren't as critical but should still be checked.
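The arithmetic behind those reduction factors can be sketched as follows. This assumes DPM is simply one data point per series per interval, which may not match how the product actually computes it; `dpm` and `reduction` are hypothetical helper names.

```python
def dpm(interval_seconds: float) -> float:
    """Data points per minute for one series sampled every interval_seconds."""
    return 60.0 / interval_seconds


def reduction(old_interval: float, new_interval: float) -> float:
    """Factor by which DPM drops when the interval grows from old to new."""
    return new_interval / old_interval


print(dpm(15), dpm(60))        # 4.0 and 1.0 data points per minute
print(reduction(15, 60))       # 15s -> 1m:  4.0
print(reduction(120, 1200))    # 2m  -> 20m: 10.0
print(reduction(60, 1200))     # 1m  -> 20m: 20.0
```

Under that assumption, a 20m interval corresponds to 3 checks per hour per series.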
I'd welcome any increase in both sliders (even up to the active series limit). As another point of comparison in addition to the New Relic reference, UptimeRobot has a maximum timeout of up to a minute, and its check interval can be as long as 24 hours. There are probably numerous other "health check" tools and products that follow this pattern as well.
EDIT (1/11/24): I've noticed that increasing the check interval from 1m to 2m reduces the log volume but, contrary to what I would expect, not the DPM volume. Either the calculation in the UI is wrong, or the DPM doesn't decrease when the check interval increases.