[EPIC] Support for Anomaly Detection

Open vanakema opened this issue 2 years ago • 9 comments

Is your feature request related to a problem?

When you have a small team, you want to know when your app is misbehaving, with as little intervention as possible.

Describe the solution you'd like

SigNoz integrates an open source anomaly detection library to alert users if anything goes outside the "normal" range.

Some use cases:

  1. Abnormal latency (latency spiking) on certain DB queries
  2. Abnormal latency (latency spiking) on certain flask endpoints
  3. Abnormal error rate on certain endpoints
  4. Abnormal requests/s

Describe alternatives you've considered

Really the only alternative would be manually creating alerts in Prometheus or feeding SigNoz metrics into an anomaly detection library ourselves.

Additional context

The DataDog WatchDog feature is great because it automatically detects anomalous behavior. That is really helpful when you have a small team, or a team without a dedicated SRE person, since you no longer necessarily have to know what to look for.

Thank you for your feature request – we love each and every one!

vanakema avatar Sep 16 '21 18:09 vanakema

Figured this might be a helpful repo for reference https://github.com/rob-med/awesome-TS-anomaly-detection

vanakema avatar Sep 16 '21 18:09 vanakema

Thanks @vanakema for detailing out the use cases. Anomaly detection IS in our roadmap - but a few months down the line.

Curious, what sort of algos worked best for you for detecting "abnormal" values? Does a simple threshold on a rolling average work well enough, or are more advanced algos like seasonal pattern detection needed?

pranay01 avatar Sep 16 '21 18:09 pranay01

GitLab has written about basic anomaly detection with Prometheus rules, using z-scores and seasonality. https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/

This sort of thing would also be possible with SigNoz, as we plan for SigNoz to be compatible with Prometheus rules and Alertmanager.
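As a rough illustration of the z-score idea mentioned above, here is a minimal Python sketch. It is not GitLab's rule set and not a SigNoz feature; the function names, window size, threshold, and sample latencies are all hypothetical, and it ignores seasonality entirely. Each point is compared with the mean and standard deviation of the preceding window, and flagged when it sits more than `threshold` standard deviations away.

```python
from statistics import mean, pstdev


def rolling_zscores(values, window):
    """Z-score of each point relative to the `window` points before it.

    Returns None for the first `window` points, where there is not
    enough history to compute a baseline.
    """
    scores = []
    for i, v in enumerate(values):
        if i < window:
            scores.append(None)
            continue
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat baseline: any deviation at all is infinitely unusual.
            scores.append(0.0 if v == mu else float("inf"))
        else:
            scores.append((v - mu) / sigma)
    return scores


def find_anomalies(values, window=5, threshold=3.0):
    """Indices whose absolute z-score exceeds `threshold` (assumed default: 3)."""
    return [
        i
        for i, z in enumerate(rolling_zscores(values, window))
        if z is not None and abs(z) > threshold
    ]


# Hypothetical latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101]
print(find_anomalies(latencies_ms))
```

Note that a spike inflates the baseline for the points right after it, so a plain rolling window can miss follow-up anomalies; that is one reason seasonality-aware or more robust methods come up in this discussion.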

ankitnayan avatar Sep 17 '21 04:09 ankitnayan

We can also leverage Third Eye

This is built for Apache Pinot, which is an OLAP database similar to ClickHouse.

pranay01 avatar Jul 26 '22 11:07 pranay01

Might be worthwhile asking the Netdata team about lessons learnt applying ML to time series.

nwmcsween avatar Jul 19 '23 21:07 nwmcsween

Thanks for the note, @nwmcsween. Do you think Netdata does a good job applying ML to time series data? Any blogs/issues where they share more about it?

pranay01 avatar Jul 20 '23 11:07 pranay01

@pranay01 Namaste. ML and alarms in particular are Netdata's specialty. It's worth having a look at it. I speak from 30 years of experience with Nagios, Zabbix, Elastic, Opensearch, Influx, and many more, including Netdata. Netdata leans more heavily toward *nix than Windows and lacks otel integration. That's why I'm looking at you guys right now. 😃

StefanSa avatar Jul 29 '23 19:07 StefanSa

Thanks @StefanSa - do you have relevant docs in NetData I should look at?

pranay01 avatar Jul 30 '23 11:07 pranay01

@pranay01 Certainly, not a problem. There is a lot of reading material here. As I said, alerting is also well done there.

ML: https://learn.netdata.cloud/docs/ml-and-troubleshooting/machine-learning-ml-powered-anomaly-detection

https://learn.netdata.cloud/docs/ml-and-troubleshooting/anomaly-advisor

https://learn.netdata.cloud/docs/visualizations/netdata-charts#anomaly-rate-ribbon

https://learn.netdata.cloud/docs/ml-and-troubleshooting/metric-correlations

https://www.youtube.com/watch?v=2gJ36YuW6Ko

Alerting: https://learn.netdata.cloud/docs/alerting/

Live Demo: Live-Demo

StefanSa avatar Jul 30 '23 17:07 StefanSa