collectd-graphite

Threshold email overload on failed server

Open wr0ngway opened this issue 13 years ago • 0 comments

I had a problem with my graphite server today that caused connections from this plugin to fail. The result was thousands of emails sent to my ops address (hundreds from each collectd node).

I have thresholds and notifications set up on each node to email me when values get too high, and also when a metric hasn't been updated within N iterations. The emails I got during the failure were all of the latter type.

I think this happens because a failure in send_to_graphite stops the data from reaching the rest of collectd, so collectd thinks the metrics aren't being updated. I'm not sure how Perl works, but maybe an exception is thrown that propagates back into collectd and aborts the data collection? Can you wrap all of send_to_graphite in an exception handler and log/ignore the error? The failure may occur not only on connect but also on write: my graphite server was having IO issues, so the connection sometimes succeeded but the write then failed.
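To illustrate the kind of wrapper I mean, here is a rough sketch in Python rather than Perl, since I don't know Perl well enough. It is not the plugin's actual code: `send_to_graphite` here is a hypothetical stand-in that connects and then writes, `safe_send` is a made-up name, and the host and port are placeholder assumptions. The point is only that both connect and write failures get logged and swallowed instead of propagating to the caller.

```python
import logging
import socket

log = logging.getLogger("graphite_sketch")


def send_to_graphite(host, port, payload, timeout=2.0):
    """Hypothetical stand-in for the plugin's send routine: connect, then write."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload.encode("ascii"))


def safe_send(payload, host="127.0.0.1", port=2003, sender=send_to_graphite):
    """Call the sender, but never let a connect or write failure escape.

    Returns True if the send succeeded and False if it was logged and dropped.
    """
    try:
        sender(host, port, payload)
        return True
    except Exception as exc:
        # Deliberately broad: refused connections, timeouts, and failed writes
        # are all logged and ignored so the caller keeps running.
        log.warning("graphite send failed, dropping batch: %s", exc)
        return False
```

With something like this, a graphite outage would cost the dropped batches and a log line per failure, but the rest of the collection cycle would carry on.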

wr0ngway, Oct 07 '11 13:10