Allow thold commands to be queued and run in parallel so as to not affect the polling process
Hey Everyone
Cacti V1.2.12 THOLD 1.4
We ran a test today where we triggered 1600+ thresholds into our ticketing system. All of the thresholds are set to execute a script that calls an API for our ticketing system.
What we noticed is that, under load, some of the thresholds executed multiple times.
See below for an example:
2020/06/16 10:17:08 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:09 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:10 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:28 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:28 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:35 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:38 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:42 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
I also see in the thold log that the threshold went into trigger state the same number of times. During regular operation we don't see this behaviour, so maybe it's some sort of queue issue?
Yea, this would be a problem if the queries blocked. Instead they should be queued up and executed in parallel and in the background. If it's the exception and not the rule, and we don't care about the exit status, we can just background them all, though it might overload smaller VMs.
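To illustrate the difference, here is a minimal sketch; the command string is just a placeholder, not thold's actual notification call:

```php
<?php
// Placeholder command; not the real thold script invocation.
$notify_command = '/usr/local/bin/notify_ticket.sh 44588';

// Serial/blocking: the calling process waits for the command to finish,
// so a slow ticketing API holds up everything queued behind it.
exec($notify_command, $output, $return_code);

// Backgrounded: redirect output and append '&' so the shell returns immediately.
// The exit status is lost, which is only acceptable if we don't care about it,
// and firing 1600+ of these at once could overload smaller VMs.
exec($notify_command . ' > /dev/null 2>&1 &');
```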
Also, my guess is that thold_process was backed up, and broke loose at the same time. Are you using the thold_daemon?
@TheWitness
We are not using the daemon. I am actually really surprised at the performance of thold without the daemon, although we have a massive VM for our main instance, so that could also be part of it.
Since your last bunch of updates, thold has been pretty fast.
I think you might be right; possibly the process went sideways because of the flood of alerts? I think it would be awesome to have a queue mechanism as well, so you can see what still needs to be processed and even clear the queue if you needed to.
Again, this only happens under load, and in reality an unrealistic load for threshold breaches. If we were talking about up/down alerts, that would be different, in my view anyway.
I think the right thing to do here @bmfmancini is to open a feature request to queue the thold notifications. Performing serial sendmails is not going to be efficient, and they could be done in parallel. Maybe the command to notify your ticketing system is really slow. Maybe that has something to do with it too.
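Something along these lines is what a queue could look like; the table name, DSN, and helper functions below are purely illustrative, not thold's actual schema or API:

```php
<?php
// Hypothetical sketch: the trigger path only inserts a row, and a separate
// worker drains the queue and backgrounds each command.
$db = new PDO('mysql:host=localhost;dbname=cacti', 'cactiuser', 'cactipass');

// Producer side (called when a threshold breaches): enqueue instead of exec().
function queue_command(PDO $db, int $thold_id, string $command): void {
    $stmt = $db->prepare(
        'INSERT INTO thold_command_queue (thold_id, command, queued_at)
         VALUES (?, ?, NOW())'
    );
    $stmt->execute([$thold_id, $command]);
}

// Worker side (run from cron or the thold daemon): drain a bounded batch so a
// flood of breaches cannot fork thousands of processes at once on a small VM.
function drain_queue(PDO $db, int $batch_size = 50): void {
    $rows = $db->query(
        'SELECT id, command FROM thold_command_queue
         ORDER BY queued_at LIMIT ' . (int) $batch_size
    )->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // Background the command so the worker keeps moving.
        exec($row['command'] . ' > /dev/null 2>&1 &');
        $db->prepare('DELETE FROM thold_command_queue WHERE id = ?')
           ->execute([$row['id']]);
    }
}
```

A queue table like this would also give you the visibility @bmfmancini asked for: you can see what is still pending and clear it out if needed.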
Will do @TheWitness
@bmfmancini and @netniV, I've started to implement this. It's likely I will not finish until tomorrow or maybe early next week.
@bmfmancini and @xmacan , I'm going to log this as 'complete'. We just need to QA. I'll open a separate ticket for parallelization.