Allow thold commands to be queued and run in parallel so as to not affect the polling process
Hey Everyone
Cacti V1.2.12 THOLD 1.4
We ran a test today where we triggered 1600+ thresholds into our ticketing system. All of the thresholds are set to execute a script that calls an API for our ticketing system.
What we noticed is that, under load, some of the thresholds executed multiple times.
See below for an example:
2020/06/16 10:17:08 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:09 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:10 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:28 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:28 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:35 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:38 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
2020/06/16 10:17:42 - THOLD NOTE: Threshold command execution for TH[44588] returned 0, with output 201
I also see in the thold log that the threshold went into trigger state the same number of times. During regular operation we don't see this behaviour, so maybe it's some sort of queue issue?
Yea, this would be a problem if the queries blocked. Instead they should be queued up and executed in parallel and in the background. If it's the exception and not the rule, and we don't care about the exit status, we can just background them all, though it might overload smaller VMs.
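To illustrate the difference, here is a minimal sketch; the command string is just a placeholder, not thold's actual notification call:

```php
<?php
// Placeholder command; not the real thold script invocation.
$notify_command = '/usr/local/bin/notify_ticket.sh 44588';

// Serial/blocking: the calling process waits for the command to finish,
// so a slow ticketing API holds up everything queued behind it.
exec($notify_command, $output, $return_code);

// Backgrounded: redirect output and append '&' so the shell returns immediately.
// The exit status is lost, which is only acceptable if we don't care about it,
// and firing 1600+ of these at once could overload smaller VMs.
exec($notify_command . ' > /dev/null 2>&1 &');
```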
Also, my guess is that thold_process was backed up, and broke loose at the same time. Are you using the thold_daemon?
@TheWitness
We are not using the daemon. I am actually really surprised at the performance of thold without the daemon, although we have a massive VM for our main instance, so that could also be part of it.
Since your last bunch of updates, thold has been pretty fast.
I think you might be right; possibly the process went sideways because of the flood of alerts? I think it would be awesome to have a queue mechanism as well, so you can see what still needs to be processed and even clear the queue if you needed to.
Again, this only happens under load, and in reality an unrealistic load for threshold breaches. If we were talking about up/down alerts, that would be different, in my view anyway.
I think the right thing to do here @bmfmancini is to open a feature request to queue the thold notifications. Performing serial sendmails is not going to be efficient, and they could be done in parallel. Maybe the command to notify your ticketing system is really slow. Maybe that has something to do with it too.
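Something along these lines is what a queue could look like; the table name, DSN, and helper functions below are purely illustrative, not thold's actual schema or API:

```php
<?php
// Hypothetical sketch: the trigger path only inserts a row, and a separate
// worker drains the queue and backgrounds each command.
$db = new PDO('mysql:host=localhost;dbname=cacti', 'cactiuser', 'cactipass');

// Producer side (called when a threshold breaches): enqueue instead of exec().
function queue_command(PDO $db, int $thold_id, string $command): void {
    $stmt = $db->prepare(
        'INSERT INTO thold_command_queue (thold_id, command, queued_at)
         VALUES (?, ?, NOW())'
    );
    $stmt->execute([$thold_id, $command]);
}

// Worker side (run from cron or the thold daemon): drain a bounded batch so a
// flood of breaches cannot fork thousands of processes at once on a small VM.
function drain_queue(PDO $db, int $batch_size = 50): void {
    $rows = $db->query(
        'SELECT id, command FROM thold_command_queue
         ORDER BY queued_at LIMIT ' . (int) $batch_size
    )->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // Background the command so the worker keeps moving.
        exec($row['command'] . ' > /dev/null 2>&1 &');
        $db->prepare('DELETE FROM thold_command_queue WHERE id = ?')
           ->execute([$row['id']]);
    }
}
```

A queue table like this would also give you the visibility @bmfmancini asked for: you can see what is still pending and clear it out if needed.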
Will do @TheWitness
@bmfmancini and @netniV, I've started to implement this. It's likely I will not finish until tomorrow or maybe early next week.
@bmfmancini and @xmacan , I'm going to log this as 'complete'. We just need to QA. I'll open a separate ticket for parallelization.