poller_recovery.php can probably be made more efficient
Describe the bug
We had an issue over the weekend that did not recover, due to the sheer amount of data that had been collected. The bug is related to the series of SQL queries done in the poller_recovery.php script, combined with the way the mysql daemon handles temporary tables and the host's available disk space. Details below.
We saw that the smaller machines with just a few hundred devices came back after a while, but the really big ones were stuck in Heartbeat mode with locally stored data growing and growing.
To Reproduce
Steps to reproduce the behavior:
- Break the connectivity between main and remote.
- Leave it in that state until the poller_output_boost table on the remote holds several tens of millions of items, or is even hitting the storage limits. This produces the "best effect" (explanation/details below).
- Recover the connectivity; poller_recovery.php should kick in.
- Depending on the DB capacity, it will crash in a SQL query (the Cacti log shows that /var/tmp/ could not be written).
Expected behavior
Recovery should work smoothly and transfer the data, no matter how large the number of entries grows. Scripts/queries should not crash with out-of-memory errors.
Additional context
The gory details are, as usual, a bit more complex to reproduce, but I think the issue is resolvable in an elegant way.
Currently the recovery mechanism relies on a sequence of SQL queries, which are quite costly, as we sadly experienced.
- the poller_output_boost table is sorted by timestamp
- the query looks into the first 150k entries and picks the highest timestamp (as I understood it, this is necessary to ensure that we don't split data that belongs to the same RRD file)
- the recovery grabs the items up to that timestamp and transfers them to the main poller
- the processed items get deleted from the table
With tens of millions of data points stored in the table, the "sort by timestamp" in particular can be a) quite long-running and b) very memory-consuming. In fact, it seems the subquery starts in a temporary table, which is written to disk if it does not fit into RAM. And if the disk space in /var/tmp/ is also not the biggest (probably a rather normal situation, that folder usually has a few gigs at most), the temp file uses up the whole disk space and the query crashes. This was the case in our setup.
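As a side note, and assuming a stock MariaDB/MySQL setup, the spill to disk can usually be confirmed with the standard server variables and status counters (nothing Cacti-specific, just the usual suspects):

SHOW VARIABLES LIKE 'tmpdir';
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';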
My idea to remodel the recovery process is:
Introduce a SQL query which takes less time and storage, let it run once and, based on the output, split the data into time frames. Then grab those time frames as usual and transfer the data.
What could also be of great help would be to keep the number of entries in the poller_output_boost table lower and to rename the tables after they reach a certain number of entries (configurable per poller?).
Back to the SQL query and the proposal for the new approach:
SQL can easily tell us how many items are in the table and what the minimum and maximum timestamps of all the currently stored data are:
SELECT count(1), min(time), max(time) FROM poller_output_boost;
This would not need any sorting at all and can be done in a single pass, so the processing time should be linear in the amount of data.
From the information returned plus the wanted number of data points per batch, we can calculate:
- the number of batches: batch_count = count / points_per_batch
- the time-step between two batches: delta = (max_time - min_time) / batch_count
Now start with the "min" time and select all data points between "min" and "min + delta". Then increase the batch number and add delta to both boundaries for the next query... and so on.
This would need no sorting at all, so no temp tables are generated, and it also works on explicit timestamps to avoid gaps in the data. Even if the calculated timestamps do not actually exist in the table, they are still valid borders for the batches.
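To make the idea a bit more concrete, here is a rough sketch of how the loop could look, assuming Cacti's usual db_* helpers and the $local_db_cnn_id connection handle that poller_recovery.php already uses; $points_per_batch and transfer_batch() are just placeholders for illustration, not existing code:

$stats = db_fetch_row('SELECT COUNT(*) AS cnt, MIN(time) AS min_time, MAX(time) AS max_time
    FROM poller_output_boost', true, $local_db_cnn_id);

if (!empty($stats) && $stats['cnt'] > 0) {
    $points_per_batch = 150000;
    $batch_count = max(1, (int) ceil($stats['cnt'] / $points_per_batch));
    $min_ts = strtotime($stats['min_time']);
    $max_ts = strtotime($stats['max_time']);
    $delta  = max(1, (int) ceil(($max_ts - $min_ts) / $batch_count));

    for ($i = 0; $i < $batch_count; $i++) {
        $lower = date('Y-m-d H:i:s', $min_ts + $i * $delta);
        $upper = date('Y-m-d H:i:s', $min_ts + ($i + 1) * $delta);

        if ($i == $batch_count - 1) {
            // make sure the final window really includes the last timestamp
            $upper = date('Y-m-d H:i:s', $max_ts + 1);
        }

        // no ORDER BY needed - the time window itself defines the batch
        $rows = db_fetch_assoc_prepared('SELECT local_data_id, rrd_name, time, output
            FROM poller_output_boost
            WHERE time >= ? AND time < ?',
            array($lower, $upper), true, $local_db_cnn_id);

        // transfer_batch() stands in for the existing "push to main poller" step
        transfer_batch($rows);

        db_execute_prepared('DELETE FROM poller_output_boost
            WHERE time >= ? AND time < ?',
            array($lower, $upper), true, $local_db_cnn_id);
    }
}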
Not sure whether it is necessary to actually delete transferred items, or whether it would be nicer to store the precalculated borders and the current batch number somewhere in SQL.
Overall this should work, and it should also be reasonably easy to adapt the existing mechanism. What do you think?
P.S. This is somewhat related to our old topic #4215.
You failed to state your Cacti release.
Hi Larry, sorry, indeed. We are currently still on 1.2.18 but already planned to go to 1.2.23 very soon. I checked the code though, and as far as I can see, the mechanisms are still the same in 1.2.23 - so the same applies to that version.
The theory does sound OK, although the delta between batches can vary, given that it depends on the number of items. Also, just because you go to the max timestamp within a given limit, that timestamp may still have items in the next batch. Of course, this understanding is based on the words presented rather than the actual SQL query, so if you have the query itself, that would help broaden the understanding of the goal.
I made a fundamental change to the batch size that fixed recovery and made it super fast. Let me look for the ticket.
Fixed here:
- issue#5101: When a remote poller fails, the recovery process may also fail
That was 1.2.23, but you might want to wait. We are almost at 1.2.24.
We will go for 23 for now, as we are already in testing and have almost finalised that. People are already waiting for the push-out. But thanks for pointing out that ticket and the new release; this might also help. We will catch up with the new version as soon as it is out. :)
@netniV sorry, I thought it would already be clear from my descriptions. The query that caused the crash is this one:
$max_time = db_fetch_cell("SELECT MAX(time)
    FROM (
        SELECT time
        FROM poller_output_boost
        ORDER BY time ASC
        LIMIT $record_limit
    ) AS rs", '', true, $local_db_cnn_id);
The sorting is the biggest problem in our case, as the created temp table did not fit anywhere (RAM or disk) and was, as a result, dropped halfway through. Hence I came up with the idea to create the batches differently.
Yes, you are right, the batches can be larger in certain situations, but this should not do any harm. When the current mechanism deals with smaller amounts of data, it sometimes also generates larger batches and still runs pretty smoothly, even with 250k+ items (we have already seen that in our setup). Also, there is always the possibility to split an overly large batch further in half, if needed.
Maybe give it a try with a switch in the global settings? "Activate alternative recovery algorithm" or something like that?
@TheWitness as mentioned, I saw that the query that caused our problem is still in 1.2.23, so it is probably not yet completely fixed. Do you have a chance to recreate the situation in a test setup on your side? I think the steps to reproduce are pretty clear - if you have any questions, just reach out to us! Always happy to help get this thing improved! :)
Thanks so far for all the great support on a great product! Bernd
One more thing ....
The recovery could use a bit more verbosity; I added a few lines for debugging on our side during this issue. Maybe once we have 1.2.23 running, I can take some time to go through the additions and create a pull request so that you can use that part to enhance the verbosity of the recovery process. It is always good to have a few hints in the logs, and currently this part is not too talkative.
@bernisys, are you good doing pull requests?
@bernisys, hmm, if there is an index on time, it should not take too long. BTW, that's a lot of records though. It's a good catch. To make it more efficient:
ALTER TABLE poller_output_boost ADD index `time`(`time`);
That's going to speed it up big time as MariaDB will use the index. Do it after a poller run. You might have to replicate that change to all remotes manually.
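If you want to confirm that the index is actually being picked up on one of the remotes (just a sanity check, nothing Cacti-specific), an EXPLAIN of the inner query should show the time index being used as a covering index instead of a filesort:

EXPLAIN SELECT time FROM poller_output_boost ORDER BY time ASC LIMIT 150000;
-- with the `time` index in place, the plan should show "Using index" rather than "Using filesort"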
I'm going with 1.2.24 this weekend!
@TheWitness Hi Larry,
regarding 1.2.24 - wow, you're releasing pretty fast; thanks for the heads-up, I will start planning for the next update already
regarding the ALTER TABLE - OK, this will indeed speed up searches, even if inserts will be a bit slower then, but I think this can be considered uncritical. But since the poller_output_boost table is re-created after rotation (during the boost cycle), at least on the main poller, this needs to go into the table creation code anyway. Do you intend to push this into 1.2.24 already? It shouldn't be too big a change in the code, and it is indeed helpful.
Another thing which I think is worth mentioning and looking into: when the boost table grows too large on the main poller, we would run into the same problem there, so maybe a throttling parameter would be necessary in order to tell the remote pollers to send data more slowly, or just pause for a bit, until the main poller has started processing the current poller_output_boost table (i.e. renaming it into the temp table, you probably know what I mean).
It could also be a good idea to split the poller_output_boost table into smaller portions during normal operation. One problem with larger tables is that the inserts take longer and longer, which can lead to data loss. We have seen this in our environment, where one of the pollers was at a system load of 250 while new data was coming in and was not able to process the data in time.
So, to be braced for longer outages, it could be a good idea to introduce a kind of rotation there, on both main and remote pollers, quite similar to renaming the table when the boost transfer into the RRD files starts (see the sketch below). The tricky part is probably that all the code parts which handle the table need to be made aware of this splitting, so that they iterate sequentially through all of the resulting tables.
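Just to make the rotation idea a bit more tangible (table names here are purely illustrative, not what Cacti actually uses), the atomic multi-rename that MariaDB/MySQL supports would be enough to swap in a fresh table without losing inserts:

CREATE TABLE poller_output_boost_next LIKE poller_output_boost;
RENAME TABLE poller_output_boost TO poller_output_boost_old,
             poller_output_boost_next TO poller_output_boost;
-- new inserts now land in the fresh (small) table, while poller_output_boost_old
-- can be transferred/processed and dropped afterwards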
Finally, regarding pull requests: yes, of course, I'm generally fine with them. Always happy to contribute! I just need to check the new 2FA mechanism to push code... I was not too active in development after 2FA was introduced on GitHub (shame on me). Probably I just need to deploy a pubkey to make it work smoothly without being nagged too much on every action. I need to find the time to get this working properly again.
@bernisys, my boost runs at about 40 million records these days. Since moving to the archive scheme with no deletes, it scales just fine. I'll add that index, but not in the upgrade script - in the boost processing part instead.
Removed the duplicate, as this minor change (the new index) is really a new scalability enhancement.
Index is now a part of the base install. Closing this now.