DB Cleanup: imported_*, Import History
Expected Behavior
The database should not grow endlessly
Current Behavior
imported_* tables grow over time, and there is no related cleanup. Well, there is - but it isn't triggered unless old entries have been wiped from import_run. Even in large environments this should hardly be noticeable. However, as soon as someone creates an Import Source with very volatile columns (like "uptime" or similar), growth can become excessive.
Possible Solution
First of all, this isn't such a big problem for most environments. The involved tables are optimized for reading, so even with a few hundred million rows it shouldn't hurt. But it wastes space on disk.
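To get a feeling for how much space those tables currently occupy, a quick look at information_schema helps (a sketch for MySQL/MariaDB; the schema name "director" is an assumption and has to match your actual Director database):
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.TABLES
WHERE table_schema = 'director' -- adjust to your Director schema name
  AND table_name LIKE 'import%';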
Cleanup strategies
- a) selectively deleting from those tables: incredibly expensive. The Housekeeping Job does so and should continue to do so in the future, but it doesn't scale. It would work fine as long as historic Import runs were wiped from import_run regularly, but this doesn't happen right now.
- b) drop and re-create the tables: not an easy task, as quite a few constraints are involved. We could of course provide related queries, but doing so would have an impact on "Purge" during the next Sync run. Not a good idea.
- c) start with fresh tables, copy recent data, drop the old ones, re-create them, fill them with preserved data: the only clean and efficient way to deal with environments that have already experienced data growth
We should provide a) and c). c) is important and should be triggered in the background on a future upgrade. No Import/Sync should run at that time - but don't worry, no disaster is to be expected in case they do. And we should provide a (configurable) cleanup strategy for import_run. Keeping only the 10 most recent imports might be a reasonable default. As soon as we delete rows in that table, the existing cleanup procedures will take care of all imported_* tables.
IMPORTANT NOTE for anyone reading this: YES, deleting old entries and running housekeeping will clean up those tables. NO, you shouldn't do so if you have gigabytes of data in those tables. It will block you for a very long time.
First draft of cleanup of very large tables - c):
-- 1. get rid of the import runs we no longer want to keep
DELETE FROM import_run
WHERE start_time < NOW() - INTERVAL 1 DAY;
OPTIMIZE TABLE import_run;
-- 2. copy the data still referenced by the remaining runs into fresh new_* tables
DROP TABLE IF EXISTS new_imported_rowset;
CREATE TABLE new_imported_rowset LIKE imported_rowset;
INSERT INTO new_imported_rowset
SELECT DISTINCT rowset_checksum FROM import_run;
DROP TABLE IF EXISTS new_imported_rowset_row;
CREATE TABLE new_imported_rowset_row LIKE imported_rowset_row;
INSERT INTO new_imported_rowset_row
SELECT rsr.*
FROM imported_rowset_row rsr
JOIN new_imported_rowset nrs
ON nrs.checksum = rsr.rowset_checksum;
DROP TABLE IF EXISTS new_imported_row;
CREATE TABLE new_imported_row LIKE imported_row;
INSERT INTO new_imported_row
SELECT r.*
FROM imported_row r
JOIN (
SELECT DISTINCT row_checksum AS row_checksum
FROM new_imported_rowset_row
) nrsr
ON nrsr.row_checksum = r.checksum;
DROP TABLE IF EXISTS new_imported_row_property;
CREATE TABLE new_imported_row_property LIKE imported_row_property;
INSERT INTO new_imported_row_property
SELECT rp.*
FROM imported_row_property rp
JOIN new_imported_row nr
ON nr.checksum = rp.row_checksum;
DROP TABLE IF EXISTS new_imported_property;
CREATE TABLE new_imported_property LIKE imported_property;
INSERT INTO new_imported_property
SELECT p.*
FROM imported_property p
JOIN (
SELECT DISTINCT property_checksum AS property_checksum
FROM new_imported_row_property
) nrp
ON nrp.property_checksum = p.checksum;
-- 3. drop the grown tables; the FK from import_run has to go first
ALTER TABLE import_run DROP FOREIGN KEY import_run_rowset;
DROP TABLE imported_row_property;
DROP TABLE imported_property;
DROP TABLE imported_rowset_row;
DROP TABLE imported_row;
DROP TABLE imported_rowset;
-- 4. re-create the imported_* tables with their original structure
CREATE TABLE imported_rowset (
checksum VARBINARY(20) NOT NULL,
PRIMARY KEY (checksum)
) ENGINE=InnoDB;
CREATE TABLE imported_row (
checksum VARBINARY(20) NOT NULL COMMENT 'sha1(object_name;property_checksum;...)',
object_name VARCHAR(255) NOT NULL,
PRIMARY KEY (checksum)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE imported_rowset_row (
rowset_checksum VARBINARY(20) NOT NULL,
row_checksum VARBINARY(20) NOT NULL,
PRIMARY KEY (rowset_checksum, row_checksum),
CONSTRAINT imported_rowset_row_rowset
FOREIGN KEY rowset_row_rowset (rowset_checksum)
REFERENCES imported_rowset (checksum)
ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT imported_rowset_row_row
FOREIGN KEY rowset_row_row (row_checksum)
REFERENCES imported_row (checksum)
ON DELETE RESTRICT
ON UPDATE CASCADE
) ENGINE=InnoDB;
CREATE TABLE imported_property (
checksum VARBINARY(20) NOT NULL,
property_name VARCHAR(64) NOT NULL,
property_value MEDIUMTEXT NOT NULL,
format enum ('string', 'expression', 'json'),
PRIMARY KEY (checksum),
KEY search_idx (property_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE imported_row_property (
row_checksum VARBINARY(20) NOT NULL,
property_checksum VARBINARY(20) NOT NULL,
PRIMARY KEY (row_checksum, property_checksum),
CONSTRAINT imported_row_property_row
FOREIGN KEY row_checksum (row_checksum)
REFERENCES imported_row (checksum)
ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT imported_row_property_property
FOREIGN KEY property_checksum (property_checksum)
REFERENCES imported_property (checksum)
ON DELETE RESTRICT
ON UPDATE CASCADE
) ENGINE=InnoDB;
-- 5. move the preserved data back into the re-created tables
INSERT INTO imported_property
SELECT * FROM new_imported_property;
INSERT INTO imported_row
SELECT * FROM new_imported_row;
INSERT INTO imported_row_property
SELECT * FROM new_imported_row_property;
INSERT INTO imported_rowset
SELECT * FROM new_imported_rowset;
INSERT INTO imported_rowset_row
SELECT * FROM new_imported_rowset_row;
-- 6. restore the FK on import_run and drop the temporary tables
ALTER TABLE import_run
ADD CONSTRAINT import_run_rowset
FOREIGN KEY rowset (rowset_checksum)
REFERENCES imported_rowset (checksum)
ON DELETE RESTRICT
ON UPDATE CASCADE;
DROP TABLE new_imported_property;
DROP TABLE new_imported_row_property;
DROP TABLE new_imported_rowset_row;
DROP TABLE new_imported_row;
DROP TABLE new_imported_rowset;
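After the swap, a quick sanity check based on the schema above shouldn't hurt (not part of the script itself): every remaining import run should still find its rowset, so this query is expected to return 0:
SELECT COUNT(*) AS orphaned_runs
FROM import_run ir
LEFT JOIN imported_rowset rs ON rs.checksum = ir.rowset_checksum
WHERE rs.checksum IS NULL;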
Hint: INTERVAL 1 DAY might not fit those who run all or some Import Sources manually only from time to time; it should be replaced with a "delete all but the last 10 runs per import source" logic. Anyway, anybody who triggers imports manually will probably not experience noticeable growth at all.
This has been tested in multiple large environments and can be rolled out with v1.8. With our new background daemon, tasks like this are now possible.
A question that came to my mind: Does this have an effect on purge handling with sync?
Why are we copying twice to new tables, instead of renaming the tables once the required data has been copied?
Okay, I see - MySQL cannot really handle duplicate constraint names and will likely have problems when renaming tables as well...
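For reference, a rename-based swap would look roughly like this (just a sketch, not something to run). The catch: CREATE TABLE ... LIKE does not copy foreign keys, and InnoDB requires constraint names to be unique per database, so the constraints could only be re-created under their original names once the old tables are gone - at which point we are back to dropping and re-adding constraints anyway:
-- hypothetical swap, shown only to illustrate why renaming doesn't buy much
RENAME TABLE imported_rowset TO old_imported_rowset,
             new_imported_rowset TO imported_rowset;
-- ... same for the other four tables, then drop the old_* tables and
-- re-add all foreign key constraints to the renamed ones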
Here is a better approach to cleaning up import_run - it keeps the last X runs for every source, so we don't have to rely on timestamps.
-- SELECT source_id, id, start_time FROM import_run
DELETE FROM import_run
WHERE id IN (
  SELECT id FROM (
    SELECT source_id, id, start_time,
           @run_rank := IF(@current_source = source_id, @run_rank + 1, 1) AS run_rank,
           @current_source := source_id
    FROM import_run
    ORDER BY source_id ASC, id DESC
  ) tmp
  WHERE run_rank > 10
);
I'm not too worried about speed here:
Query OK, 9141 rows affected, 1 warning (0.09 sec)
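On MySQL 8.0+ (and MariaDB 10.2+), the same cleanup can be expressed with a window function instead of user variables (a sketch, keeping the 10 most recent runs per source like the query above):
DELETE FROM import_run
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY id DESC) AS run_rank
    FROM import_run
  ) ranked
  WHERE run_rank > 10
);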
Here's an import_run cleanup query for PostgreSQL, keeping only the most recent 50 runs per source:
WITH import_counts AS (
  SELECT
    id,
    row_number() OVER (PARTITION BY source_id ORDER BY start_time DESC) AS row_number
  FROM import_run
)
DELETE FROM import_run r USING import_counts c
WHERE r.id = c.id
  AND c.row_number > 50;
... and then run icingacli director housekeeping run ALL and VACUUM FULL on the affected import* tables.
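For PostgreSQL, that follow-up could look like this (a sketch; vacuuming each table separately keeps it compatible with PostgreSQL versions that do not accept a table list):
-- after running: icingacli director housekeeping run ALL
VACUUM FULL import_run;
VACUUM FULL imported_rowset;
VACUUM FULL imported_rowset_row;
VACUUM FULL imported_row;
VACUUM FULL imported_row_property;
VACUUM FULL imported_property;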
ref/NC/727723
ref/IP/34207
ref/NC/743796 ref/IP/38165
ref/NC/767016