
DB Cleanup: imported_*, Import History

Thomas-Gelf opened this issue 6 years ago • 25 comments

Expected Behavior

The database should not grow endlessly

Current Behavior

The imported_* tables grow over time, and there is no dedicated cleanup for them. Well, there is one - but it isn't triggered unless old entries have been wiped from import_run. Even in large environments this growth usually goes unnoticed. However, as soon as someone creates an Import Source with very volatile columns (like "uptime" or similar), growth can become excessive.
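
To gauge whether an environment is affected, a quick size check helps. A minimal sketch for MySQL/MariaDB, assuming the Director schema lives in a database named director (adjust the schema name as needed):

-- How big are the imported_* tables right now?
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM information_schema.tables
 WHERE table_schema = 'director'
   AND table_name LIKE 'imported%';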

Possible Solution

First of all, this isn't such a big problem for most environments. The involved tables are optimized for reading, so even a few hundred million rows shouldn't hurt. But they waste space on disk.

Cleanup strategies

  • a) selectively deleting from those tables: incredibly expensive. The Housekeeping Job does so and should continue to do so in the future, but it doesn't scale. It would work fine as long as historic import runs were wiped from import_run regularly, but that doesn't happen right now (see the sketch after this list).
  • b) drop and re-create the tables: not an easy task, as there are quite a few constraints involved. We could of course provide related queries, but doing so would have an impact on "Purge" during the next Sync run. Not a good idea.
  • c) start with fresh tables, copy recent data, drop the old ones, re-create them, fill them with the preserved data: the only clean and efficient way to deal with environments that have already experienced such data growth.
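
For illustration, strategy a) boils down to something like the following sketch (not the exact Housekeeping implementation, just the shape of it; the ON DELETE CASCADE constraints in the schema below clean up the link tables along the way):

-- Drop rowsets no longer referenced by any import run;
-- imported_rowset_row entries go away via ON DELETE CASCADE
DELETE rs FROM imported_rowset rs
  LEFT JOIN import_run r ON r.rowset_checksum = rs.checksum
 WHERE r.id IS NULL;

-- Then drop rows that no rowset references any more;
-- imported_row_property entries cascade away as well
DELETE ir FROM imported_row ir
  LEFT JOIN imported_rowset_row rsr ON rsr.row_checksum = ir.checksum
 WHERE rsr.rowset_checksum IS NULL;

-- Finally, drop properties that no row references any more
DELETE p FROM imported_property p
  LEFT JOIN imported_row_property rp ON rp.property_checksum = p.checksum
 WHERE rp.row_checksum IS NULL;

On tables with hundreds of millions of rows, each of these statements scans and locks a lot, which is exactly why this approach doesn't scale.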

We should provide a) and c). c) is important and should be triggered in the background on a future upgrade. No Import/Sync should run at that time, but don't worry: no disaster is to be expected in case they do. We should also provide a (configurable) cleanup strategy for import_run; keeping only the 10 most recent imports might be a reasonable default. As soon as we delete rows from that table, the existing cleanup procedures will take care of all imported_* tables.

IMPORTANT NOTE for anyone reading this: YES, deleting old entries and running housekeeping will clean up those tables. NO, you shouldn't do so if you have gigabytes of data in them - it will block you for a very long time.

Thomas-Gelf • Aug 01 '19

First draft of the cleanup for very large tables, strategy c):

-- Expire import runs older than one day, then reclaim the freed space
DELETE FROM import_run
    WHERE start_time < NOW() - INTERVAL 1 DAY;
OPTIMIZE TABLE import_run;

-- Copy everything still referenced by the remaining runs into fresh
-- new_* tables, starting with the rowsets
DROP TABLE IF EXISTS new_imported_rowset;
CREATE TABLE new_imported_rowset LIKE imported_rowset;
INSERT INTO new_imported_rowset
  SELECT DISTINCT rowset_checksum FROM import_run;

DROP TABLE IF EXISTS new_imported_rowset_row;
CREATE TABLE new_imported_rowset_row LIKE imported_rowset_row;
INSERT INTO new_imported_rowset_row
  SELECT rsr.*
    FROM imported_rowset_row rsr
    JOIN new_imported_rowset nrs
      ON nrs.checksum = rsr.rowset_checksum;

DROP TABLE IF EXISTS new_imported_row;
CREATE TABLE new_imported_row LIKE imported_row;
INSERT INTO new_imported_row
  SELECT r.*
    FROM imported_row r
    JOIN (
      SELECT DISTINCT row_checksum AS row_checksum
        FROM new_imported_rowset_row
    ) nrsr
      ON nrsr.row_checksum = r.checksum;

DROP TABLE IF EXISTS new_imported_row_property;
CREATE TABLE new_imported_row_property LIKE imported_row_property;
INSERT INTO new_imported_row_property
  SELECT rp.*
    FROM imported_row_property rp
    JOIN new_imported_row nr
      ON nr.checksum = rp.row_checksum;

DROP TABLE IF EXISTS new_imported_property;
CREATE TABLE new_imported_property LIKE imported_property;
INSERT INTO new_imported_property
  SELECT p.*
    FROM imported_property p
    JOIN (
      SELECT DISTINCT property_checksum AS property_checksum
        FROM new_imported_row_property
    ) nrp
      ON nrp.property_checksum = p.checksum;

-- Detach import_run from the tables we are about to drop
ALTER TABLE import_run DROP FOREIGN KEY import_run_rowset;

-- Drop the bloated originals, children before parents
DROP TABLE imported_row_property;
DROP TABLE imported_property;
DROP TABLE imported_rowset_row;
DROP TABLE imported_row;
DROP TABLE imported_rowset;

-- Re-create the original schema
CREATE TABLE imported_rowset (
  checksum VARBINARY(20) NOT NULL,
  PRIMARY KEY (checksum)
) ENGINE=InnoDB;

CREATE TABLE imported_row (
  checksum VARBINARY(20) NOT NULL COMMENT 'sha1(object_name;property_checksum;...)',
  object_name VARCHAR(255) NOT NULL,
  PRIMARY KEY (checksum)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE imported_rowset_row (
  rowset_checksum VARBINARY(20) NOT NULL,
  row_checksum VARBINARY(20) NOT NULL,
  PRIMARY KEY (rowset_checksum, row_checksum),
  CONSTRAINT imported_rowset_row_rowset
    FOREIGN KEY rowset_row_rowset (rowset_checksum)
    REFERENCES imported_rowset (checksum)
    ON DELETE CASCADE
    ON UPDATE CASCADE,
  CONSTRAINT imported_rowset_row_row
    FOREIGN KEY rowset_row_row (row_checksum)
    REFERENCES imported_row (checksum)
    ON DELETE RESTRICT
    ON UPDATE CASCADE
) ENGINE=InnoDB;

CREATE TABLE imported_property (
  checksum VARBINARY(20) NOT NULL,
  property_name VARCHAR(64) NOT NULL,
  property_value MEDIUMTEXT NOT NULL,
  format enum ('string', 'expression', 'json'),
  PRIMARY KEY (checksum),
  KEY search_idx (property_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE imported_row_property (
  row_checksum VARBINARY(20) NOT NULL,
  property_checksum VARBINARY(20) NOT NULL,
  PRIMARY KEY (row_checksum, property_checksum),
  CONSTRAINT imported_row_property_row
    FOREIGN KEY row_checksum (row_checksum)
    REFERENCES imported_row (checksum)
    ON DELETE CASCADE
    ON UPDATE CASCADE,
  CONSTRAINT imported_row_property_property
    FOREIGN KEY property_checksum (property_checksum)
    REFERENCES imported_property (checksum)
    ON DELETE RESTRICT
    ON UPDATE CASCADE
) ENGINE=InnoDB;

-- Copy the preserved data back, parents before children
INSERT INTO imported_property
  SELECT * FROM new_imported_property;
INSERT INTO imported_row
  SELECT * FROM new_imported_row;
INSERT INTO imported_row_property
  SELECT * FROM new_imported_row_property;
INSERT INTO imported_rowset
  SELECT * FROM new_imported_rowset;
INSERT INTO imported_rowset_row
  SELECT * FROM new_imported_rowset_row;

-- Re-attach import_run to the fresh rowset table
ALTER TABLE import_run
  ADD CONSTRAINT import_run_rowset
    FOREIGN KEY rowset (rowset_checksum)
    REFERENCES imported_rowset (checksum)
    ON DELETE RESTRICT
    ON UPDATE CASCADE;

-- Finally, get rid of the temporary tables
DROP TABLE new_imported_property;
DROP TABLE new_imported_row_property;
DROP TABLE new_imported_rowset_row;
DROP TABLE new_imported_row;
DROP TABLE new_imported_rowset;
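
A quick plausibility check after the swap - every remaining import run must still resolve to a rowset (the filter on r.rowset_checksum skips failed runs that never produced one):

SELECT COUNT(*) AS broken_runs
  FROM import_run r
  LEFT JOIN imported_rowset rs ON rs.checksum = r.rowset_checksum
 WHERE r.rowset_checksum IS NOT NULL
   AND rs.checksum IS NULL;

This should return 0; anything else means an Import ran while the tables were being swapped.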

Thomas-Gelf • Aug 01 '19

Hint: INTERVAL 1 DAY might not fit those who run all or some of their Import Sources manually only from time to time; it should be replaced with a "delete all but the last 10 runs per Import Source" rule. Anyway, anybody who triggers imports manually will probably not see noticeable growth at all.

Thomas-Gelf • Aug 01 '19

This has been tested in multiple large environments and can be rolled out with v1.8. With our new background daemon, such and similar tasks are now possible.

Thomas-Gelf • Sep 27 '19

A question that came to my mind: Does this have an effect on purge handling with sync?

lazyfrosch • Jul 08 '20

Why are we copying twice to new tables, instead of renaming the tables once the required data has been copied?

lazyfrosch • Jan 27 '21

Okay, I see - MySQL cannot really handle duplicate constraint names and will likely have problems when renaming tables as well...
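
For context: InnoDB foreign key names must be unique per database, and CREATE TABLE ... LIKE does not copy foreign keys. A rename-based swap would therefore still have to drop and re-add every constraint under its original name; for example, the following fails as long as the original table still exists:

CREATE TABLE new_imported_rowset_row LIKE imported_rowset_row;
ALTER TABLE new_imported_rowset_row
  -- fails with a duplicate foreign key constraint name error,
  -- as the original table still owns this name
  ADD CONSTRAINT imported_rowset_row_rowset
    FOREIGN KEY (rowset_checksum)
    REFERENCES imported_rowset (checksum);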

lazyfrosch • Jan 27 '21

Here is a better approach to cleaning up import_run - it keeps the last X runs for any source, so we won't have to rely on timestamps.

-- SELECT source_id, id, start_time FROM import_run
DELETE FROM import_run
WHERE id IN (
  SELECT id FROM (
    SELECT source_id, id, start_time,
      @run_rank := IF(@current_source = source_id, @run_rank + 1, 1) AS run_rank,
      @current_source := source_id
    FROM import_run
    -- initialize the session variables, so repeated runs start fresh
    JOIN (SELECT @run_rank := 0, @current_source := NULL) vars
    ORDER BY source_id ASC, id DESC
  ) tmp
  WHERE run_rank > 10
);

I'm not too worried about speed here:

Query OK, 9141 rows affected, 1 warning (0.09 sec)
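
For what it's worth: on MySQL 8.0+ and MariaDB 10.2+ the same cleanup can be written with a window function, avoiding the session-variable bookkeeping (whose evaluation order is not guaranteed) entirely:

DELETE FROM import_run
WHERE id IN (
  SELECT id FROM (
    SELECT id,
      ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY id DESC) AS run_rank
    FROM import_run
  ) ranked
  WHERE run_rank > 10
);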

lazyfrosch • Jan 29 '21

Here's an import_run cleanup query for PostgreSQL, keeping only the most recent 50 runs per source:

WITH import_counts AS (
    SELECT
        id,
        row_number() OVER (PARTITION BY source_id ORDER BY start_time DESC)
    FROM import_run)
DELETE FROM import_run r USING import_counts c
WHERE r.id = c.id
    AND c.row_number > 50;

... and then run icingacli director housekeeping run ALL and VACUUM FULL on the affected import* tables.
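
Spelled out, the VACUUM step could look like this (one statement per table works on any supported PostgreSQL version; mind that VACUUM FULL takes an exclusive lock and rewrites each table):

VACUUM FULL imported_rowset;
VACUUM FULL imported_rowset_row;
VACUUM FULL imported_row;
VACUUM FULL imported_row_property;
VACUUM FULL imported_property;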

df7cb • May 11 '21

ref/NC/727723

carraroj • Jun 30 '21

ref/IP/34207

bobapple • Jun 30 '21

ref/NC/743796 ref/IP/38165

lippserd • Feb 08 '22

ref/NC/767016

tbauriedel • Nov 24 '22