
Proposal: Time series data for graph population.

Open · phixMe opened this issue on Aug 28, 2024 · 0 comments

We're looking to build a new dashboard that delivers operational value to engineers, shown in the screenshot below. Part of this work is building charts that rely on time series data.

[Screenshot: proposed operational dashboard]

Approach 1

Store each lineage event and its state as a single row, and add an index on the record time.

Pros
  • Simple to write (see the example insert after the schema below)
  • A timestamp is stored for each event
  • Flexible slice intervals (results can be computed per 10 minutes, 1 hour, 1 day, etc.)
Cons
  • Some work to aggregate counts of like records at query time (see below)
  • Row count scales with lineage events, so the table could grow large.
-- One row per lineage event, keyed by event time and run state
CREATE TABLE data_ops_lineage_metrics
(
    id          SERIAL PRIMARY KEY,
    metric_time TIMESTAMP    NOT NULL,
    state       VARCHAR(64)  NOT NULL
);

-- Index to support time-range scans at query time
CREATE INDEX idx_metric_time_name ON data_ops_lineage_metrics (metric_time);
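Under this schema, recording an event is a single-row insert. A minimal example (the literal state value is just one of the states used in the aggregation query below):

INSERT INTO data_ops_lineage_metrics (metric_time, state)
VALUES (NOW(), 'COMPLETE');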
-- Count events by state per time slice. generate_series() emits one slice
-- start per step between :startInterval and :endInterval; events are
-- truncated to the slice boundary and joined in. Note: truncated times only
-- align with slice starts when :intervalStep is 1 and :dateTruncUnit matches
-- :intervalUnit.
SELECT
    gs.start_interval AS start_interval,
    gs.start_interval + :intervalStep * ('1 ' || :intervalUnit)::INTERVAL AS end_interval,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'FAIL'), 0) AS fail,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'START'), 0) AS start,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'COMPLETE'), 0) AS complete,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'ABORT'), 0) AS abort
FROM
    generate_series(
        :startInterval::TIMESTAMP,
        :endInterval::TIMESTAMP,
        :intervalStep * ('1 ' || :intervalUnit)::INTERVAL
    ) AS gs(start_interval)
LEFT JOIN (
    SELECT
        id,
        DATE_TRUNC(:dateTruncUnit, metric_time) AS metric_time_truncated,
        state
    FROM
        data_ops_lineage_metrics
    WHERE
        metric_time >= :metricStartInterval::TIMESTAMP
) dom ON gs.start_interval = dom.metric_time_truncated
GROUP BY
    gs.start_interval
ORDER BY
    gs.start_interval;
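To make the parameterization concrete, here is just the bucket-generation piece with hypothetical literal values substituted for the bind parameters (hourly slices over one day; the dates are illustrative):

-- Equivalent of :startInterval = '2024-08-27', :endInterval = '2024-08-28',
-- :intervalStep = 1, :intervalUnit = 'hour' — yields 25 bucket start times
SELECT gs.start_interval
FROM generate_series(
    '2024-08-27 00:00:00'::TIMESTAMP,
    '2024-08-28 00:00:00'::TIMESTAMP,
    1 * ('1 ' || 'hour')::INTERVAL
) AS gs(start_interval);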

Approach 2

Pre-aggregate counts into fixed time buckets: rather than creating a new record per event, we would upsert into one summary row per interval, dramatically reducing the number of rows written (see the per-event sketch after the schema below).

-- One row per time bucket; counts are maintained in place
CREATE TABLE data_ops_lineage_metrics_summary (
    interval_start TIMESTAMP NOT NULL,
    interval_end TIMESTAMP NOT NULL,
    fail_count INTEGER DEFAULT 0,
    start_count INTEGER DEFAULT 0,
    complete_count INTEGER DEFAULT 0,
    abort_count INTEGER DEFAULT 0,
    PRIMARY KEY (interval_start)
);
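One way to read "upsert on insertion" is a per-event increment against the summary row. A minimal sketch, assuming hypothetical :eventTime/:eventState bind parameters and additive conflict handling (neither is spelled out in the proposal):

-- Sketch: increment the hourly bucket for a single incoming event
INSERT INTO data_ops_lineage_metrics_summary
    (interval_start, interval_end, fail_count, start_count, complete_count, abort_count)
VALUES (
    DATE_TRUNC('hour', :eventTime::TIMESTAMP),
    DATE_TRUNC('hour', :eventTime::TIMESTAMP) + INTERVAL '1 hour',
    (:eventState = 'FAIL')::INT,
    (:eventState = 'START')::INT,
    (:eventState = 'COMPLETE')::INT,
    (:eventState = 'ABORT')::INT
)
ON CONFLICT (interval_start)
DO UPDATE SET
    fail_count     = data_ops_lineage_metrics_summary.fail_count     + EXCLUDED.fail_count,
    start_count    = data_ops_lineage_metrics_summary.start_count    + EXCLUDED.start_count,
    complete_count = data_ops_lineage_metrics_summary.complete_count + EXCLUDED.complete_count,
    abort_count    = data_ops_lineage_metrics_summary.abort_count    + EXCLUDED.abort_count;

Alternatively, the batch rollup below periodically recomputes the trailing day's buckets from the raw event table and overwrites the summary rows: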
INSERT INTO data_ops_lineage_metrics_summary (interval_start, interval_end, fail_count, start_count, complete_count, abort_count)
SELECT
    DATE_TRUNC('hour', metric_time) AS interval_start,
    DATE_TRUNC('hour', metric_time) + INTERVAL '1 hour' AS interval_end,
    COUNT(*) FILTER (WHERE state = 'FAIL') AS fail_count,
    COUNT(*) FILTER (WHERE state = 'START') AS start_count,
    COUNT(*) FILTER (WHERE state = 'COMPLETE') AS complete_count,
    COUNT(*) FILTER (WHERE state = 'ABORT') AS abort_count
FROM
    data_ops_lineage_metrics
WHERE
    metric_time >= NOW() - INTERVAL '1 day'
GROUP BY
    interval_start
ON CONFLICT (interval_start)
DO UPDATE SET
    fail_count = EXCLUDED.fail_count,
    start_count = EXCLUDED.start_count,
    complete_count = EXCLUDED.complete_count,
    abort_count = EXCLUDED.abort_count;
-- Dashboard read: one row per bucket for the trailing day
SELECT *
FROM data_ops_lineage_metrics_summary
WHERE interval_start >= NOW() - INTERVAL '1 day'
ORDER BY interval_start;
Pros
  • Far fewer rows
  • Better performance for very large datasets
  • Rows map more directly to the output; the database more closely reflects the final state
Cons
  • Inflexible time granularity, since slices are baked into the storage layer. A granularity must be chosen up front and cannot be made finer later; if data is stored per hour, we cannot recover minute precision, for example.
  • Less precise
  • Harder to write (requires upsert logic)
