Proposal: Time series data for graph population.
We're looking to build a new dashboard that delivers operational value to engineers, shown in the following screenshot. Part of this is building charts that rely on time series data.
Approach 1
Create a simple record of lineage events, storing each event and its state as a single row, and add an index on the record time.
Pros
- Simple to write (a single-row insert per event; see the example after the DDL below)
- A timestamp for each event is stored
- Flexible slicing intervals: results can be bucketed per 10 minutes, 1 hour, 1 day, etc. (see the bound example after the query below)
Cons
- Some work is needed to aggregate counts of like records at query time (see the query below)
- Row count scales with lineage events, so the table could grow large
CREATE TABLE data_ops_lineage_metrics
(
    id          SERIAL PRIMARY KEY,
    metric_time TIMESTAMP NOT NULL,   -- when the lineage event was recorded
    state       VARCHAR(64) NOT NULL  -- e.g. 'START', 'COMPLETE', 'FAIL', 'ABORT'
);

-- Index to support time-range scans when bucketing.
CREATE INDEX idx_metric_time_name ON data_ops_lineage_metrics (metric_time);
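Under this approach the write path is a single-row insert per lineage event; for example (values here are hypothetical):

-- One row per lineage event received; values are illustrative.
INSERT INTO data_ops_lineage_metrics (metric_time, state)
VALUES (NOW(), 'COMPLETE');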
-- Build the requested buckets with generate_series, then left-join truncated
-- event rows so that empty buckets still appear with zero counts.
SELECT
    gs.start_interval AS start_interval,
    gs.start_interval + :intervalStep * ('1 ' || :intervalUnit)::INTERVAL AS end_interval,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'FAIL'), 0) AS fail,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'START'), 0) AS start,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'COMPLETE'), 0) AS complete,
    COALESCE(COUNT(dom.id) FILTER (WHERE dom.state = 'ABORT'), 0) AS abort
FROM
    generate_series(
        :startInterval::TIMESTAMP,
        :endInterval::TIMESTAMP,
        :intervalStep * ('1 ' || :intervalUnit)::INTERVAL
    ) AS gs(start_interval)
LEFT JOIN (
    SELECT
        id,
        DATE_TRUNC(:dateTruncUnit, metric_time) AS metric_time_truncated,
        state
    FROM
        data_ops_lineage_metrics
    WHERE
        metric_time >= :metricStartInterval::TIMESTAMP
) dom ON gs.start_interval = dom.metric_time_truncated
GROUP BY
    gs.start_interval
ORDER BY
    gs.start_interval;
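For illustration, here is the same query with hypothetical literal values bound in place of the named parameters, producing hourly buckets over one day:

-- Hypothetical bound example: hourly buckets for one day.
-- (:intervalStep = 1, :intervalUnit = 'hour', :dateTruncUnit = 'hour')
SELECT
    gs.start_interval AS start_interval,
    gs.start_interval + INTERVAL '1 hour' AS end_interval,
    COUNT(dom.id) FILTER (WHERE dom.state = 'FAIL') AS fail,
    COUNT(dom.id) FILTER (WHERE dom.state = 'COMPLETE') AS complete
FROM generate_series(
    TIMESTAMP '2024-01-01 00:00',
    TIMESTAMP '2024-01-02 00:00',
    INTERVAL '1 hour'
) AS gs(start_interval)
LEFT JOIN (
    SELECT id, DATE_TRUNC('hour', metric_time) AS metric_time_truncated, state
    FROM data_ops_lineage_metrics
    WHERE metric_time >= TIMESTAMP '2024-01-01 00:00'
) dom ON gs.start_interval = dom.metric_time_truncated
GROUP BY gs.start_interval
ORDER BY gs.start_interval;

One caveat on the flexibility claim: DATE_TRUNC only truncates to whole named units (minute, hour, day, ...), so a sub-unit step such as 10 minutes would need a different bucketing expression in the join, for example arithmetic on epoch seconds.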
Approach 2
Maintain a pre-aggregated summary table keyed by time bucket. Instead of creating a new record per lineage event, we upsert counts into fixed intervals, which dramatically reduces the number of rows that need to be written.
CREATE TABLE data_ops_lineage_metrics_summary (
    interval_start TIMESTAMP NOT NULL, -- bucket key, e.g. the top of the hour
    interval_end   TIMESTAMP NOT NULL,
    fail_count     INTEGER DEFAULT 0,
    start_count    INTEGER DEFAULT 0,
    complete_count INTEGER DEFAULT 0,
    abort_count    INTEGER DEFAULT 0,
    PRIMARY KEY (interval_start)
);
-- Roll up the last day of raw events into hourly buckets; re-running this
-- statement recounts and replaces each bucket's totals.
INSERT INTO data_ops_lineage_metrics_summary (interval_start, interval_end, fail_count, start_count, complete_count, abort_count)
SELECT
    DATE_TRUNC('hour', metric_time) AS interval_start,
    DATE_TRUNC('hour', metric_time) + INTERVAL '1 hour' AS interval_end,
    COUNT(*) FILTER (WHERE state = 'FAIL') AS fail_count,
    COUNT(*) FILTER (WHERE state = 'START') AS start_count,
    COUNT(*) FILTER (WHERE state = 'COMPLETE') AS complete_count,
    COUNT(*) FILTER (WHERE state = 'ABORT') AS abort_count
FROM
    data_ops_lineage_metrics
WHERE
    metric_time >= NOW() - INTERVAL '1 day'
GROUP BY
    interval_start
ON CONFLICT (interval_start)
DO UPDATE SET
    fail_count = EXCLUDED.fail_count,
    start_count = EXCLUDED.start_count,
    complete_count = EXCLUDED.complete_count,
    abort_count = EXCLUDED.abort_count;
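Note that this upsert replaces each bucket's counts wholesale rather than incrementing them, which keeps the rollup idempotent as long as the raw events for the window are still available to recount. If events were instead streamed into the summary one at a time, the upsert would need to increment; a minimal sketch of that variant (hypothetical, assuming a per-event application hook feeds it):

-- Hypothetical per-event increment variant: bump one bucket counter per event.
INSERT INTO data_ops_lineage_metrics_summary (interval_start, interval_end, fail_count)
VALUES (DATE_TRUNC('hour', NOW()), DATE_TRUNC('hour', NOW()) + INTERVAL '1 hour', 1)
ON CONFLICT (interval_start)
DO UPDATE SET fail_count = data_ops_lineage_metrics_summary.fail_count + 1;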
SELECT *
FROM data_ops_lineage_metrics_summary
WHERE interval_start >= NOW() - INTERVAL '1 day'
ORDER BY interval_start;
Pros
- Dramatically fewer rows
- Better read performance for very large datasets
- Rows translate more directly to chart output, so the database reflects the final state more closely
Cons
- Inflexible time slices, since bucket boundaries are baked into the storage layer. A granularity must be chosen up front and cannot be made finer later; if data is stored per hour, we cannot recover minute-level precision, for example
- Less precise (individual event timestamps are lost)
- Harder to write (requires upsert logic)