Try and implement the Tpc-DI benchmark
This issue looks at implementing the Tpc-DI benchmark to validate that features exist to support those type of implementations.
This includes topics such as slowely changing dimensions with history tracking, parsing csv files etc.
Support for slowely changing dimensions
SCD Type 2 is a requirement for Tpc-DI. Changes to a dimension row should be grouped by date and only the latest result of that date should be used as the value, so if three changes happens on 2001-02-03, the state from the last row should be used.
Solution 1
To support this, the following type of query is proposed to solve that issue:
SELECT
surrogate_key,
id,
date AS StartDate,
COALESCE(LEAD(date) OVER (PARTITION BY id ORDER BY date), '9999-12-31') AS EndDate,
LEAD(date) OVER (PARTITION BY id ORDER BY date) IS NULL AS IsCurrent,
payload[1] AS c1,
payload[2] AS c2,
payload[3] AS c3
FROM (
SELECT
surrogate_key_int64() as surrogate_key,
id,
date,
max_by(list(c1, c2, c3), SentTimestamp) AS payload
FROM accounts
GROUP BY id, date
) AS grouped
The values are grouped by the keys that require a new row to be formed, the last value is selected from those values. In Tpc-DI only the latest values per day should be used.
Surrogate key function is used to get a unique key for each grouping.
Lead function allows fetching the date of the next row to use that as an enddate, the same lead function can be used to check if this is the latest and then current row.
Solution 2 ✔
The second solution is to use window functions:
SELECT
surrogate_key_int64() OVER (PARTITION BY id, date) as key,
id,
date,
COALESCE(LEAD(date) OVER (PARTITION BY id ORDER BY date), '9999-12-31') AS EndDate,
LEAD(date) OVER (PARTITION BY id ORDER BY date) IS NULL AS IsCurrent,
c1,
c2,
c3
FROM accounts
WHERE ROW_NUMBER() OVER(PARTITION BY id, date ORDER BY SentTimestamp DESC) = 1
The idea is that the WHERE filter picks out the latest row for each date, then the projection assigns a surrogate key and picks out the end dates and checks is current. This solution requires changing the window operator to only use persistent tree if it is required, since surrogate_key_int64 would not require it per row, only per partition key.
Chosen solution
Solution 2 was chosen because of its overall simplicity compared to solution 1. last_value_agg is not required, and in terms of storage they are quite similar. Keeping references to the columns allows more easy usability.
SCD depending on another SCD table
Another issue is how to solve updating a SCD when it needs to join with another SCD table, and make sure it updates correctly with the correct dates.
To solve this the following solution is proposed:
CREATE VIEW accountsdata AS
SELECT
a.*
b.id,
list_sort_asc_null_last(list(a.transactionDate, greatest(a.transactionDate, b.effectiveDate))) as dates
FROM accounts a
INNER JOIN brokers b
ON a.brokerid = b.id AND b.effectiveDate <= a.endDate AND b.endDate >= a.transactionDate;
SELECT
surrogate_key_int64() OVER (PARTITION BY id, effectivedate) as key,
id,
effectivedate,
COALESCE(LEAD(effectivedate) OVER (PARTITION BY id ORDER BY effectivedate), '9999-12-31') AS EndDate,
LEAD(effectivedate) OVER (PARTITION BY id ORDER BY effectivedate) IS NULL AS IsCurrent
FROM (
SELECT
*,
list_first_difference(dates, LAG(dates) OVER (PARTITION BY a.id ORDER BY dates)) as effectivedate
FROM accountsdata
)
A list of dates from all the dependencies and itself is stored sorted. The first difference from the previous row is then selected as the new effective date. This means if there are multiple rows from joining with the brokers table above, they will be added in order.