data-diff
data-diff copied to clipboard
Diff using joins when both tables are within the same database.
Describe the solution you'd like
When data-diff sees both tables are in the same database, it may switch to an alternative, more efficient algorithm that uses joins, and therefor doesn't download any unnecessary data.
On one hand, that sounds like a good idea in the abstract.
On the other, if you have two tables in the same database containing (largely) the same data, that's indicative of some serious problems per basic relational theory. Is it actually likely that a case like this would see much use?
It could be used, for example, to validate short-term backups, or data versioning.
Also, since we'll be using joins instead of hashing, large diffs won't have such a big effect on our performance. So the almost-same-data limitation no longer applies.
Plan
-
Make sure there are no duplicate keys or NULL keys
-
Split the tables into small segments, like in our checksum diff, so we can query from multiple threads
-
(maybe) Checksum each segment pair before joining, for better performance if they are equal?
-
For each pair of segments, use
FULL OUTER JOIN
on the keys. Missing rows will produce NULL keys. When FULL OUTER JOIN isn't available (MySQL), we can useLEFT JOIN .. UNION ALL .. RIGHT JOIN ON left_table = NULL
, which should produce the same result, as there are no duplicate keys. -
Compare the columns using
IS DISTINCT FROM
, or equivalent. (<=>
in MySQL, etc.), which treats NULL as a distinct value, like Python. If it's not supported by the db, we can fall back to the(a=b OR (a is null AND b is null))
pattern. Return 0 if equal, or 1 if different. -
Count the diff for each column using
SUM(is_diff_col_N)
. Count the total diff usingsum(is_diff_col_N OR is_diff_col_N+1 OR ...),
-
Collect additional aggregations, such as sum of values. Open question: Do it in the same query, or in a separate query?
Algorithm draft
Here is how it might look in Python code.
(The actual code could end up looking quite different)
# init
pk_columns = ('id1', 'id2', ...)
extra_columns = ('timestamp', ...)
a = segment1
b = segment2
# diff
j = outerjoin(a, b).on(a[pk_columns] == b[pk_columns])
diffed = j.select(
'*',
**{f"is_diff_col_{c}": is_distinct_from(a[c], b[c]) for c in extra_columns}
)
diffed.select(
**{f"total_diff_col_{c}": sum(diffed[f"is_diff_col_{c}"] for c in extra_columns},
total_diff = sum(or_(diffed[f"is_diff_col_{c}"] for c in extra_columns )),
# stats - separate query?
total1 = count(a),
total2 = count(b),
**{f"sum1_col_{c}": sum(a[c]) for c in extra_columns}
**{f"sum2_col_{c}": sum(b[c]) for c in extra_columns}
)
Implemented.