Diff using joins when both tables are within the same database.

Open erezsh opened this issue 1 year ago • 3 comments

Describe the solution you'd like

When data-diff sees both tables are in the same database, it may switch to an alternative, more efficient algorithm that uses joins, and therefor doesn't download any unnecessary data.

Aug 16 '22 15:08 erezsh

On one hand, that sounds like a good idea in the abstract.

On the other, if you have two tables in the same database containing (largely) the same data, that's indicative of some serious problems per basic relational theory. Is it actually likely that a case like this would see much use?

Aug 16 '22 17:08 masonwheeler

It could be used, for example, to validate short-term backups, or data versioning.

Also, since we'll be using joins instead of hashing, large diffs won't have such a big effect on our performance. So the almost-same-data limitation no longer applies.

Aug 16 '22 18:08 erezsh

Plan

Make sure there are no duplicate keys or NULL keys
Split the tables into small segments, like in our checksum diff, so we can query from multiple threads
(maybe) Checksum each segment pair before joining, for better performance if they are equal?
For each pair of segments, use FULL OUTER JOIN on the keys. Missing rows will produce NULL keys. When FULL OUTER JOIN isn't available (MySQL), we can use LEFT JOIN .. UNION ALL .. RIGHT JOIN ON left_table = NULL, which should produce the same result, as there are no duplicate keys.
Compare the columns using IS DISTINCT FROM, or equivalent. (<=> in MySQL, etc.), which treats NULL as a distinct value, like Python. If it's not supported by the db, we can fall back to the (a=b OR (a is null AND b is null)) pattern. Return 0 if equal, or 1 if different.
Count the diff for each column using SUM(is_diff_col_N). Count the total diff using sum(is_diff_col_N OR is_diff_col_N+1 OR ...),
Collect additional aggregations, such as sum of values. Open question: Do it in the same query, or in a separate query?

Algorithm draft

Here is how it might look in Python code.

(The actual code could end up looking quite different)


# init
pk_columns = ('id1', 'id2', ...)
extra_columns = ('timestamp', ...)
a = segment1
b = segment2

# diff
j = outerjoin(a, b).on(a[pk_columns] == b[pk_columns])

diffed = j.select(
    '*',
    **{f"is_diff_col_{c}": is_distinct_from(a[c], b[c]) for c in extra_columns}
)

diffed.select(
    **{f"total_diff_col_{c}": sum(diffed[f"is_diff_col_{c}"] for c in extra_columns},

    total_diff = sum(or_(diffed[f"is_diff_col_{c}"] for c in extra_columns )),

    # stats - separate query?
    total1 = count(a),
    total2 = count(b),
    **{f"sum1_col_{c}": sum(a[c]) for c in extra_columns}
    **{f"sum2_col_{c}": sum(b[c]) for c in extra_columns}
)

Aug 30 '22 10:08 erezsh

Implemented.

Oct 14 '22 12:10 erezsh

data-diff data-diff copied to clipboard

Diff using joins when both tables are within the same database.

Plan

Algorithm draft

data-diff
data-diff copied to clipboard