data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

Let the differ choose a shared hashing algorithm

Open erezsh opened this issue 2 years ago • 7 comments

Right now we only support md5 for hashing columns.

However, for some databases that might not be feasible. For example, in mssql it's too slow, and sounds like Spanner only supports sha1.

I think the solution should be to allow several implementations in subclasses of Database, and have the differ choose the best one that's shared between them. (or throw an error if there isn't one)

See issues #51 and #99.

erezsh avatar Jun 23 '22 15:06 erezsh

Agree. sha1 is also sometimes faster than md5, because some instruction-sets have specific instructions for it, but not for md5. crc32 is also generally faster than both of those.

sirupsen avatar Jun 23 '22 16:06 sirupsen

I think this should be broken down into 2 tasks:

  1. Adding a list of supported algorithms to the base/child classes, and refactor md5_to_string to something generic (maybe something like hash_to_string?)
  2. Adding these changes to the interface/CLI where the DB type determines the intersection of hashing algorithms and offers a choice (and/or a default)

I'd be happy to start working on this, if it hasn't been picked up already.

abhinav1912 avatar Jun 27 '22 03:06 abhinav1912

@abhinav1912 it'd be awesome if you could pick this up!

If you can do some research for (1) about what kinds of "alternate" negotiations we can do, and how you'd implement it, that'd be a great start before you start implementing it 👀 For example, for MSSQL, how can we do strings if we can't do md5 fast?

sirupsen avatar Jun 27 '22 13:06 sirupsen

@sirupsen good point, I'll start looking into the alternatives and probably create a table of algorithms for each database. Will reach out on the Slack channel to clarify a few questions though. As for MSSQL, here's a good article highlighting the performance of each hashing algorithm. CHECKSUM has good perfomance but poor hashing, MD/SHA algorithms have worse performance but better hashing. Doing better than MD5 does seem a little difficult imo.

abhinav1912 avatar Jun 28 '22 03:06 abhinav1912

That MD5 performance doesn't look that bad... I wonder why it came up so slow in our tests. Maybe the instance just didn't have enough CPU? Might be worth confirming that result first by writing a basic driver!

sirupsen avatar Jun 28 '22 14:06 sirupsen

The performance point raises another question: is there be some script/utility that can calculate perf metrics for the current build? If not, then this could be a good addition.

abhinav1912 avatar Jun 28 '22 15:06 abhinav1912

@abhinav1912 working on it, will tag you in the PR — it's what generated what's in the README. Basically re-uses the type suite (otherwise usually benchmark suites get out of date)

sirupsen avatar Jun 28 '22 15:06 sirupsen

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

github-actions[bot] avatar Jun 01 '23 06:06 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

github-actions[bot] avatar Jun 09 '23 06:06 github-actions[bot]