data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

Support Databases that don't have md5 or fast md5 (MSSQL)

Open pnelson-de opened this issue 2 years ago • 18 comments

I was browsing through the list of supported databases on your readme and noticed MSSQL is not listed although it can be found in databases.py. Just wondering if this was intentional and also what the status of that connector is at the moment.

Cheers

pnelson-de avatar Jun 08 '22 18:06 pnelson-de

Hello!

Yes, we originally intended to include MSSQL, and even wrote the code for it. But the md5 function ended up being way too slow (x100 slower than postgres iirc). We considered it unusable, so we decided to put it aside for now.

If we find a solution to the performance issue, I'll be happy to add MSSQL officially.

erezsh avatar Jun 09 '22 08:06 erezsh

Hey @pnelson-de, thanks for writing in :)

In extension of Erez' reply, something he and I have discussed is that we will need to support a variety of databases where md5 hashing aggregation not supported. ElasticSearch is another example. For those, we may have to 'negotiate' a weaker form of hashing, such as simply a sum.

However, it's not a super safe default as it's not entirely unlikely you'll end up with the same number for segments even if the data is wrong...

However, there's already plenty of types in play that need to hash correctly to the same values across databases. 😅 Once the testing infrastructure is in place for that, we would look at databases that don't have proper support for checksumming.

sirupsen avatar Jun 09 '22 09:06 sirupsen

Guessing https://orderbyselectnull.com/2018/05/31/hashbytes-scalability/ is related / similar problem. Was looking to do an Oracle MSSQL check with this tool myself so I'll +1 this. I don't have a better idea for you all though at this moment.

Took a quick peak at https://docs.microsoft.com/en-us/sql/t-sql/functions/checksum-agg-transact-sql?view=sql-server-2017 but I'm not confident

visch avatar Jun 13 '22 14:06 visch

@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

sirupsen avatar Jun 13 '22 16:06 sirupsen

@visch the problem with CHECKSUM_AGG and CHECKSUM is they don't specify the hashing algorithm. If you can find out if it's SHA1, CRC32, etc. by testing some strings against HASHBYTES on the same instance with some selects, then we should be able to use it and 'negotiate the algorithm' with the other databases. Similar to how we, as of, negotiate the precision of timestamp(n) fields in https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc .

If it's not a checksum we can reproduce https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc and add support for negotiating a common algorithm, in the case of MSSQL, it'll be sum() instead of md5(). It shouldn't be too hard.

sirupsen avatar Jun 13 '22 16:06 sirupsen

@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

Oracle DB to Microsoft SQL

visch avatar Jun 13 '22 16:06 visch

@visch if you're up for implementing what I suggested above, let us know. You can join us in #tools-data-diff in the Locally Optimistic Slack..

If you have a supported database pair and wiling to test and provide feedback, we'd be eager to hear it :) Even just Oracle -> Oracle on the same table would be really useful for us.

sirupsen avatar Jun 13 '22 16:06 sirupsen

Looks like CHECKSUM uses their own fairly simple algorithm.. So not something the others would easily support. So I think we'd go with the sum() based approach for it to start.

sirupsen avatar Jun 13 '22 16:06 sirupsen

I don't have the time right now to look at this but in case someone wants to dive in

https://gitlab.com/vischous/oracle2mssql/-/blob/master/script.sh is a nice podman (replace that with docker if you want) script to spin up oracle / mssql dbs. Note there's a few extra commands you have to run with Oracle here https://gitlab.com/vischous/oracle2mssql/-/blob/master/oracle.sql

visch avatar Jun 13 '22 16:06 visch

Just adding my vote for SQL Server support! Best of luck solving the technical issues.

joshcrichman avatar Jun 28 '22 22:06 joshcrichman

+1 for MS SQL Server support

rhysla avatar Jul 27 '22 23:07 rhysla

Please vote by "reacting" on the post. That makes it easier to aggregate.

erezsh avatar Jul 28 '22 07:07 erezsh

Proposal for support for SQLServer

The hash: We take each column, turn its normalized form into a hex string, and then take the 16 left-most digits, 16 right-most digits, and possibly another 16 digits from its middle. We convert each of them into an int64, and together with the length of the string, we dot-product them with a constant vector of prime numbers. So length(col1) * prime1 + col1a * prime2 + col1b * prime3 + length(col2) * prime4 + col2a * prime5 + ... and so on.

Cons:

  • Might miss changes in long text values
  • Unclear what are the chances of a collision. They are probably pretty low, but not as low as md5.
  • The run speed will be linear to the amount of columns being compared.

Pros:

  • Should be very fast, and work for all databases (on sqlserver, 25m rows of 1 column should take around 10 seconds)
  • Should be good enough to catch any change in most fields, like floats and like dates.
  • If we want to reduce collisions even further, we can use a better "hashing function" at the expense of some performance. For example, something like col*(col+3) % prime , aka the Knuth variant. These are details we can easily change and test once the rest is implemented.

To summarize, I think it might be good enough for actual use, but it's hard to measure exactly how likely it is to break down. If anyone has thoughts or ideas about this, I'd love to hear them.

erezsh avatar Sep 24 '22 07:09 erezsh

My team would be very interested in having SQL Server support. I don't have background in hashing functions, so can't really speak to the best approach there, but a colleague and I would be willing to spend some time working on this. @erezsh should we chat first? Or should we just give it a go?

Elliot2718 avatar Jan 11 '23 21:01 Elliot2718

Hi @Elliot2718 ,

If you have any questions about the proposal I wrote, or how to approach implementing it, we can set up a chat and get you up to speed.

Keep in mind that the implementation should be done by creating a new Dialect Mixin that is compatible with https://github.com/datafold/sqeleton/ .

For example, here is the MD5 mixin implementation for postgres: https://github.com/datafold/sqeleton/blob/master/sqeleton/databases/postgresql.py#L29

And here is how it's included in data-diff: https://github.com/datafold/data-diff/blob/master/data_diff/databases/postgresql.py#L5

The implementation itself can reside in either sqeleton or data-diff, though I think for consistency it's better to put it in Sqeleton.

erezsh avatar Jan 12 '23 00:01 erezsh

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

github-actions[bot] avatar Jun 08 '23 06:06 github-actions[bot]

Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

joshcrichman avatar Jun 08 '23 07:06 joshcrichman

Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.

Yeah this should definitely remain open. I'm unsure if/when we as maintainers would have the bandwidth to work on this, but very open to community contributions in the meantime.

dlawin avatar Jun 26 '23 16:06 dlawin