Support Databases that don't have md5 or fast md5 (MSSQL)
I was browsing through the list of supported databases in your README and noticed MSSQL is not listed, although it can be found in databases.py. Just wondering if this was intentional, and also what the status of that connector is at the moment.
Cheers
Hello!
Yes, we originally intended to include MSSQL, and even wrote the code for it. But the md5 function ended up being way too slow (~100x slower than Postgres, IIRC). We considered it unusable, so we decided to put it aside for now.
If we find a solution to the performance issue, I'll be happy to add MSSQL officially.
Hey @pnelson-de, thanks for writing in :)
In extension of Erez's reply, something he and I have discussed is that we will need to support a variety of databases where md5 hashing aggregation is not supported. ElasticSearch is another example. For those, we may have to 'negotiate' a weaker form of hashing, such as simply a sum().

However, that's not a super safe default, since it's not entirely unlikely you'll end up with the same number for segments even when the data is wrong...

That said, there are already plenty of types in play that need to hash correctly to the same values across databases. 😅 Once the testing infrastructure is in place for that, we would look at databases that don't have proper support for checksumming.
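To make that 'negotiation' idea concrete, here is a minimal sketch of how a pair of databases could fall back to the strongest checksum style they both support. All names here (the enum, the capability table) are illustrative assumptions, not data-diff's actual API:

```python
from enum import IntEnum

class ChecksumKind(IntEnum):
    # Ordered weakest to strongest; purely illustrative names.
    SUM = 1      # e.g. a sum of per-row integers -- cheap but collision-prone
    NATIVE = 2   # e.g. SQL Server CHECKSUM_AGG -- vendor-specific algorithm
    MD5 = 3      # what data-diff uses today where md5() is available

# Hypothetical capability table, not the real connector registry.
CAPABILITIES = {
    "postgresql": {ChecksumKind.MD5, ChecksumKind.SUM},
    "mssql": {ChecksumKind.NATIVE, ChecksumKind.SUM},
}

def negotiate(db_a: str, db_b: str) -> ChecksumKind:
    """Pick the strongest checksum style both sides can compute identically."""
    common = CAPABILITIES[db_a] & CAPABILITIES[db_b]
    if not common:
        raise ValueError("no common checksum method")
    return max(common)

print(negotiate("postgresql", "mssql"))  # ChecksumKind.SUM
```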
Guessing https://orderbyselectnull.com/2018/05/31/hashbytes-scalability/ is a related / similar problem. I was looking to do an Oracle MSSQL check with this tool myself, so I'll +1 this. I don't have a better idea for you all at the moment, though.

Took a quick peek at https://docs.microsoft.com/en-us/sql/t-sql/functions/checksum-agg-transact-sql?view=sql-server-2017 but I'm not confident in it.
@visch are you looking to test Oracle MSSQL to another Oracle MSSQL?
@visch the problem with CHECKSUM_AGG and CHECKSUM is that they don't specify the hashing algorithm. If you can find out whether it's SHA1, CRC32, etc. by testing some strings against HASHBYTES on the same instance with a few SELECTs, then we should be able to use it and 'negotiate the algorithm' with the other databases. Similar to how we negotiate the precision of timestamp(n) fields in https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc .

If it's not a checksum we can reproduce, we'd add support for negotiating a common algorithm (see https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc ); in the case of MSSQL, it'll be sum() instead of md5(). It shouldn't be too hard.
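One way to do that probing: run a few SELECTs against a SQL Server instance and compare CHECKSUM's output with CRC32/SHA1 computed client-side. A rough sketch of such an exploratory script, assuming pyodbc and a local instance (the connection string and probe values are placeholders; this is not part of data-diff):

```python
import binascii
import hashlib
import pyodbc

# Placeholder connection string -- point it at any SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;UID=sa;PWD=<password>"
)
cur = conn.cursor()

for probe in ["", "a", "hello world", "0123456789"]:
    # CAST to VARCHAR so the server hashes the same bytes we hash locally.
    row = cur.execute(
        "SELECT CHECKSUM(CAST(? AS VARCHAR(100))),"
        "       HASHBYTES('SHA1', CAST(? AS VARCHAR(100)))",
        probe, probe,
    ).fetchone()
    server_checksum, server_sha1 = row
    print(
        repr(probe),
        "CHECKSUM:", server_checksum,
        "crc32 (local):", binascii.crc32(probe.encode()),
        "sha1 (local):", hashlib.sha1(probe.encode()).hexdigest(),
        "sha1 (server):", server_sha1.hex(),
    )
```

If CHECKSUM lined up with CRC32 or a prefix of a standard hash for all probes, it could be 'negotiated' with other databases; a later comment in this thread suggests it does not.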
> @visch are you looking to test Oracle MSSQL to another Oracle MSSQL?

Oracle DB to Microsoft SQL
@visch if you're up for implementing what I suggested above, let us know. You can join us in #tools-data-diff in the Locally Optimistic Slack.

If you have a supported database pair and are willing to test and provide feedback, we'd be eager to hear it :) Even just Oracle -> Oracle on the same table would be really useful for us.
Looks like CHECKSUM uses its own fairly simple algorithm, so it's not something the other databases would easily support. I think we'd go with the sum()-based approach for it to start.
I don't have the time right now to look at this, but in case someone wants to dive in:

https://gitlab.com/vischous/oracle2mssql/-/blob/master/script.sh is a nice podman script (replace podman with docker if you want) to spin up Oracle / MSSQL DBs. Note there are a few extra commands you have to run with Oracle, here: https://gitlab.com/vischous/oracle2mssql/-/blob/master/oracle.sql
Just adding my vote for SQL Server support! Best of luck solving the technical issues.
+1 for MS SQL Server support
Please vote by "reacting" on the post. That makes it easier to aggregate.
Proposal for SQL Server support
The hash: We take each column, turn its normalized form into a hex string, and then take the 16 left-most digits, the 16 right-most digits, and possibly another 16 digits from its middle. We convert each of them into an int64, and together with the length of the string, we dot-product them with a constant vector of prime numbers: `length(col1) * prime1 + col1a * prime2 + col1b * prime3 + length(col2) * prime4 + col2a * prime5 + ...` and so on.
Cons:
- Might miss changes in long text values
- Unclear what the chances of a collision are. They are probably pretty low, but not as low as with md5.
- The run speed will be linear in the number of columns being compared.
Pros:
- Should be very fast, and work for all databases (on SQL Server, 25M rows of 1 column should take around 10 seconds)
- Should be good enough to catch any change in most fields, like floats and dates.
- If we want to reduce collisions even further, we can use a better "hashing function" at the expense of some performance, for example something like `col*(col+3) % prime`, aka the Knuth variant. These are details we can easily change and test once the rest is implemented.
To summarize, I think it might be good enough for actual use, but it's hard to measure exactly how likely it is to break down. If anyone has thoughts or ideas about this, I'd love to hear them.
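To make the arithmetic concrete, here is a small Python illustration of the proposed per-row hash, computed client-side purely for readability (in practice data-diff would emit it as a SQL expression so the database does the work). The prime vector, the normalization stub, and the final modulus are illustrative assumptions, not part of the proposal:

```python
# Client-side illustration of the proposed hash; the real thing would be
# compiled into SQL so each database computes it itself.
PRIMES = [1000003, 1000033, 1000037, 1000039, 1000081, 1000099, 1000117, 1000121]

def normalize_to_hex(value) -> str:
    """Stand-in for data-diff's per-type normalization to a hex string."""
    return str(value).encode().hex()

def row_hash(*cols) -> int:
    terms = []
    for col in cols:
        h = normalize_to_hex(col)
        left = int(h[:16] or "0", 16)    # 16 left-most hex digits
        right = int(h[-16:] or "0", 16)  # 16 right-most hex digits
        terms.extend([len(h), left, right])
    # Dot-product the terms with a constant vector of primes. Reduced mod 2**63
    # here to keep the result bounded; a SQL version would need its own
    # overflow handling (e.g. modular arithmetic on BIGINTs).
    primes = (PRIMES * (len(terms) // len(PRIMES) + 1))[: len(terms)]
    return sum(t * p for t, p in zip(terms, primes)) % (2 ** 63)

print(row_hash(42, "2022-06-01 12:00:00", "hello"))
```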
My team would be very interested in having SQL Server support. I don't have background in hashing functions, so can't really speak to the best approach there, but a colleague and I would be willing to spend some time working on this. @erezsh should we chat first? Or should we just give it a go?
Hi @Elliot2718,
If you have any questions about the proposal I wrote, or how to approach implementing it, we can set up a chat and get you up to speed.
Keep in mind that the implementation should be done by creating a new Dialect Mixin that is compatible with https://github.com/datafold/sqeleton/ .
For example, here is the MD5 mixin implementation for postgres: https://github.com/datafold/sqeleton/blob/master/sqeleton/databases/postgresql.py#L29
And here is how it's included in data-diff: https://github.com/datafold/data-diff/blob/master/data_diff/databases/postgresql.py#L5
The implementation itself can reside in either sqeleton or data-diff, though I think for consistency it's better to put it in sqeleton.
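For orientation only, a new mixin might be shaped roughly like the sketch below, mirroring the structure of the linked postgres MD5 mixin. The abstract base name, the method name, and the emitted SQL are all hypothetical; defining them (in sqeleton or data-diff) is exactly the work this issue calls for:

```python
from abc import ABC, abstractmethod

class AbstractMixin_Checksum(ABC):
    """Hypothetical counterpart to the MD5 mixin, for databases without md5()."""

    @abstractmethod
    def checksum_as_int(self, s: str) -> str:
        """Return a SQL expression reducing the expression `s` to a BIGINT."""

class Mixin_Checksum(AbstractMixin_Checksum):
    """Sketch of what the SQL Server dialect might provide."""

    def checksum_as_int(self, s: str) -> str:
        # Placeholder only: CHECKSUM() is MSSQL-specific and not reproducible
        # on other engines. The real expression would implement the hex-slice /
        # prime dot-product hash proposed earlier in this thread.
        return f"CAST(CHECKSUM({s}) AS BIGINT)"
```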
This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.
Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added.
Yeah this should definitely remain open. I'm unsure if/when we as maintainers would have the bandwidth to work on this, but very open to community contributions in the meantime.