data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

Support for NoSQL/document-based databases?

Open ashrielbrian opened this issue 1 year ago • 5 comments

We do a lot of data ingestion/syncing from Firestore and MongoDB into BigQuery. Any plans in the roadmap for these kinds of DBs? I assume in order to support other database types, they would need to support running hashing algorithms (md5, sha) natively.

ashrielbrian avatar Jul 06 '22 03:07 ashrielbrian

So ideally, they support querying with an MD5 function. But, if they don't, you can still get chunks of data and do the MD5 on it outside of the database - it just isn't as fast/efficient. I have successfully done this with NoSQL databases, including MongoDB and Elasticsearch. I am hoping once we establish a pattern for these, we can get MongoDB and Elasticsearch on the shortlist, because I think they will be some of the most popular (as well as DynamoDB, and Firestore and BigQuery as you mention).

mrn3 avatar Jul 06 '22 05:07 mrn3

Hello. Thanks for asking. We have this task on our roadmap, but it is not in the highest priority, closer to the end of the list: firstly, because of our assumption that this is a rare use case; and secondly, because of a lack of users to test it.

Yes, you are right: the absence of hashes on the server side is the most significant challenge there. It breaks the main trick of why data-diff is fast on big data. See also #51.

However, there is good news: we are planning on reworking the database connecting machinery rather sooner than later — this task is closer to the top of our list. With that, users will be able to write their custom connectors that calculate the hashes "somehow" (including by downloading the full dataset locally). This might partially solve the problem.

We are discussing the roadmap and priorities right now — it will become more clear in the coming days or weeks. I will include your use case there.

nolar avatar Jul 06 '22 06:07 nolar

Thanks for the replies @mrn3 @nolar!

Yeah - I thought of having the hashes running locally, but then that defeats the purpose of data-diff's performance advantage, with only hashes being transmitted over the network, as opposed to all the rows.

Another issue I can think of is the lack of schema/enforced data types in NoSQL documents, which I understand is important when comparing hashes.

As for it being a rare use-case, our application team handles millions of users on document-based DBs, and we ingest them into columnar-style data warehouses for our analytics teams. So I am convinced it's not that rare of a use-case. Hope to see it pushed higher up the list, as I'm thoroughly impressed by the cleverness of this tool. Thanks again for all your hard work and making this open source!

ashrielbrian avatar Jul 06 '22 08:07 ashrielbrian

Just a note - it may be possible to infer the schema from looking at the data, at least in some cases. We currently do this to detect UUIDs, which the schema usually reports as varchar.

But I agree, having to download all the rows would nullify the usefulness of this tool.

erezsh avatar Jul 06 '22 09:07 erezsh

So I have done this with MongoDB and Elasticsearch in the past using functions in these files (the Leo Platform is open source):

https://github.com/LeoPlatform/connectors/blob/master/elasticsearch/lib/checksum.js https://github.com/LeoPlatform/connectors/blob/master/mongo/lib/checksum.js

As far as the schema goes, you can infer it, but we just provided configuration (and even transformation functions) as part of the input. All you are really trying to do is take chunk of data from a source system, run it through the same transformations that your pipeline does, do a MD5 on it, and then compare it to the MD5 on the same data in the target data store.

While doing the MD5 outside of the database is quite a bit slower, I wouldn't say doing it "nullifies the usefulness of this tool". There is still a lot of value in the tool, even if it runs slower when using data stores that don't support MD5 function in database queries.

I also agree that this isn't a "rare use case". Most of our primary data stores we want to compare to Snowflake and Elasticsearch are MongoDB, so we have quite a bit of "NoSQL" to deal with.

mrn3 avatar Jul 06 '22 17:07 mrn3

Also very interested in this use case, we have MongoDB + AWS DocumentDB source systems we would love to be able to diff against Redshift.

menzenski avatar Mar 09 '23 15:03 menzenski

Yeah it definitely seems like MongoDB will come up a lot since it is pretty widely used.

mrn3 avatar Mar 10 '23 13:03 mrn3

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

github-actions[bot] avatar May 28 '23 06:05 github-actions[bot]

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

github-actions[bot] avatar Jun 04 '23 06:06 github-actions[bot]

Commenting to re-open this issue - still very interested in this use case

menzenski avatar Mar 13 '24 18:03 menzenski