Add support for Athena

Open vmatt opened this issue 1 year ago • 2 comments

Hey, we're currently trying to validate data stored in S3 Parquet files that are exposed via Glue/Athena. Right now we can download the Parquet files, load them into DuckDB, and then use the DuckDB connector to do a joindiff. But what if, instead of doing this, we used Athena to calculate TableSegments and used that information to do a hashdiff?

The most important question is: can we implement the hashing and checksum queries in Athena? I saw that Presto is somewhat supported, but I'm not sure about the details.
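To make the question concrete, here is a rough sketch of what a per-segment checksum query might look like in Athena's Trino-style SQL. It is only an assumption about the shape of the query, not Reladiff's actual Presto implementation: the helper name `build_checksum_query`, the `'|'` separator, the `'<null>'` placeholder, and the choice to keep the last 15 hex digits of the md5 are all hypothetical. The SQL functions used (`md5`, `to_utf8`, `to_hex`, `from_base`, `substr`) are standard Trino functions, but I have not run this against Athena.

```python
def build_checksum_query(table: str, key: str, columns: list[str], lo: int, hi: int) -> str:
    """Build a Trino-style checksum query for rows with lo <= key < hi.

    Hypothetical helper for illustration; identifiers are interpolated
    as-is, so it must not be used with untrusted input.
    """
    # Render every column as text and join with a separator,
    # so that each row is hashed as a single string.
    parts = [f"coalesce(cast({c} AS varchar), '<null>')" for c in columns]
    row_text = " || '|' || ".join(parts)

    # md5() takes and returns varbinary. Keep the last 15 hex digits
    # (60 bits) so the parsed value fits into a bigint.
    row_hash = f"from_base(substr(lower(to_hex(md5(to_utf8({row_text})))), 18), 16)"

    # Sum as decimal to avoid bigint overflow on large segments.
    return (
        f"SELECT count(*) AS row_count, "
        f"sum(cast({row_hash} AS decimal(38, 0))) AS checksum "
        f"FROM {table} WHERE {key} >= {lo} AND {key} < {hi}"
    )


if __name__ == "__main__":
    # Table and column names here are made up.
    print(build_checksum_query("orders", "id", ["id", "amount"], 0, 1000))
```

If both sides return the same `row_count` and `checksum` for a key range, the segment can be treated as identical; otherwise it would need to be split further or downloaded.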

vmatt, Nov 08 '24 15:11

Yes, Presto is supported. I am not familiar with the particulars of Athena, but for a database to be suitable for Reladiff it has to:

  • have a fast md5 operation
  • have indexes on the keys
  • have fast min/max operations on the keys (usually comes with indexes)
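To illustrate why these three things matter, here is a simplified, local sketch of the segment-and-checksum idea, written in plain Python rather than SQL. It is a one-level approximation under my own assumptions, not Reladiff's real algorithm: tables are modelled as dicts from integer key to a tuple of values, neither table is empty, and the names `row_checksum`, `segment_bounds`, and `differing_segments` are hypothetical. In a real database, the min/max lookup and the per-range scans are where indexes on the keys pay off, and the md5 call runs once per row.

```python
import hashlib


def row_checksum(row: tuple) -> int:
    """Hash one row to an integer using the last 60 bits of its md5."""
    text = "|".join(str(v) for v in row)
    return int(hashlib.md5(text.encode("utf-8")).hexdigest()[-15:], 16)


def segment_bounds(min_key: int, max_key: int, n: int) -> list[tuple[int, int]]:
    """Split [min_key, max_key] into at most n half-open ranges [lo, hi)."""
    step = -(-(max_key + 1 - min_key) // n)  # ceiling division
    return [(lo, min(lo + step, max_key + 1))
            for lo in range(min_key, max_key + 1, step)]


def segment_summary(rows: dict, lo: int, hi: int) -> tuple[int, int]:
    """Return (row count, sum of row checksums) for keys in [lo, hi)."""
    keys = [k for k in rows if lo <= k < hi]
    return len(keys), sum(row_checksum((k, *rows[k])) for k in keys)


def differing_segments(a: dict, b: dict, n: int) -> list[tuple[int, int]]:
    """Return the key ranges whose count or checksum differs between a and b."""
    # The min/max over the keys is the cheap lookup a key index provides.
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    return [seg for seg in segment_bounds(lo, hi, n)
            if segment_summary(a, *seg) != segment_summary(b, *seg)]


if __name__ == "__main__":
    table_a = {i: ("x", i) for i in range(100)}
    table_b = dict(table_a)
    table_b[42] = ("y", 42)  # one changed row
    print(differing_segments(table_a, table_b, 4))
```

Only the ranges that come back as different need a closer look, which is why slow hashing or slow range scans hurt so much: every segment on both sides has to be hashed at least once.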

For example, SQL Server isn't supported because its md5 function is too slow, i.e. 100x slower than PostgreSQL's.

If Athena has all these features, I think it should be possible to implement it.

erezsh, Nov 08 '24 20:11

For anyone wondering, I've started working on this issue, but there is no ETA for now.

vmatt, Apr 01 '25 18:04