Add support for Athena
Hey, we're currently trying to validate data stored in S3 Parquet files that are exposed via Glue/Athena. Right now, we download the Parquet files, load them into DuckDB, then use the DuckDB connector to do a joindiff. But what if, instead of doing this, we used Athena to calculate the TableSegments and used that information to do a hashdiff?
The most important question is: can we implement the hashing & checksum queries in Athena? I saw that Presto is somewhat supported, but I'm not sure about the details.
Yes, Presto is supported. I am not familiar with the particulars of Athena, but for a database to be suitable for Reladiff it has to:
- have a fast md5 operation
- have indexes on the keys
- have fast min/max operations on the keys (usually comes with indexes)
For example, SQL Server isn't supported because its md5 function is too slow: 100x slower than Postgres'.
If Athena has all these features, I think it should be possible to implement it.
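To illustrate why those three requirements matter, here is a rough sketch of the hashdiff idea in plain Python. It is not Reladiff's actual implementation: the tables are in-memory dicts, and the names `row_checksum`, `segment_checksum`, and `diff_keys`, the 60-bit truncation, and the `threshold` parameter are all assumptions made for the example. In a real database, each `segment_checksum` call would be a single aggregate query over a key range, which is where fast md5 and key lookups come in, and the initial `lo`/`hi` bounds would come from min/max queries on the key.

```python
import hashlib

def row_checksum(row):
    """Hash one row to an integer: low 60 bits of the md5 of its joined columns."""
    digest = hashlib.md5("|".join(map(str, row)).encode("utf-8")).hexdigest()
    return int(digest[-15:], 16)  # 15 hex digits = 60 bits

def segment_checksum(table, lo, hi):
    """Row count and summed checksum for keys in [lo, hi).

    In SQL this would be one aggregate query restricted to the key range.
    """
    rows = [(k, *v) for k, v in table.items() if lo <= k < hi]
    return len(rows), sum(row_checksum(r) for r in rows)

def diff_keys(a, b, lo, hi, threshold=4):
    """Return the keys in [lo, hi) whose rows differ between tables a and b."""
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return set()  # checksums match: treat the whole segment as identical
    if hi - lo <= threshold:
        # Segment is small enough: compare the rows directly.
        keys = {k for k in a.keys() | b.keys() if lo <= k < hi}
        return {k for k in keys if a.get(k) != b.get(k)}
    mid = (lo + hi) // 2  # bisect and recurse only into mismatching halves
    return diff_keys(a, b, lo, mid, threshold) | diff_keys(a, b, mid, hi, threshold)
```

A usage example, with integer keys mapping to tuples of column values:

```python
a = {i: (f"name{i}",) for i in range(100)}
b = dict(a)
b[42] = ("changed",)  # modified row
del b[7]              # missing row
lo = min(min(a), min(b))      # would be a min() query on the key
hi = max(max(a), max(b)) + 1  # would be a max() query on the key
changed = diff_keys(a, b, lo, hi)
```

Every level of the recursion issues checksum queries on both sides, so if md5 or key-range scans are slow, the cost multiplies quickly.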
For anyone wondering, I've started working on this issue, but there's no ETA for now.