datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Explore integration with Delta Lake

Open sunchao opened this issue 1 year ago • 5 comments

What is the problem the feature request solves?

Comet currently only support either Spark's built-in data sources, or Iceberg (WIP). We should also consider supporting Delta Lake in future especially given it already has a Rust implementation delta-rs. To achieve that, however, we may need to first move away from our hybrid Parquet reader implementation to a fully native one.

cc @dennyglee per our discussion

Describe the potential solution

Integrate Comet with delta-rs, so Spark queries reading from Delta Lake tables can also leverage Comet native execution.

Additional context

No response

sunchao avatar Mar 07 '24 17:03 sunchao

I'm thinking if we move to fully native reader, does it mean we need to drop current JVM-based source (built-in and Iceberg)? Or we are going to have two types of readers? I ask this because I think they might affect how we handle native execution and they might be conflicting each other. (maybe not, if we have different native source operator. 🤔 )

viirya avatar Mar 07 '24 18:03 viirya

... does it mean we need to drop current JVM-based source (built-in and Iceberg)?

TBH I don't have concrete ideas on how the switch to fully native Parquet read will look like at the moment. But, we definitely still need to support these after the migration. We might want to keep both implementations around for certain time before the new implementation get matured, and eventually remove the old one.

sunchao avatar Mar 07 '24 18:03 sunchao

Hello friends! Checking in from the delta-rs project :smile: We rely heavily on the parquet crate which now has support for almost all the data types I have every seen in the wild with Apache Parquet. We recently broke up subcrates so if you have your own file reading tools, pulling in deltalake-core will give you the smallest dependency surface area necessary to process a Delta table.

We do also publish deltalake-aws, deltalake-azure, deltalake-gcp which have storage specific requirements handled within them, such as some of the silly hacks to work around atomic renames in S3.

You can also take the metacrate deltalake if you want the whole bucket of fun :smile:

rtyler avatar Mar 08 '24 06:03 rtyler

Now that Comet supports DataFusion's DataSourceExec (when native_datafusion scan is enabled) it should be much easier to support delta-rs.

andygrove avatar Apr 03 '25 13:04 andygrove

Oh brilliant! Thanks!

dennyglee avatar Apr 03 '25 16:04 dennyglee