delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Datafusion integration assumes table's data files are local

Open sd2k opened this issue 4 years ago • 9 comments

The Datafusion integration passes a list of file paths representing a table's actual data to Datafusion's ParquetExec, but if the Delta table's StorageBackend is anything other than the FileStorageBackend then this fails because the files aren't local.

I'm not sure where this should be handled though - it feels like this should be part of Datafusion or an extension crate?

sd2k avatar Dec 13 '20 19:12 sd2k

Yeah, unfortunately, datafusion uses arrow parquet readers, which only supports local file at the moment: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L181. I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

@nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

houqp avatar Dec 13 '20 22:12 houqp

I think this is best handled by the rust parquet reader with minor adjustments to datafusion's execution plan after that.

Makes sense to me!

@nevi-me has plans to add S3 support to the parquet reader. If you are interested in extending the reader to support S3 or other cloud storages, I would recommend collaborating with him :)

Sounds good, I'll keep an eye on it and try and contribute an Azure reader when the time comes.

sd2k avatar Dec 14 '20 08:12 sd2k

What could work in the interim is to use DataFusion's in-memory datasource (https://docs.rs/datafusion/2.0.0/datafusion/datasource/memory/index.html). When we have async-support on Parquet, then we can change to the relevant methods.

nevi-me avatar Dec 16 '20 07:12 nevi-me

@nevi-me is there a bug anywhere to track S3 support? I took a brief look in the Arrow and Datafusion repos and didn't find anything. If you're open to it it's something that we could potentially look in to contributing.

meastham avatar Jun 15 '21 13:06 meastham

@meastham feel free to start a discussion for s3 support in the upstream datafusion github repo or in the arrow dev mailing list.

houqp avatar Jun 18 '21 06:06 houqp

Given object store support in datafusion, can a blob path integration be implemented assuming we have appropriate blobstore implementation of object_store interface? https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/object_store/mod.rs

I understand that given this, we can pass the file names prefixed with appropriate storage handler name from delta-rs, but my question is, is datafusion execution plan integration with this data source complete or is it still in progress?

gopik avatar Oct 23 '21 06:10 gopik

@gopik yes, we are pending on upstream object store support for s3. datafusion execution plan integration is all complete other than partition column support, which should be fairly straight forward to add.

houqp avatar Oct 23 '21 06:10 houqp

@houqp When you say upstream object support for s3, will that be part of datafusion project or it'll be part of an integration that is embedding datafusion?

gopik avatar Oct 23 '21 07:10 gopik

@gopik it will be part of datafusion, see https://github.com/apache/arrow-datafusion/issues/907

houqp avatar Oct 23 '21 08:10 houqp

With the adoption of object_store, the datafusion integration now supports all storage backends - there are integration tests as well :).

https://github.com/delta-io/delta-rs/blob/main/rust/tests/integration_datafusion.rs

roeap avatar Sep 02 '22 06:09 roeap