
DeltaLake Table Source offline store implementation

dmatrix opened this issue 4 years ago • 4 comments

Is your feature request related to a problem? Please describe. Scalable data stores, both data warehouses and modern data lakes, are common today as offline store implementations or as sources for data ingestion. These stores are optimized for analytics and support ACID transactions, so clean data can reside in them while new data is continually updated and merged, accommodating data changes and schema evolution. The ability to ingest data from these stores, or to expose them as an offline store via feast.data_source.FileSource, would extend Feast's ecosystem to modern data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg.

A similar feature request for HudiTableSource has been filed by @blvp

Describe the solution you'd like Extend feast.data_source.FileSource(...) to take table names and locations to read from, for both local and remote sources.

Describe alternatives you've considered I would have to save my Delta Lake tables as a single Parquet file and use that as a FileSource, which defeats the purpose of being able to ingest point-in-time data from these modern data lake sources.

dmatrix avatar May 05 '21 00:05 dmatrix

Just swinging back here @dmatrix. We'd love to support the sources you've laid out above. I think the biggest challenge from our side is just how many storage implementations we should support out of the box, and how many should be community contributed. We're trying to strike a balance here. Efforts like increased pluggability will be our short term solution, but I'd love to figure out if there are low hanging fruit that we haven't identified.

woop avatar Jul 05 '21 16:07 woop

@woop Yes, it's a matter of striking a balance and compromise: support out of the box a data source that has wide adoption, reliable vendor support, and a large user and contributor community. Yet we cannot overlook the growing adoption of modern data lakes, built on open file formats, as enterprises' central repository for all their data. So it may make sense to support one of these data sources.

As part of an easily extensible Feast ecosystem, making data sources pluggable is a viable and attractive short-term solution. Perhaps we can solicit community help by labeling and tagging this issue as "Community Help Needed" or "Need Community Help."

dmatrix avatar Jul 06 '21 01:07 dmatrix

Happy to take a crack at this. I have had a little dive into the source code.

I can see that there are redshift and redshift_source files, and the same for bigquery. Would I just need to reimplement that pattern?

Data-drone avatar Sep 30 '21 23:09 Data-drone
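For reference, the redshift / redshift_source split mentioned above pairs an offline store module with a data source module describing where the data lives. A minimal sketch of that shape for Delta Lake, with entirely hypothetical class and method names (not Feast's actual API):

```python
from dataclasses import dataclass

@dataclass
class DeltaLakeSource:
    """Hypothetical analogue of RedshiftSource: would live in a
    delta_source.py module and describe where the table lives."""
    path: str
    timestamp_field: str

class DeltaLakeOfflineStore:
    """Hypothetical analogue of the store class in redshift.py."""
    def describe_pull(self, source: DeltaLakeSource) -> str:
        # A real implementation would execute a point-in-time read
        # against the Delta table; this sketch only describes it.
        return (
            f"read Delta table at {source.path}, "
            f"ordered by {source.timestamp_field}"
        )

store = DeltaLakeOfflineStore()
source = DeltaLakeSource(path="/tables/driver_stats", timestamp_field="event_ts")
print(store.describe_pull(source))
```

The point of the pattern is the separation of concerns: the source object holds location and schema metadata, while the store class implements the retrieval methods against it.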

Looking at the source code for the Spark offline store, support for delta should be as easy as listing it in the SparkSourceFormat. Reading the table is done (as it should be) with df = spark_session.read.format(self.file_format).load(self.path). Reading a Delta table is then just a matter of passing .format("delta"), and Spark will take care of the rest.

creativedutchmen avatar Jun 04 '22 12:06 creativedutchmen
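As a minimal sketch of the change described above, assuming an enum shaped like Feast's SparkSourceFormat (the member names here are assumptions, not the actual Feast definitions):

```python
from enum import Enum

class SparkSourceFormat(Enum):
    """Sketch of a SparkSourceFormat-style enum; delta is the
    hypothetical new entry enabling Delta Lake tables."""
    csv = "csv"
    json = "json"
    parquet = "parquet"
    delta = "delta"  # proposed addition

def reader_expression(file_format: SparkSourceFormat, path: str) -> str:
    """Return the Spark read call this format would produce, mirroring
    df = spark_session.read.format(self.file_format).load(self.path)."""
    return f'spark_session.read.format("{file_format.value}").load("{path}")'

print(reader_expression(SparkSourceFormat.delta, "/tables/driver_stats"))
# spark_session.read.format("delta").load("/tables/driver_stats")
```

With the Delta Lake Spark connector on the classpath, .format("delta") is all the read path needs; the connector handles the transaction log and file layout.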