DeltaLake Table Source offline store implementation
Is your feature request related to a problem? Please describe.
Scalable data stores, such as data warehouses and modern data lakes, are common today as offline stores or as sources for data ingestion. These stores are optimized for analytics and support ACID transactions, so clean data can reside in them while new data is continuously updated and merged, accommodating data changes and schema evolution. The ability to ingest data from these stores, or to expose them as an offline store through feast.data_source.FileSource, would extend Feast's ecosystem to modern data lakes such as Delta Lake, Apache Hudi, and Apache Iceberg.
A similar feature request for a HudiTableSource has been filed by @blvp.
Describe the solution you'd like
Extend feast.data_source.FileSource(...) to accept table names and locations to read from, for both local and remote sources.
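A rough sketch of what this could look like, assuming FileSource learns to accept a table location and a "delta" format. Both of those are hypothetical; only the parquet usage reflects today's API:

```python
# Sketch only: the "delta" format value and table-location path are hypothetical.
from feast import FileSource
from feast.data_format import ParquetFormat

# Today: FileSource points at a single parquet file (local or object storage).
driver_stats = FileSource(
    path="data/driver_stats.parquet",
    file_format=ParquetFormat(),
    event_timestamp_column="event_timestamp",
)

# Proposed (hypothetical): point at a Delta/Hudi/Iceberg table location instead.
driver_stats_delta = FileSource(
    path="s3://my-bucket/delta/driver_stats",  # table location, not a single file
    file_format="delta",                       # hypothetical format value
    event_timestamp_column="event_timestamp",
)
```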
Describe alternatives you've considered
I would have to save my Delta Lake tables as a single parquet file and use that as a FileSource, which may defeat the purpose of being able to ingest point-in-time data from these modern data lake sources.
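Roughly, that workaround would look like the sketch below, using the delta-rs Python bindings. It only materializes the current snapshot of the table, so Delta's time-travel history is lost:

```python
# Workaround sketch: export the current snapshot of a Delta table to parquet
# and register that file as a plain FileSource. Time travel / history is lost.
from deltalake import DeltaTable  # pip install deltalake
from feast import FileSource

dt = DeltaTable("data/driver_stats_delta")  # path to the Delta table directory
dt.to_pandas().to_parquet("data/driver_stats.parquet")

driver_stats = FileSource(
    path="data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
)
```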
Just swinging back here @dmatrix. We'd love to support the sources you've laid out above. I think the biggest challenge from our side is just how many storage implementations we should support out of the box, and how many should be community contributed. We're trying to strike a balance here. Efforts like increased pluggability will be our short term solution, but I'd love to figure out if there are low hanging fruit that we haven't identified.
@woop Yes, it's a matter of striking a balance and compromise: supporting a data source out of the box that has wide adoption, reliable vendor support, and a large user or contributor community. Yet we cannot overlook the growing adoption of modern data lakes, based on open file formats, as enterprises' central repository for all their data. So it may make sense to support one of these data sources.
As part of an easily extensible Feast ecosystem, making data sources pluggable is a viable and attractive short-term solution. Perhaps we can solicit community help by labeling and tagging this issue as "Community Help Needed" or "Need Community Help."
Happy to take a crack at this. I have had a little dive into the source code.
I can see that there are redshift and redshift_source files, and the same for bigquery. Would I just need to reimplement those?
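Something like the skeleton below, perhaps? (The module and class names are hypothetical; nothing here exists in Feast today.)

```python
# Hypothetical skeleton mirroring the Redshift/BigQuery layout:
#   feast/infra/offline_stores/delta.py          -> DeltaOfflineStore
#   feast/infra/offline_stores/delta_source.py   -> DeltaSource
from dataclasses import dataclass


@dataclass
class DeltaSource:
    """Hypothetical analogue of RedshiftSource: points at a Delta table location."""
    path: str                    # local directory or s3://... table location
    event_timestamp_column: str  # column used for point-in-time joins


class DeltaOfflineStore:
    """Hypothetical analogue of RedshiftOfflineStore; a real implementation
    would provide pull_latest_from_table_or_query() and get_historical_features()."""
```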
Looking at the source code for the Spark offline store, support for delta should be as easy as listing it in the SparkSourceFormat. Reading the table is done (as it should be) with df = spark_session.read.format(self.file_format).load(self.path). Reading a Delta table is then just a matter of passing .format("delta"), and Spark will take care of the rest.
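For illustration, a minimal sketch of that read path with "delta" as the format string. The session configs and package version follow the standard Delta Lake setup and are placeholders here, not something Feast configures today:

```python
# Sketch: reading a Delta table through the same generic DataFrameReader call
# the Spark offline store already uses, just with "delta" as the format string.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("feast-delta-sketch")
    # Standard Delta Lake enablement for a Spark session; the version is a placeholder.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Equivalent to spark_session.read.format(self.file_format).load(self.path)
# when self.file_format == "delta":
df = spark.read.format("delta").load("s3://my-bucket/delta/driver_stats")
```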