iceberg-rust

Integrate with datafusion

Open ZENOTME opened this issue 1 year ago • 4 comments

Once basic scan and catalog support is in place, we can consider integrating with DataFusion to speed up data-driven tests.

ZENOTME, Mar 09 '24 14:03

I'm working on a draft so that we can have a clearer discussion based on it. Currently blocked by #277.

ZENOTME, Mar 17 '24 12:03

@ZENOTME I'm interested in your approach. Could you outline (at a high level) what you are planning to do? I'm curious and would like to understand where the integration points / interfaces might be. Thanks in advance and best regards.

marvinlanhenke, Mar 21 '24 13:03

Thanks for raising this discussion @marvinlanhenke! The basic idea for the integration is to provide wrapper structs around the types in iceberg-rust, so that users can plug them into DataFusion directly.

Implementation outline

1. Implement the traits for managing Iceberg tables.

DataFusion provides the following traits for managing tables:

  • CatalogProviderList
  • CatalogProvider
  • SchemaProvider
  • TableProvider

We can map them to the types in iceberg-rust:

  • CatalogProviderList: Maybe we don't need to implement this
  • CatalogProvider: Catalog in iceberg-rust
  • SchemaProvider: Namespace in iceberg-rust
  • TableProvider: Table in iceberg-rust

We can implement them by wrapping the corresponding iceberg-rust types internally, like:

struct IcebergCatalogProvider {
  inner: iceberg_rs::Catalog
}

impl CatalogProvider for IcebergCatalogProvider {
  ...
}
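To make the wrapping idea concrete, here is a minimal, self-contained sketch of the adapter. Everything in it is a simplified stand-in and an assumption: the `IcebergCatalog`, `CatalogProvider` and `SchemaProvider` traits below only mimic the rough shape of the real ones and do not match the actual iceberg-rust or DataFusion signatures (for example, the real catalog calls are, as far as I know, async and fallible), and `MemoryCatalog` and `example_provider` exist only to exercise the adapter.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the catalog abstraction on the iceberg-rust side
// (hypothetical and simplified, not the real trait).
trait IcebergCatalog: Send + Sync {
    fn list_namespaces(&self) -> Vec<String>;
    fn list_tables(&self, namespace: &str) -> Vec<String>;
}

// Stand-ins for the DataFusion-side traits (hypothetical and simplified).
trait SchemaProvider: Send + Sync {
    fn table_names(&self) -> Vec<String>;
}

trait CatalogProvider: Send + Sync {
    fn schema_names(&self) -> Vec<String>;
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;
}

// Adapter: one Iceberg namespace exposed as a schema.
struct IcebergSchemaProvider {
    inner: Arc<dyn IcebergCatalog>,
    namespace: String,
}

impl SchemaProvider for IcebergSchemaProvider {
    fn table_names(&self) -> Vec<String> {
        self.inner.list_tables(&self.namespace)
    }
}

// Adapter: the Iceberg catalog exposed as a catalog provider.
struct IcebergCatalogProvider {
    inner: Arc<dyn IcebergCatalog>,
}

impl CatalogProvider for IcebergCatalogProvider {
    fn schema_names(&self) -> Vec<String> {
        self.inner.list_namespaces()
    }

    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> {
        // Only hand out a schema for namespaces the catalog knows about.
        if !self.inner.list_namespaces().iter().any(|n| n == name) {
            return None;
        }
        let schema: Arc<dyn SchemaProvider> = Arc::new(IcebergSchemaProvider {
            inner: Arc::clone(&self.inner),
            namespace: name.to_string(),
        });
        Some(schema)
    }
}

// In-memory catalog used only to exercise the adapter.
struct MemoryCatalog {
    namespaces: HashMap<String, Vec<String>>,
}

impl IcebergCatalog for MemoryCatalog {
    fn list_namespaces(&self) -> Vec<String> {
        let mut names: Vec<String> = self.namespaces.keys().cloned().collect();
        names.sort(); // HashMap iteration order is unspecified
        names
    }

    fn list_tables(&self, namespace: &str) -> Vec<String> {
        self.namespaces.get(namespace).cloned().unwrap_or_default()
    }
}

// Builds a provider over two example namespaces.
fn example_provider() -> IcebergCatalogProvider {
    let mut namespaces = HashMap::new();
    namespaces.insert("raw".to_string(), vec!["events".to_string()]);
    namespaces.insert("analytics".to_string(), vec!["daily_counts".to_string()]);
    IcebergCatalogProvider {
        inner: Arc::new(MemoryCatalog { namespaces }),
    }
}
```

The point of the sketch is only the direction of the dependency: the wrapper owns the Iceberg-side object and answers the engine-side trait methods by delegating to it.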

2. Implement the trait for scanning the table.

We also need to implement an ExecutionPlan for the scan that TableProvider returns. For this part we can rely on TableScan in iceberg-rust.
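A rough sketch of how the scan side could fit together, again with stand-ins only. `Table`, `TableScan`, `ExecutionPlan` and `TableProvider` below are hypothetical, simplified models and not the real iceberg-rust or DataFusion types: the real `TableProvider::scan` is async and also receives session state, filters and a limit, and real plans produce Arrow record batches rather than `Vec<Vec<i64>>`. The sketch only shows one plausible flow: translate the projected column indices into column names, build a scan for those columns, and wrap the scan in a plan node.

```rust
use std::sync::Arc;

// Stand-in for an Iceberg table: column names plus row-oriented data.
struct Table {
    column_names: Vec<String>,
    rows: Vec<Vec<i64>>,
}

// Stand-in for the Iceberg-side table scan: reads only the selected columns.
struct TableScan {
    table: Arc<Table>,
    selected: Vec<String>,
}

impl TableScan {
    fn execute(&self) -> Vec<Vec<i64>> {
        // Resolve the selected names back to positions in the table.
        let indices: Vec<usize> = self
            .selected
            .iter()
            .filter_map(|name| self.table.column_names.iter().position(|c| c == name))
            .collect();
        self.table
            .rows
            .iter()
            .map(|row| indices.iter().map(|&i| row[i]).collect())
            .collect()
    }
}

// Stand-in for the engine-side physical plan trait.
trait ExecutionPlan {
    fn output_columns(&self) -> Vec<String>;
    fn execute(&self) -> Vec<Vec<i64>>;
}

// Plan node that simply delegates to the wrapped scan.
struct IcebergTableScanExec {
    scan: TableScan,
}

impl ExecutionPlan for IcebergTableScanExec {
    fn output_columns(&self) -> Vec<String> {
        self.scan.selected.clone()
    }

    fn execute(&self) -> Vec<Vec<i64>> {
        self.scan.execute()
    }
}

// Stand-in for the engine-side table trait.
trait TableProvider {
    fn scan(&self, projection: Option<&[usize]>) -> Arc<dyn ExecutionPlan>;
}

struct IcebergTableProvider {
    table: Arc<Table>,
}

impl TableProvider for IcebergTableProvider {
    fn scan(&self, projection: Option<&[usize]>) -> Arc<dyn ExecutionPlan> {
        // Column indices become column names; no projection means all columns.
        // An out-of-range index panics here; real code would return an error.
        let selected: Vec<String> = match projection {
            Some(indices) => indices
                .iter()
                .map(|&i| self.table.column_names[i].clone())
                .collect(),
            None => self.table.column_names.clone(),
        };
        Arc::new(IcebergTableScanExec {
            scan: TableScan {
                table: Arc::clone(&self.table),
                selected,
            },
        })
    }
}

// Builds a provider over a small example table.
fn example_table_provider() -> IcebergTableProvider {
    IcebergTableProvider {
        table: Arc::new(Table {
            column_names: vec!["id".into(), "qty".into(), "price".into()],
            rows: vec![vec![1, 10, 100], vec![2, 20, 200]],
        }),
    }
}
```

Pushing the projection down into the scan, instead of reading every column and dropping some afterwards, is the main reason to route the scan through the table's own scan type.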

Feel free to share any suggestions or improvements, and please let me know if anything is confusing.

ZENOTME, Mar 21 '24 14:03

> DataFusion provides the following traits for managing tables:
>
>   • CatalogProviderList
>   • CatalogProvider
>   • SchemaProvider
>   • TableProvider

Thank you so much for taking the time and making the effort to outline the approach.

I was just looking for the traits you mentioned. The rest is basically (over-simplified) just providing an adapter, which is reasonable and easy to understand.

Perhaps one more question, though, to solidify my understanding...

...we would have to add datafusion as a dependency and implement those traits on our side, in order to provide the specific implementation of a CatalogProvider e.g. for the HiveMetastore. Then, a user can add our 'catalog provider' crate to their project alongside datafusion and use our provider. Is that correct?
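If that understanding is right, the user-side wiring might look roughly like the sketch below. It is self-contained and uses stand-ins only: `CatalogProvider` and `SessionContext` are simplified models of the engine-side types, `HmsCatalogProvider` is a hypothetical name for what a provider crate could export, and the catalog name "iceberg" is arbitrary. None of this is the actual API of either project.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the engine-side catalog trait (hypothetical and simplified).
trait CatalogProvider {
    fn schema_names(&self) -> Vec<String>;
}

// Stand-in for a type the provider crate would export,
// e.g. one backed by a Hive Metastore (hypothetical name).
struct HmsCatalogProvider {
    namespaces: Vec<String>,
}

impl CatalogProvider for HmsCatalogProvider {
    fn schema_names(&self) -> Vec<String> {
        self.namespaces.clone()
    }
}

// Stand-in for the query engine's session: a registry of named catalogs.
struct SessionContext {
    catalogs: HashMap<String, Arc<dyn CatalogProvider>>,
}

impl SessionContext {
    fn new() -> Self {
        Self {
            catalogs: HashMap::new(),
        }
    }

    // Registers a catalog under `name`, returning the one it replaced, if any.
    fn register_catalog(
        &mut self,
        name: &str,
        provider: Arc<dyn CatalogProvider>,
    ) -> Option<Arc<dyn CatalogProvider>> {
        self.catalogs.insert(name.to_string(), provider)
    }

    fn catalog(&self, name: &str) -> Option<Arc<dyn CatalogProvider>> {
        self.catalogs.get(name).cloned()
    }
}

// What user code would do: build the provider, then hand it to the session.
fn user_setup() -> SessionContext {
    let mut ctx = SessionContext::new();
    let provider = HmsCatalogProvider {
        namespaces: vec!["default".to_string()],
    };
    ctx.register_catalog("iceberg", Arc::new(provider));
    ctx
}
```

The engine only ever sees the trait object, so the user's project would depend on both crates while the engine itself needs no knowledge of Iceberg.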

Thanks again for explaining the approach.

marvinlanhenke, Mar 21 '24 17:03

Basic integration has been added in https://github.com/apache/iceberg-rust/pull/324.

We can create new tracking issues for the missing pieces.

Xuanwo, Aug 19 '24 16:08