datafusion
datafusion copied to clipboard
FFI initial implementation
Which issue does this PR close?
This is to address part of https://github.com/apache/datafusion-python/issues/823 downstream but may have wider application than just python.
Rationale for this change
This PR allows for registering table providers via a stable FFI. With this change it enables breaking the requirement for python providers to include all of datafusion-python and re-export it. With this change we can allow providers with different underlying datafusion versions to interoperate.
What changes are included in this PR?
Adds support for TableProvider via FFI. In order to support this, it also includes ExecutionPlan, SessionConfig, PlanProperties, and TableType. As this gets used more, I expect we will want to expose other features but this gives an initial first implementation that solves an immediate need.
Are these changes tested?
Some unit tests are provided. Additionally I did the following test:
I created a separate crate with the contents of datafusion/ffi so that I can test it against different versions of DataFusion by modifying the dependencies in Cargo.toml. Then I used this crate to build a test implementation of datafusion-python against DataFusion 42.0.0. I adjusted the test crate and built a test implementation of delta-rs against DataFusion 41.0.0. Then I registered the delta table in python against the session context. I was able to query the table with push down filters via this FFI interface even though the underlying DataFusion versions were different.
Additionally I ran memory leak checks against the provided unit tests and against running in python.
Are there any user-facing changes?
This is not breaking, but a pure addition of a new datafusion-ffi library.
Remaining Issues
- [x] There is some inconsistency between the usage of
ExportedXYZand just using the rawFFI_XYZ. We should make the usage consistent across all struct types. - [x] Add documentation to explain the reasoning behind creating the data the way we do in the private data and foreign structs.
- [x] Add documentation to explain more clearly the delineation between the
ExportedXYZandForeignXYZ. It would probably be good to have a use case since which is "foreign" and which is "exported" can be complicated during some of the function calls.
In response to https://github.com/apache/datafusion/issues/12357 I believe this PR should go into the core DataFusion because it does allow for extension of core features and enabling of runtime loading of libraries. However the driving case is datafusion-python so I would be comfortable moving it either there or to a datafusion-contrib repository.
I see this PR and plan to review it, hopefully today but likely tomorrow (Mon)
Thanks for the feedback! I'll try to address the remaining concerns in the next few days.
After you are done I'll gladly give it another look! 🚀
Adding issue https://github.com/apache/datafusion/issues/13175 to track additional documentation. I'd like to merge this in and work on that (I've assigned it to myself) so we can unblock datafusion-python and delta-rs
Thank you for the reviews. I just did a quick rebase so the merge can go through. I'll get started on that end to end example in a different PR.