datafusion
datafusion copied to clipboard
Break datafusion-catalog code into its own crate
Is your feature request related to a problem or challenge?
As we start thinking of adding new catalog support in DataFusion for example, @goldmedal tried to do this for DynamicFileCatalog in #11035 , the current code structure requires that any such implementation be in datafusion/core which is already quite large and monolithic
Also, I think long term it would be a more sustainable structure if we can move ListingTable out of the core as well
However, this move is not likely possible until we move table provider and related catalog traits out of the core.
Also, if we could split up datafusion core more, it would be easier for people to use DataFusion as a smaller embedded library (aka not bring in file format support if they didn't need it)
Describe the solution you'd like
Ideally I would like the following traits / code moved into a datafusion-catalog crate:
TableProviderCatalogProviderSchemaProviderMemoryCatalogProviderInformationSchemaProvider
Describe alternatives you've considered
No response
Additional context
No response
I looked into this --
One of the major challenges is that SessionState's constructor basically installs the "pre-provided" functionality (like data sources, etc
https://github.com/apache/datafusion/blob/78055fe13b64cdfb4accd131e9a462770c21a934/datafusion/core/src/execution/session_state.rs#L190-L295
One way to handle this would be to allow constructing SessionState with the minimal built in features, and have a function in SessionContext like SessionContext::register_built_ins that would register things like listing table, information schema, etc.
That way it would still be easy to use DataFusion with a minimal SessionState but also easily register all the built in extensions 🤔
I made https://github.com/apache/datafusion/pull/11183 to start breaking apart the API and implementation -- there is still a ways to go
@alamb I took a stab at moving parquet functionality into a datafusion-parquet crate (#11188) , and I ran into similar challenges you highlight here. I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.
I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.
I agree with this entirely
The center of the knot is SessionState I think -- figuring out how to get that out of the core is likely key to breaking things up reasonably
The more I think about this the more I think trying to make SessionState a container that doesn't have all the optional features (like parquet support) by default makes sense
So like
let session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContex::install_built_in(&mut session_state);
// now session_state has them here
The more I think about this the more I think trying to make
SessionStatea container that doesn't have all the optional features (like parquet support) by default makes senseSo like
let session_state = SessionState::new(); // no table providers, etc // install standard built in table providers SessionContex::install_built_in(&mut session_state); // now session_state has them here
I filed https://github.com/apache/datafusion/issues/11320 to track this idea
I think this is basically complete, so closing it down