datafusion
Break datafusion crate into smaller crates
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is a new feature request: break the datafusion crate into separate, smaller crates.
This would help with code management and with reasoning about dependencies.
Describe the solution you'd like
- [x] #1752
- [x] #1751
- [x] #1758
- [x] #1760
- [x] #1753
- [x] #1759
- [x] #1761
- [x] #1763
- [x] #1764
- [x] #1765
- [x] #1762
- [x] #1774
- [x] #1784
- [x] #1794
- [x] #1844
- [x] #1843
- [x] #1865
- [x] https://github.com/apache/arrow-datafusion/pull/1889
- [x] https://github.com/apache/arrow-datafusion/pull/1892
- [x] #1772
- [ ] file format
- [ ] listing and object store
- [ ] #1755
- [ ] move logical plan builder
- [ ] move logical plan structs
- [ ] move optimizers
- [ ] #1754
- [ ] move physical planner
- [ ] move execution exprs
- [ ] move physical plan optimizer
- [ ] split datafusion-execution sub-module
- [ ] move memory and disk manager
- [ ] move stats collection
cc @alamb @houqp @Dandandan what do you think?
I like this idea @Jimexist 👍 There is some related commentary / information here: https://github.com/apache/arrow-datafusion/issues/348 as well
Some other crates that might be useful to consider:
- datafusion_core (DataFusionError, DFSchema, etc.)
- datafusion_datasource (the built-in Parquet, Avro, CSV, and JSON readers and supporting logic)
This will be helpful for use cases where users are looking to use DataFusion in a similar fashion to Calcite, for query parsing and planning, but not for execution. I like this idea.
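To make the planning-without-execution use case concrete, here is a minimal sketch in which Rust modules stand in for separate crates. Every name in it (`common`, `expr`, `execution`, `PlanError`, `validate`, `eval`) is hypothetical and is not a DataFusion API; it only illustrates that a consumer can depend on the expression layer without pulling in the execution layer.

```rust
// Each module stands in for a separate crate. All names are hypothetical
// and chosen for illustration; none of them are DataFusion APIs.

mod common {
    /// Shared error type, in the spirit of a common error crate.
    #[derive(Debug, PartialEq)]
    pub enum PlanError {
        UnknownColumn(String),
    }
}

mod expr {
    use super::common::PlanError;

    /// A tiny logical expression tree.
    #[derive(Debug, Clone, PartialEq)]
    pub enum Expr {
        Column(String),
        Literal(i64),
        Add(Box<Expr>, Box<Expr>),
    }

    /// Check that every column referenced by `expr` exists in `schema`.
    /// This depends only on `common`, never on `execution`.
    pub fn validate(expr: &Expr, schema: &[&str]) -> Result<(), PlanError> {
        match expr {
            Expr::Column(name) => {
                if schema.contains(&name.as_str()) {
                    Ok(())
                } else {
                    Err(PlanError::UnknownColumn(name.clone()))
                }
            }
            Expr::Literal(_) => Ok(()),
            Expr::Add(left, right) => {
                validate(left, schema)?;
                validate(right, schema)
            }
        }
    }
}

mod execution {
    use super::expr::Expr;

    /// Evaluate `expr` against one row of (column name, value) pairs.
    /// Only users who actually run queries would need this layer.
    pub fn eval(expr: &Expr, row: &[(&str, i64)]) -> Option<i64> {
        match expr {
            Expr::Column(name) => row
                .iter()
                .find(|(n, _)| *n == name.as_str())
                .map(|(_, v)| *v),
            Expr::Literal(v) => Some(*v),
            Expr::Add(left, right) => Some(eval(left, row)? + eval(right, row)?),
        }
    }
}
```

A planning-only tool would use `common` and `expr` and never reference `execution`, so in a real multi-crate layout it would not need to compile the execution code at all.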
Thank you @Jimexist for taking on this! I think this is the right path forward.
cc @jorgecarleitao and @yjshen since they have proposed similar ideas before.
Hi @Jimexist, is it possible to make a dedicated crate for the data source, so that it is easier to integrate different kinds of remote object stores?
Suppose I want to introduce HDFS as one of the remote object stores. To be less intrusive, it would be better to make an independent datafusion-objectstore-hdfs crate which depends on the datasource crate of datafusion. Then it would be much easier for other crates to decide whether to depend on the datafusion-hdfs crate or not, without cyclic dependencies.
yes it's a good idea and i believe it's included in the list already
Thanks @Jimexist
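The dependency direction discussed in this exchange can be sketched as follows. This is only an illustration under assumptions: modules stand in for crates, and the names `datasource`, `objectstore_hdfs`, `ObjectStore`, `Registry`, and `HdfsStore` are hypothetical rather than actual DataFusion types. The storage trait lives in the datasource layer, and the HDFS layer depends on it, never the other way around.

```rust
// Modules stand in for crates. All names are hypothetical and are not
// the actual DataFusion object store API.

mod datasource {
    /// Minimal storage abstraction owned by the data source layer.
    pub trait ObjectStore {
        /// URI scheme this store handles, e.g. "hdfs".
        fn scheme(&self) -> &'static str;
        /// Read the whole object at `path`.
        fn read(&self, path: &str) -> Result<Vec<u8>, String>;
    }

    /// Holds the stores that the application chose to register.
    pub struct Registry {
        stores: Vec<Box<dyn ObjectStore>>,
    }

    impl Registry {
        pub fn new() -> Self {
            Registry { stores: Vec::new() }
        }

        pub fn register(&mut self, store: Box<dyn ObjectStore>) {
            self.stores.push(store);
        }

        /// Route a `scheme://path` URI to the store registered for the scheme.
        pub fn read(&self, uri: &str) -> Result<Vec<u8>, String> {
            let (scheme, path) = uri
                .split_once("://")
                .ok_or_else(|| format!("missing scheme in {}", uri))?;
            let store = self
                .stores
                .iter()
                .find(|s| s.scheme() == scheme)
                .ok_or_else(|| format!("no object store registered for {}", scheme))?;
            store.read(path)
        }
    }
}

mod objectstore_hdfs {
    // The only dependency is on the datasource layer.
    use super::datasource::ObjectStore;

    /// Stand-in store; a real implementation would call an HDFS client here.
    pub struct HdfsStore;

    impl ObjectStore for HdfsStore {
        fn scheme(&self) -> &'static str {
            "hdfs"
        }

        fn read(&self, path: &str) -> Result<Vec<u8>, String> {
            // Placeholder behaviour so the sketch runs without a cluster.
            Ok(format!("contents of {}", path).into_bytes())
        }
    }
}
```

An application that wants HDFS support would register `HdfsStore` with the `Registry`; one that does not simply never depends on the HDFS layer, and `datasource` itself never needs to know that it exists.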
my current plan is to finish items for:
- datafusion-common
- datafusion-expr
before release 7:
- #1587
and finish the rest after the release, to cap the amount of changes in a release
@Jimexist could this lead to a smaller ballista client crate as well? Potentially, this could greatly speed up compilation of programs that just want to be datafusion clients and not run the entire stack.
i'm not sure as of now - but probably if we split up logical/physical planning further - but that won't happen soon
Related comment: https://github.com/apache/arrow-datafusion/pull/1762#discussion_r802563462
i wanted to continue iterating on this split but obviously i didn't have time to keep up. a lot has happened in the last few months. i wonder if this is still relevant or whether it should be closed now. @alamb and @andygrove any suggestions?
@Jimexist welcome back!
I would say that the main "break datafusion into smaller crates" work has been completed. There are still some items in the description of this ticket that might not be done. I would personally recommend filing a new issue with any items that you think would still be good to work on, and then closing this one.
Closing this ticket; we can track further splitting in follow-on issues. Again 👏
I am thinking of trying to push us to the next level here https://github.com/apache/arrow-datafusion/issues/4181