Break datafusion crate into smaller crates

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A new feature: break the datafusion crate into separate, smaller crates.

This helps with code management and with reasoning about dependencies.

Describe the solution you'd like

- [x] #1752
  - [x] #1751
  - [x] #1758
  - [x] #1760
- [x] #1753
  - [x] #1759
  - [x] #1761
  - [x] #1763
  - [x] #1764
  - [x] #1765
  - [x] #1762
  - [x] #1774
  - [x] #1784
  - [x] #1762
  - [x] #1794
- [x] #1844
  - [x] #1843
  - [x] #1865
  - [x] https://github.com/apache/arrow-datafusion/pull/1889
  - [x] https://github.com/apache/arrow-datafusion/pull/1892
- [x] #1772
  - [ ] file format
  - [ ] listing and object store
- [ ] #1755
  - [ ] move logical plan builder
  - [ ] move logical plan structs
  - [ ] move optimizers
- [ ] #1754
  - [ ] move physical planner
  - [ ] move execution exprs
  - [ ] move physical plan optimizer
- [ ] split datafusion-execution sub-module
  - [ ] move memory and disk manager
  - [ ] move stats collection

Feb 05 '22 06:02 jimexist

cc @alamb @houqp @Dandandan what do you think?

Feb 05 '22 06:02 jimexist

I like this idea @Jimexist 👍 There is some related commentary / information here: https://github.com/apache/arrow-datafusion/issues/348 as well

Some other crates that might be useful to consider:

datafusion_core (DataFusionError, DFSchema, etc)
datafusion_datasource (the built in parquet, avro, csv and json readers and supporting logic)

Feb 05 '22 11:02 alamb

This will be helpful for use cases where users are looking to use DataFusion in a similar fashion to Calcite, for query parsing and planning, but not for execution. I like this idea.
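To make the planning-without-execution use case concrete, here is a minimal sketch in which each Rust module stands in for a separate crate. All module, type, and function names (`common`, `logical_plan`, `Schema`, `PlanError`, `project`, `plan_only_example`) are hypothetical and are not DataFusion's actual API; the only point is that a consumer can build and validate a plan while depending on nothing execution-related.

```rust
// Each module stands in for a hypothetical crate; all names are illustrative only.

/// Stand-in for a "common" crate: shared schema and error types.
pub mod common {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Schema {
        pub columns: Vec<String>,
    }

    #[derive(Debug, PartialEq)]
    pub enum PlanError {
        UnknownColumn(String),
    }
}

/// Stand-in for a planning crate: it depends only on `common`.
pub mod logical_plan {
    use super::common::{PlanError, Schema};

    #[derive(Debug, Clone, PartialEq)]
    pub enum LogicalPlan {
        Scan { table: String, schema: Schema },
        Projection { columns: Vec<String>, input: Box<LogicalPlan> },
    }

    impl LogicalPlan {
        /// Output schema of this plan node.
        pub fn schema(&self) -> Schema {
            match self {
                LogicalPlan::Scan { schema, .. } => schema.clone(),
                LogicalPlan::Projection { columns, .. } => Schema {
                    columns: columns.clone(),
                },
            }
        }
    }

    /// Wrap `input` in a projection, rejecting columns the input does not produce.
    pub fn project(input: LogicalPlan, columns: &[&str]) -> Result<LogicalPlan, PlanError> {
        let available = input.schema();
        for c in columns {
            if !available.columns.iter().any(|a| a.as_str() == *c) {
                return Err(PlanError::UnknownColumn((*c).to_string()));
            }
        }
        Ok(LogicalPlan::Projection {
            columns: columns.iter().map(|c| (*c).to_string()).collect(),
            input: Box::new(input),
        })
    }
}

/// A planning-only consumer: nothing here refers to execution code.
pub fn plan_only_example() -> Result<logical_plan::LogicalPlan, common::PlanError> {
    let scan = logical_plan::LogicalPlan::Scan {
        table: "t".to_string(),
        schema: common::Schema {
            columns: vec!["a".to_string(), "b".to_string()],
        },
    };
    logical_plan::project(scan, &["a"])
}
```

With real crates, the same shape would mean a planning-only user never compiles the execution engine at all.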

Feb 05 '22 15:02 andygrove

Thank you @Jimexist for taking this on! I think this is the right path forward.

cc @jorgecarleitao and @yjshen since they have proposed similar ideas before.

Feb 06 '22 05:02 houqp

Hi @Jimexist, is it possible to make a dedicated crate for the data source, so that different kinds of remote object stores are easy to integrate?

Suppose I want to introduce HDFS as one of the remote object stores. To be less intrusive, it would be better to make an independent datafusion-objectstore-hdfs crate that depends on DataFusion's datasource crate. Then it would be much easier for other crates to decide whether to depend on the datafusion-hdfs crate, without introducing cyclic dependencies.
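The layering described above can be sketched as follows, again with modules standing in for crates. The `ObjectStore` trait, `ObjectStoreRegistry`, and `HdfsStore` shown here are simplified, hypothetical stand-ins (synchronous and in-memory), not DataFusion's actual API. What matters is the direction of the dependency: the HDFS module imports from the datasource module and never the other way around, so no cycle can form.

```rust
/// Stand-in for a core "datasource" crate: it defines the trait and a registry,
/// and knows nothing about any concrete remote store.
pub mod datasource {
    use std::collections::HashMap;

    pub trait ObjectStore {
        /// URI scheme handled by this store, e.g. "hdfs".
        fn scheme(&self) -> &str;
        /// List object paths that start with `prefix`.
        fn list(&self, prefix: &str) -> Vec<String>;
    }

    #[derive(Default)]
    pub struct ObjectStoreRegistry {
        stores: HashMap<String, Box<dyn ObjectStore>>,
    }

    impl ObjectStoreRegistry {
        /// Register a store under the scheme it reports.
        pub fn register(&mut self, store: Box<dyn ObjectStore>) {
            self.stores.insert(store.scheme().to_string(), store);
        }

        /// Resolve `scheme://path` to a registered store and list under `path`.
        /// Returns `None` if the URI has no scheme or the scheme is unknown.
        pub fn list(&self, uri: &str) -> Option<Vec<String>> {
            let (scheme, path) = uri.split_once("://")?;
            self.stores.get(scheme).map(|s| s.list(path))
        }
    }
}

/// Stand-in for an independent HDFS crate: it depends on `datasource` only.
pub mod objectstore_hdfs {
    use super::datasource::ObjectStore;

    /// Toy in-memory store; a real implementation would talk to HDFS.
    pub struct HdfsStore {
        pub files: Vec<String>,
    }

    impl ObjectStore for HdfsStore {
        fn scheme(&self) -> &str {
            "hdfs"
        }

        fn list(&self, prefix: &str) -> Vec<String> {
            self.files
                .iter()
                .filter(|f| f.starts_with(prefix))
                .cloned()
                .collect()
        }
    }
}

/// An application opts in to HDFS support by depending on both pieces.
pub fn build_registry() -> datasource::ObjectStoreRegistry {
    let mut registry = datasource::ObjectStoreRegistry::default();
    registry.register(Box::new(objectstore_hdfs::HdfsStore {
        files: vec!["data/a.parquet".to_string(), "logs/b.csv".to_string()],
    }));
    registry
}
```

An application that does not need HDFS would simply skip the `objectstore_hdfs` dependency and the `register` call; the core module is unaffected either way.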

Feb 07 '22 07:02 yahoNanJing

> Hi @Jimexist, is it possible to make a dedicated crate for the data source, so that different kinds of remote object stores are easy to integrate?
>
> Suppose I want to introduce HDFS as one of the remote object stores. To be less intrusive, it would be better to make an independent datafusion-objectstore-hdfs crate that depends on DataFusion's datasource crate. Then it would be much easier for other crates to decide whether to depend on the datafusion-hdfs crate, without introducing cyclic dependencies.

Yes, it's a good idea, and I believe it's already included in the list.

Feb 07 '22 09:02 jimexist

Thanks @Jimexist

Feb 08 '22 01:02 yahoNanJing

my current plan is to finish items for:

datafusion-common
datafusion-expr

before release 7:

#1587

and finish the rest after the release, to cap the amount of changes in a release

Feb 08 '22 04:02 jimexist

@Jimexist could this lead to a smaller ballista client crate as well? Potentially, this could greatly speed up compilation of programs that just want to be datafusion clients and not run the entire stack.

Feb 09 '22 11:02 Igosuki

I'm not sure as of now. It would probably be possible if we split up logical/physical planning further, but that won't happen soon.

Feb 09 '22 15:02 jimexist

Related comment: https://github.com/apache/arrow-datafusion/pull/1762#discussion_r802563462

Feb 09 '22 21:02 alamb

I wanted to continue iterating on this split, but obviously I didn't have time to keep up. A lot has happened in the last few months. I wonder whether this is still relevant or should be closed now. @alamb and @andygrove, any suggestions?

Jul 20 '22 09:07 jimexist

@Jimexist welcome back!

I would say that the main "break datafusion into smaller crates" work has been completed. There are still some items in the description of this ticket that might not be done. I would personally recommend filing a new issue with any items that you think would still be good to work on, and then closing this one.

Jul 20 '22 10:07 alamb

Closing this ticket; we can track further splitting in follow-on issues. Again 👏

Oct 24 '22 14:10 alamb

I am thinking of trying to push us to the next level here https://github.com/apache/arrow-datafusion/issues/4181

Nov 11 '22 22:11 alamb
