datafusion-comet
                        Plan first release
What is the problem the feature request solves?
During the Comet public meeting this morning, there were questions about when the first official release would be. We do not really have an answer to that yet, but we can use this issue to discuss it.
Here are some ideas for milestones that we may want to achieve before creating an official source release (note that we do not necessarily need to create binary releases right away).
- Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions
- Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage
- https://github.com/apache/datafusion-comet/issues/394
- https://github.com/apache/datafusion-comet/issues/142
 
Describe the potential solution
No response
Additional context
No response
> - Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions

I think this is a good requirement for the first release.
> - Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage

TPC-H for sure. TPC-DS may be a stretch goal.
One benefit of creating a release is that it is a good opportunity to write a blog post that announces the release, provides an update on the status of the project, and encourages more people to contribute. It can also demonstrate the momentum that the project has. I suppose I am now making an argument against waiting until we have great benchmark results before making the first release. I would be interested to hear opinions on this.
I propose that we create the 0.1.0 source release as soon as we have updated the project to use the upcoming DataFusion 39.0.0 release, which should be available around June 10.
+1. But we may need the new arrow-rs 52.0.0 release too. Will it be released before DataFusion 39.0.0?
Here is the tracking issue for arrow-rs 52: https://github.com/apache/arrow-rs/issues/5688
It should be available next week. I updated the DataFusion 39 release issue to add this to the prerequisites for the release: https://github.com/apache/datafusion/issues/10517
I created a Google doc for the community to collaborate on a blog post announcing the release:
https://docs.google.com/document/d/1rnxnbi66oFr5B-OTUxtpi9pnifOxNkmvOVIRTN0BfhY/edit?usp=sharing
> I propose that we create the 0.1.0 source release as soon as we have updated the project to use the upcoming DataFusion 39.0.0 release, which should be available around June 10.

I think this is great news. Some blocking issues are already listed in the issue content.
One issue that comes to mind is binary releases, especially publishing the Comet jar to Maven Central. I think it's crucial to have a published jar so that downstream projects such as Iceberg can depend on it and leverage Comet's vectorized reader. Releasing binaries might require a lot of extra work, so we can skip it for Comet 0.1.0, but it should definitely be planned, and hopefully we can release it in the next version.
We plan to do a binary release, although it might not be ready in time for the 0.1.0 source release. Publishing to a Maven repo needs more work. Comet involves native code, so it is more complicated than pure Java/Scala projects. We need to include pre-built binaries for different platforms in the published jar.
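To illustrate why native code complicates publishing, here is a minimal sketch of how a loader might choose among several pre-built libraries bundled in a single jar. The `native/<os>/<arch>/` layout, the library file names, and the function name `native_resource_path` are all assumptions made for illustration; they are not taken from Comet's actual packaging.

```python
import platform

# Hypothetical library file names per operating system (illustrative only).
_LIB_NAMES = {
    "linux": "libcomet.so",
    "darwin": "libcomet.dylib",
    "windows": "comet.dll",
}

# Normalize the different spellings that platforms report for the same CPU.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "aarch64",
    "arm64": "aarch64",
}


def native_resource_path(os_name=None, arch=None):
    """Return the (hypothetical) in-jar path of the native library for a platform.

    Defaults to the platform this interpreter is running on.
    """
    os_key = (os_name or platform.system()).lower()
    arch_key = (arch or platform.machine()).lower()
    if os_key not in _LIB_NAMES:
        raise ValueError(f"no pre-built library for OS: {os_key}")
    if arch_key not in _ARCH_ALIASES:
        raise ValueError(f"no pre-built library for architecture: {arch_key}")
    return f"native/{os_key}/{_ARCH_ALIASES[arch_key]}/{_LIB_NAMES[os_key]}"
```

Every (OS, architecture) pair in these tables corresponds to a binary that would have to be built and bundled before the jar is published, which is the extra work a pure Java/Scala project does not have.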
I assume the source release will tag the repo with a release-0.1.0 tag. Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.
> I assume the source release will tag the repo with a release-0.1.0 tag. Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.

Yes, absolutely. We (or anyone) can choose to release binary artifacts from the source release or the tag in the repo. ASF does not have any special involvement in that.
I created a milestone where we can track the priority issues for the 0.1.0 release
https://github.com/apache/datafusion-comet/milestone/1
> We plan to do a binary release, although it might not be ready in time for the 0.1.0 source release.

I think we are on the same page.
> Comet involves native code, so it is more complicated than pure Java/Scala projects. We need to include pre-built binaries for different platforms in the published jar.

Yes, it might be a bit complicated, but I think the Rust toolchain does an excellent job of cross-compiling. If I'm not mistaken, the Makefile in this repo already has a release-linux target, which builds both Linux and macOS libraries (for both Intel and ARM CPUs). It should be a good starting point.
> Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.

Of course, that could be an option. However, a Maven artifact would be the most convenient way for people in the JVM ecosystem.
I think we are getting close to being able to release 0.1.0 now that we are using an official DataFusion release again (or will be in a few days when DF 40 is released to crates.io).
There are a few remaining issues in the 0.1.0 milestone, the most important ones (IMO) being:
- https://github.com/apache/datafusion-comet/issues/524
- https://github.com/apache/datafusion-comet/issues/387
 
@parthchandra @viirya is there anything else that you think we need to address for the first source release?
> is there anything else that you think we need to address for the first source release?

No. I think the above two issues are most notable at the moment.
> > is there anything else that you think we need to address for the first source release?
>
> No. I think the above two issues are most notable at the moment.

I think this is good.
@viirya @parthchandra I no longer think that it is critical to fix https://github.com/apache/datafusion-comet/issues/387 before we release, because users can already enable shuffle, so this is just a config change from the user's point of view.
If there are no objections, I plan to create 0.1.0-rc1 next week, after reviewing the documentation to make sure all known issues are documented.
+1
> I no longer think that it is critical to fix #387 before we release, because users can already enable shuffle, so this is just a config change from the user's point of view.

Agreed