datafusion-comet Plan first release

What is the problem the feature request solves?

During the Comet public meeting this morning, there were questions about when the first official release would be. We do not really have an answer to that yet, but we can use this issue to discuss it.

Here are some ideas for milestones that we may want to achieve before creating an official source release (note that we do not necessarily need to create binary releases right away).

Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions
Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage
https://github.com/apache/datafusion-comet/issues/394
https://github.com/apache/datafusion-comet/issues/142

Describe the potential solution

No response

Additional context

No response

May 01 '24 19:05 andygrove

Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions

I think this is a good requirement for the first release.

Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage

TPC-H for sure. TPC-DS may be a stretch goal.

May 02 '24 23:05 parthchandra

One benefit of creating a release is that it is a good opportunity to write a blog post to announce the release and provide an update on the status of the project, and try and encourage more people to contribute. It can also demonstrate the momentum that the project has. I suppose I am now making an argument against waiting until we have great benchmark results before making the first release. I would be interested to hear opinions on this.

May 14 '24 16:05 andygrove

I propose that we create the 0.1.0 source release as soon as we have updated the project to use the upcoming DataFusion 39.0.0 release, which should be available around June 10.

May 31 '24 16:05 andygrove

+1. But we may need the new arrow-rs 52.0.0 release too. Will it be released before DataFusion 39.0.0?

May 31 '24 18:05 viirya

Here is the tracking issue for arrow-rs 52: https://github.com/apache/arrow-rs/issues/5688

It should be available next week. I updated the DataFusion 39 release issue to add this to the prerequisites for the release: https://github.com/apache/datafusion/issues/10517

May 31 '24 19:05 andygrove

I created a Google doc for the community to collaborate on a blog post announcing the release:

https://docs.google.com/document/d/1rnxnbi66oFr5B-OTUxtpi9pnifOxNkmvOVIRTN0BfhY/edit?usp=sharing

May 31 '24 19:05 andygrove

I propose that we create the 0.1.0 source release as soon as we have updated the project to use the upcoming DataFusion 39.0.0 release, which should be available around June 10.

I think this is great news. Some blocking issues are already listed in the issue content.

One issue that comes to mind is binary releases, especially publishing the Comet jar to Maven Central. I think it's crucial to have a published jar so that downstream projects such as Iceberg can depend on it and leverage Comet's vectorized reader. Releasing binaries might require a lot of extra work, so we can skip it for Comet 0.1.0, but it should definitely be planned, and hopefully we can release it in the next version.

Jun 04 '24 13:06 advancedxy

We plan to do a binary release, although it might not be ready in time for the 0.1.0 source release. Publishing to a Maven repository requires more work. Comet involves native code, so it becomes more complicated than pure Java/Scala projects. We need to include pre-built binaries for different platforms in the published jar.
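
As a rough illustration of what bundling per-platform binaries in one jar implies, the sketch below maps the JVM's `os.name` and `os.arch` system properties to a resource path inside the jar. The `native/<os>/<arch>/` layout, the class name `NativeLibPath`, and the library name `comet` are all assumptions made for this example; they are not taken from the Comet build.

```java
import java.util.Locale;

// Hypothetical helper: picks the bundled native library for the current platform.
class NativeLibPath {

    // Assumed jar layout (illustration only): native/<os>/<arch>/lib<name>.<ext>
    static String resourcePath(String osName, String osArch, String libName) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osDir;
        String fileName;
        if (os.contains("mac") || os.contains("darwin")) {
            osDir = "darwin";
            fileName = "lib" + libName + ".dylib";
        } else if (os.contains("linux")) {
            osDir = "linux";
            fileName = "lib" + libName + ".so";
        } else {
            throw new UnsupportedOperationException("Unsupported OS: " + osName);
        }

        // JVMs report the same CPU under different names, so normalize them.
        String archDir;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archDir = "amd64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            archDir = "aarch64";
        } else {
            throw new UnsupportedOperationException("Unsupported architecture: " + osArch);
        }

        return "native/" + osDir + "/" + archDir + "/" + fileName;
    }

    public static void main(String[] args) {
        // Prints the resource path for the platform this JVM is running on.
        System.out.println(resourcePath(
                System.getProperty("os.name"),
                System.getProperty("os.arch"),
                "comet"));
    }
}
```

A loader would typically copy the selected resource to a temporary file and pass that file's path to `System.load`. Every platform the jar claims to support needs its own pre-built binary under this layout, which is why the release build is more involved than for a pure Java/Scala project.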

Jun 04 '24 14:06 viirya

I assume the source release will tag the repo with a release-0.1.0 tag. Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.

Jun 05 '24 00:06 parthchandra

I assume the source release will tag the repo with a release-0.1.0 tag. Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.

Yes, absolutely. We (or anyone) can choose to release binary artifacts from the source release or the tag in the repo. ASF does not have any special involvement in that.

Jun 05 '24 21:06 andygrove

I created a milestone where we can track the priority issues for the 0.1.0 release

https://github.com/apache/datafusion-comet/milestone/1

Jun 05 '24 22:06 andygrove

We plan to do a binary release, although it might not be ready in time for the 0.1.0 source release.

I think we are on the same page.

Comet involves native code, so it becomes more complicated than pure Java/Scala projects. We need to include pre-built binaries for different platforms in the published jar.

Yes, it might be a bit complicated. But I think the Rust toolchain does an excellent job of cross-compiling. If I'm not wrong, the Makefile in this repo already has a release-linux target, which builds both Linux and Mac libs (for both Intel and ARM CPUs). It should be a good starting point.

Even though a maven artifact would not be published, it does allow projects to build their own, or even add comet as a git submodule, based on a relatively 'stable' version.

Of course, that could be an option. However, a Maven artifact would be the most convenient way for people in the JVM ecosystem.

Jun 05 '24 23:06 advancedxy

I think we are getting close to being able to release 0.1.0 now that we are using an official DataFusion release again (or will be in a few days when DF 40 is released to crates.io).

There are a few remaining issues in the 0.1.0 milestone, the most important ones (IMO) being:

https://github.com/apache/datafusion-comet/issues/524
https://github.com/apache/datafusion-comet/issues/387

@parthchandra @viirya is there anything else that you think we need to address for the first source release?

Jul 10 '24 14:07 andygrove

is there anything else that you think we need to address for the first source release?

No. I think the above two issues are most notable at the moment.

Jul 10 '24 15:07 viirya

is there anything else that you think we need to address for the first source release?

No. I think the above two issues are most notable at the moment.

I think this is good.

Jul 11 '24 00:07 parthchandra

@viirya @parthchandra I no longer think that it is critical to fix https://github.com/apache/datafusion-comet/issues/387 before we release, because users can already enable shuffle, so this is just a config change from the user's point of view.

If there are no objections, I will plan on creating 0.1.0-rc1 next week, after reviewing the documentation to make sure all known issues are documented.

Jul 19 '24 16:07 andygrove

+1

Jul 19 '24 19:07 viirya

I no longer think that it is critical to fix #387 before we release, because users can already enable shuffle, so this is just a config change from the user's point of view.

Agreed

Jul 20 '24 01:07 parthchandra
