datafusion-comet Planning to publish Roadmap?

Hi, I would like to inquire about a roadmap for Comet.

I am currently exploring the option of replacing Spark workloads with native engines, and I'm keen to understand the future plans and scope of this plugin. Any information on the roadmap would be greatly appreciated.

Thank you!

Feb 14 '24 17:02 okue

Thanks @okue . Yes, will add a roadmap into doc soon. We used to have it internally.

Feb 14 '24 18:02 sunchao

Maybe anyone else who is interested in using / contributing to comet could use this ticket to explain their use case / any features they are interested in helping with

Feb 15 '24 11:02 alamb

I'd like to put a suggestion, based on my experience a lot of production spark workloads have some kind of UDF, good chunk of those UDFs are very simple and can be expressed as SQL expressions.

I assume that in case UDF is used, comet will fall back to classic spark execution, which might not be optimal (I might be wrong, apologise if I am, I did not do my homework to check comet code in depth). My suggestion is to consider adding functionality like https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html which can speed up UDF in comet case as well.

I believe there is nothing GPU specific in that code, and it can be reused, just not sure what would be the best approach.

Maybe @andygrove would be able to help

Feb 15 '24 11:02 milenkovicm

Five years ago, we were delving into bytecode analysis for similar purposes at Apple, as detailed in this video: https://www.youtube.com/watch?v=FWg8iF34RJw. Our implementation differed from Nvidia's, and we encountered many edge cases that didn't work. Consequently, we halted further investment in this area. Nonetheless, we remain optimistic that we can refine the process to reliably handle simpler use-cases, prompting us to consider revisiting this approach.

Feb 15 '24 17:02 dbtsai

Five years ago, we were delving into bytecode analysis for similar purposes at Apple, as detailed in this video: https://www.youtube.com/watch?v=FWg8iF34RJw. Our implementation differed from Nvidia's, and we encountered many edge cases that didn't work. Consequently, we halted further investment in this area. Nonetheless, we remain optimistic that we can refine the process to reliably handle simpler use-cases, prompting us to consider revisiting this approach.

Thanks for sharing @dbtsai!

From my experience some projects mostly have "simple" udfs others may have mostly complex functions. It'll be hard to get it right for all of them. Simple functions usually deal with data validation and they may be called all over the place, so speeding them up may make sense

Feb 15 '24 18:02 milenkovicm

To add on what @dbtsai said, this is the original PR that @aokolnychyi opened. Some of the discussions in the PR are still worthwhile to read.

Also cc @jlowe @andygrove @abellina @tgravescs @winningsix as we have went through exact the same topic lately.

Feb 16 '24 00:02 sunchao

Another perhaps interesting read from the literature is:

"Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" - https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

The findings of the paper is that Vectorized evaluation (aka what we use in DataFusion) and compiled evaluation offer similar performance for most analytic style workloads (though each is better at certain cases than the otehr)

The reason I think JIT / compiled engines are less common than vectorized engines is the software engineering challenges -- that they are often harder to maintain and debug when something goes wrong compared to "normal" vectoized code.

Feb 20 '24 07:02 alamb

Another perhaps interesting read from the literature is:

"Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" - https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

The findings of the paper is that Vectorized evaluation (aka what we use in DataFusion) and compiled evaluation offer similar performance for most analytic style workloads (though each is better at certain cases than the otehr)

The reason I think JIT / compiled engines are less common than vectorized engines is the software engineering challenges -- that they are often harder to maintain and debug when something goes wrong compared to "normal" vectoized code.

I believe authors of "Photon: A Fast Query Engine for Lakehouse Systems" came with similar conclusion.

Feb 20 '24 09:02 milenkovicm

Is there a plan to integrate native Iceberg support? Like https://github.com/apache/iceberg-rust

Mar 06 '24 16:03 senordeveloper

@senordeveloper Yes, we do have plan to integrate with it, but probably sometime in future given the project is still new. It will require some work given our current Parquet reader implementation is a hybrid one with IO & decompression done at the JVM side while decoding at the native side. We have plan to move to a fully native implementation like the one in arrow-rs. After that, it should be much easier to integrate with iceberg-rs or delta-rs.

Mar 06 '24 18:03 sunchao

The short term roadmap is now available in this blog post: https://datafusion.apache.org/blog/2024/07/20/datafusion-comet-0.1.0/

Jul 25 '24 13:07 andygrove