Blog post with DataFusion July - Sep 2024
Is your feature request related to a problem or challenge?
We have had good luck writing up quarterly updates for DataFusion, most recently: https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/
See https://github.com/apache/datafusion/issues/9602
Describe the solution you'd like
Blog post
Describe alternatives you've considered
No response
Additional context
No response
Here is my wishlist for things to write about in the next blog:
- Charts showing the speedup from https://github.com/apache/datafusion/issues/10918 (from @XiangpengHao @Weijun-H @PsiACE and others)
- Charts showing aggregation performance improvements (e.g. https://github.com/apache/datafusion/pull/11627 etc. with @jayzhan211 and @korowa)
- Something about Substrait (thanks to @Blizzara @dharanad and others) -- is there any big milestone we can claim?
- Maybe something about better MAP type support that @goldmedal and others have been working on
Also, of course, I would love to have more help writing a blog (maybe someone else could draft it 🤔 🎣 )
@alamb Thank you for considering me, but I think there may be some confusion - I wasn't involved in the work on Substrait. However, I'd be happy to contribute to a blog post on MAP once I've completed adding support for Arrays in #11436
> @alamb Thank you for considering me, but I think there may be some confusion
Yes I was probably confused -- sorry about that
@alamb for Substrait - maybe the work @Lordworms has been doing to add the TPC-H tests would be good at least? From my side, I don't know if there's any precise milestone as such - but maybe something around supporting VirtualTables, more literals and types, and better interoperability with other Substrait producers. (I do hope to write a separate blog post from our perspective if/when I've proven the whole setup I'm working on works and is faster, but we're not there yet unfortunately.)
Blog with https://github.com/apache/datafusion/pull/11627 performance high cardinality aggs / partial skipping
It would also be cool to discuss efforts for chunked emission https://github.com/apache/datafusion/pull/11943 for (more) aggregate performance
My plan for this is that we will finish up enabling string view and then make that performance improvement be the headline for this post
I think @Omega359 is going to handle this one in
- https://github.com/apache/datafusion-site/pull/57