Blog post with DataFusion July - Sep 2024

Open alamb opened this issue 1 year ago • 7 comments

Is your feature request related to a problem or challenge?

We have had good luck writing up quarterly updates for DataFusion, most recently: https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/

See https://github.com/apache/datafusion/issues/9602

Describe the solution you'd like

Blog post

Describe alternatives you've considered

No response

Additional context

No response

alamb avatar Jul 24 '24 11:07 alamb

Here is my wishlist for things to write about in the next blog:

  • Charts showing the speedup from https://github.com/apache/datafusion/issues/10918 (from @XiangpengHao @Weijun-H @PsiACE and others)
  • Charts showing the aggregation performance improvements (e.g. https://github.com/apache/datafusion/pull/11627 etc. with @jayzhan211 and @korowa)
  • Something about Substrait (thanks to @Blizzara @dharanad and others) -- is there any big milestone we can claim?
  • Maybe something about better MAP type support that @goldmedal and others have been working on

Also, of course, I would love to have more help writing a blog (maybe someone else could draft it 🤔 🎣 )

alamb avatar Jul 24 '24 11:07 alamb

@alamb Thank you for considering me, but I think there may be some confusion - I wasn't involved in the work on Substrait. However, I'd be happy to contribute to a blog post on MAP once I've completed adding support for Arrays in #11436

dharanad avatar Jul 24 '24 11:07 dharanad

@alamb Thank you for considering me, but I think there may be some confusion

Yes I was probably confused -- sorry about that

alamb avatar Jul 24 '24 21:07 alamb

@alamb for Substrait - maybe the work @Lordworms has been doing to add the TPC-H tests would be good at least? From my side, I don't know if there's any precise milestone as such - but maybe something around supporting VirtualTables, more literals and types, and better interoperability with other Substrait producers. (I do hope to write a separate blog post from our perspective if/when I've proven the whole setup I'm working on works and is faster, but we're not there yet unfortunately.)

Blizzara avatar Jul 30 '24 18:07 Blizzara

The blog could cover https://github.com/apache/datafusion/pull/11627: performance of high-cardinality aggregations / partial aggregation skipping

alamb avatar Aug 05 '24 11:08 alamb

It would also be cool to discuss the chunked emission efforts in https://github.com/apache/datafusion/pull/11943 for (more) aggregate performance

alamb avatar Aug 22 '24 13:08 alamb

My plan for this is that we will finish up enabling string view and then make that performance improvement the headline for this post

alamb avatar Oct 15 '24 15:10 alamb

I think @Omega359 is going to handle this one in

  • https://github.com/apache/datafusion-site/pull/57

alamb avatar Feb 22 '25 11:02 alamb