datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

[EPIC] Improving Performance

Open andygrove opened this issue 1 year ago • 0 comments

What is the problem the feature request solves?

This epic is a place to track various ideas around improving query performance.

Some of these ideas apply to upstream DataFusion rather than being Comet-specific.

Planned Work

  • https://github.com/apache/datafusion-comet/issues/669
  • https://github.com/apache/datafusion-comet/issues/670
  • https://github.com/apache/datafusion-comet/issues/679

Ideas to Research

These are longer term ideas to explore.

  • https://github.com/apache/datafusion-comet/issues/571
  • Use selection vectors instead of copying batches during filter operations (see paper at https://www.pdl.cmu.edu/ftp/Database/ngom-damon2021.pdf)
  • Implement StringView / BinaryView
    • Can particulary help in the parquet scan/filter case to avoid string copies. See Andrew Lamb's talk from the June 24 Bay Area DataFusion Meetup for more info.
  • Should we implement native versions of RowToColumnar / ColumnarToRow?
  • Check that we are removing FilterExec when all filter conditions are successfully pushed down to Parquet, to avoid evaluating the filter predicates twice
  • Implement tooling for saving output of query stage to disk so that we can benchmark individual query stages in rust, outside of spark
  • Spark will remove dictionary-encoding after a cast (and maybe after other expressions?) but it could be advantageous to retain the dictionary encoding so that upstream native operators can take advantage of this?

Ideas no longer being pursued

  • Use JIT for evaluating nested expressions to avoid intermediate arrays (there was a datafusion-jit module, but it was abandonded, so we need to see why)
    • Update: previous experiments in Arrow/DataFusion did not show a speedup from this appraoch and it added a lot of extra code and complexity
  • Use mutable vectors during expression evaluation to avoid intermediate arrays (in-place updates are available in arrow-rs)

Related DataFusion issues

Related Comet issues:

  • https://github.com/apache/datafusion-comet/issues/495

Describe the potential solution

No response

Additional context

No response

andygrove avatar Jun 13 '24 17:06 andygrove