[EPIC] Ballista 2025/H2 Roadmap Proposal

Open milenkovicm opened this issue 8 months ago • 4 comments

Following the completion of #1068, it's time to propose the next steps for Ballista.

In the short term, I would like to focus on the following areas:

  • Improving test coverage: We continue to encounter bugs that could be prevented with more comprehensive unit and integration tests.
  • Code cleanup and refactoring: There are several areas in the codebase that can be simplified or refactored, which can complement the testing improvements.
  • Improving shuffle: There are numerous open issues related to shuffle files, and resolving them could yield significant benefits.
  • Job-related enhancements: This includes improvements to job dependency graphs, adaptive query execution, and related functionality.
  • Enhanced observability: Increasing the scope of scheduler-emitted events, including executor-related events, will help improve visibility and debugging.
  • Simplifying and improving GitHub Actions: Streamlining our CI/CD processes to be more efficient and maintainable.

More details will follow after further discussion with the community.

Once again, thank you all for the incredible support on #1068!

milenkovicm, Apr 18 '25 16:04

There has been a lot of progress with shuffle performance in Comet that Ballista could benefit from.

andygrove, Apr 18 '25 20:04

@andygrove I would give shuffle-related tasks the highest priority. I was thinking of #320 and a few others related to compression, schema serialization, and so on, but if there are easy picks in Comet, I'm more than happy to start from there. It would be great if you could provide a few pointers for me to start.

milenkovicm, Apr 18 '25 20:04

There is work in progress to add a datafusion-spark crate in the core DataFusion repo. See https://github.com/apache/datafusion/issues/5600 and https://github.com/apache/datafusion/pull/15168.

I would be happy to move some parts of Comet shuffle into this crate once it is available.

Edit: using a Spark-compatible shuffle file format may not necessarily be attractive for Ballista. We'll have to see whether that makes sense.

andygrove, Apr 18 '25 20:04

I would be happy to help. I will have a look at Comet.

milenkovicm, Apr 18 '25 20:04

Closing this task, as we're in the middle of 2025/H2.

milenkovicm, Sep 14 '25 09:09