datafusion-ballista Ballista Enhancement Overview

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current Ballista implementation is closer to a proof of concept that verifies whether DataFusion operators can run in a distributed way. It sets up the overall framework and works well for that verification. However, it is still a long way from being ready for production use on real workloads. This issue raises several aspects we need to consider and enhance to build a more robust distributed execution framework.

In the big data era there are many scenarios. Two common ones are interactive analytical queries and batch processing for ETL. There is no silver bullet: each scenario has its own characteristics and its own needs. Below, I describe some enhancements we can make for each scenario.

For both interactive query and batch processing:

[ ] [Necessary] apache/arrow-ballista#6
[x] [Necessary] apache/arrow-datafusion#1703
[ ] [Necessary] apache/arrow-datafusion#1704
[ ] [Necessary] Support fast recovery after a scheduler restart
[ ] [Necessary] Support better handling of executor loss
[ ] [Necessary] Support better configuration management
[ ] [Nice to have] Support scheduling stages based on priority
[ ] [Nice to have] Support cancelling a SQL query or a job
[ ] [Nice to have] Support an executor blacklist

For interactive query:

[x] [Necessary] Support push-based task assignment apache/arrow-datafusion#1221
[ ] [Necessary] Support better data exchange that does not spill to disk apache/arrow-datafusion#1805
[ ] [Necessary] Support better result fetching that does not spill to disk

For batch processing:

[ ] [Necessary] Support speculative task scheduling
[ ] [Necessary] Support shuffle fetch failure handling and retry
[ ] [Necessary] Support reattempting some stages

Jan 29 '22 06:01 yahoNanJing

This would be a milestone in Ballista! 👍

Jan 29 '22 08:01 Ted-Jiang

Great, I hope I can contribute to these goals as much as I can.

Feb 07 '22 02:02 EricJoy2048

Great, I hope I can contribute to these goals as much as I can.

Hi @gaojun2048, which part are you interested in? Feel free to pick up some tasks.

Feb 14 '22 08:02 yahoNanJing

Is Ballista targeting a data computing engine like Spark, or an ad-hoc query engine like Presto / CK / Impala? I believe our roadmap would differ depending on the goal.

Feb 26 '22 15:02 EricJoy2048
