mars
mars copied to clipboard
[WIP][Scheduling] Add worker node failover with lineage
What do these changes do?
Currently, the supervisor will cancel the execution of the entire stage after receiving an error report that the execution of the subtask fails when the MainPool process exits in Mars. For traditional batch processing, just rerun is good. However, this is very unfriendly to the scenario of a large job, because it has many subtasks and takes a long time to run. Once it fails, it will be expensive to rerun. At the same time, large jobs generally require much more nodes, and the probability of corresponding node failures will also increase. Once a node failure causes the MainPool to exit, the data on the corresponding node will be lost, and subsequent dependencies execution will fail. For example, a job has been running for more than 20 hours, and the execution is 90% complete, but because a certain MainPool exits, the entire job will fail. Large jobs are relatively common in modern data processing. A job will take up 1200 nodes or more, and it will take about 40 hours. In order to solve the node failure problem and ensure the stable and normal operation of jobs, a complete node failover solution is required.
Related issue number
Issue #3308
Check code requirements
- [ ] tests added / passed (if needed)
- [ ] Ensure all linting tests pass, see here for how to run them