datafusion-ballista

Scheduler loses registered executors when it is restarted in `push` mode

Open liukun4515 opened this issue 3 years ago • 3 comments

Describe the bug: When I restart the scheduler, it loses all information about the registered executors.

To Reproduce: Start a scheduler with the config below:

scheduler_policy="PushStaged"

Start an executor with the config below:

scheduler_port=50050
scheduler_host="localhost"
# PushStaged or PullStaged
task_scheduling_policy="PushStaged"

Then kill the scheduler and restart it using the same config.

After the restart, the scheduler has lost all the registered executors it held in memory.

Expected behavior: The scheduler should recover this in-memory data after it restarts.

Solution: have each executor send its registration information along with its heartbeat, so a restarted scheduler can re-register it.
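The proposed solution can be sketched roughly as follows. This is only an illustration of the idea, not Ballista's actual code: `ExecutorMetadata`, `Heartbeat`, `SchedulerState`, and all of their fields are hypothetical names chosen for this example. Each heartbeat carries the executor's full registration data, so a scheduler whose in-memory state is empty after a restart can rebuild it from the next heartbeat it receives.

```rust
use std::collections::HashMap;

/// Hypothetical registration data an executor resends with every heartbeat.
#[derive(Debug, Clone, PartialEq)]
pub struct ExecutorMetadata {
    pub id: String,
    pub host: String,
    pub port: u16,
    pub task_slots: u32,
}

/// Hypothetical heartbeat message: a timestamp plus the full registration data.
#[derive(Debug, Clone)]
pub struct Heartbeat {
    pub metadata: ExecutorMetadata,
    pub timestamp_secs: u64,
}

/// In-memory scheduler state; it starts out empty after every restart.
#[derive(Default)]
pub struct SchedulerState {
    executors: HashMap<String, ExecutorMetadata>,
    last_seen: HashMap<String, u64>,
}

impl SchedulerState {
    /// Handle a heartbeat. Returns true if the executor was unknown and
    /// therefore had to be (re-)registered from the heartbeat payload.
    pub fn on_heartbeat(&mut self, hb: Heartbeat) -> bool {
        let id = hb.metadata.id.clone();
        self.last_seen.insert(id.clone(), hb.timestamp_secs);
        // `insert` returns None when the id was not present, e.g. right
        // after a scheduler restart; otherwise it refreshes the metadata.
        self.executors.insert(id, hb.metadata).is_none()
    }

    /// Number of executors the scheduler currently knows about.
    pub fn executor_count(&self) -> usize {
        self.executors.len()
    }
}
```

With this shape, a freshly constructed `SchedulerState::default()` stands in for a restarted scheduler: the first heartbeat from each executor returns `true` and restores its entry, and later heartbeats only refresh the timestamp and metadata.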

liukun4515, Jul 20 '22 08:07

@liukun4515 Are you running in standalone mode? If you are using etcd as the state backend, the scheduler should initialize any registered executors from the backend. In standalone mode, however, the persistent state is stored in a sled DB on disk (in a temp file). If we wanted standalone mode to persist state across restarts, we would need to make the sled DB location a configurable path.
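To illustrate why the storage location matters, here is a minimal sketch that uses a plain text file as a stand-in for the sled DB. It does not use sled or any Ballista API; `ExecutorStore`, its methods, and the `executors.txt` file name are assumptions made for this example. The point is only that state written under a fixed, caller-supplied directory can be read back by a new process, whereas a fresh temp location per run starts empty.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the on-disk state store: one executor id per line.
pub struct ExecutorStore {
    file: PathBuf,
}

impl ExecutorStore {
    /// Open the store under `dir`, creating the directory if needed.
    /// Passing the same `dir` on every start is what makes state survive.
    pub fn open(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { file: dir.join("executors.txt") })
    }

    /// Persist an executor id, skipping ids that are already stored.
    pub fn register(&self, executor_id: &str) -> io::Result<()> {
        let mut ids = self.load()?;
        if !ids.iter().any(|id| id == executor_id) {
            ids.push(executor_id.to_string());
        }
        fs::write(&self.file, ids.join("\n"))
    }

    /// Read back every persisted executor id; empty if nothing was written yet.
    pub fn load(&self) -> io::Result<Vec<String>> {
        match fs::read_to_string(&self.file) {
            Ok(text) => Ok(text.lines().map(str::to_string).collect()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(Vec::new()),
            Err(e) => Err(e),
        }
    }
}
```

Opening the store twice with the same directory models a restart with a configured path: the second `open` sees the ids registered by the first. Opening it with a new temporary directory each time models the behaviour described above, where the restarted scheduler finds nothing.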

thinkharderdev, Jul 25 '22 10:07

Is it still an issue?

mingmwang, Aug 15 '22 12:08

@liukun4515 @thinkharderdev @mingmwang This can be fixed by specifying the `--sled-dir` parameter when starting the scheduler service.

r4ntix, Sep 15 '22 11:09