Welcome_to_BlazingSQL_Notebooks

Number of workers not matching number of nodes

Open tornadoslims opened this issue 3 years ago • 0 comments

+1 on building such an awesome product, guys. Here's an issue I've run into a couple of times:

If you hit an OOM or do something else that corrupts state, you can lose workers that won't come back with a

bc.dask_client.restart() or client.restart()

This isn't a huge issue because it can be fixed quickly by stopping and starting the cluster, and if a 32-node cluster drops to 25 workers, everything still works.

More of an issue: I just stopped and started a 128-node cluster and it came up with only 1 worker. Restarting the Dask client from within Python didn't help. I'm trying to reproduce it. I took some screenshots and kept the logs, and will send them over.
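
For anyone wanting to check for this from Python, here is a rough sketch that compares the worker count the scheduler reports with the number of nodes. It assumes `bc.dask_client` is a `dask.distributed.Client`; `count_missing_workers` is a made-up helper name for illustration, not part of BlazingSQL or Dask.

```python
def count_missing_workers(client, expected_workers):
    """Return how many of the expected workers the scheduler does not report.

    Assumption: `client` is a dask.distributed.Client (or any object with a
    compatible nthreads() method) and `expected_workers` is the node count.
    """
    # Client.nthreads() returns a dict mapping each connected worker's
    # address to its thread count, so its length is the live worker count.
    connected = len(client.nthreads())
    return max(expected_workers - connected, 0)


# Usage sketch (assumption: bc is a BlazingContext backed by a Dask client):
# missing = count_missing_workers(bc.dask_client, expected_workers=128)
# if missing:
#     print(f"{missing} workers did not come back")
```

In the 128-node case above this would report 127 missing workers. Dask's `Client.wait_for_workers` can also block until a given number of workers has connected, which may help tell a slow startup apart from workers that are really gone.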

JB

tornadoslims, Sep 20 '20 07:09