Elly.jl
Elly.jl copied to clipboard
When workers fail to launch how do you find out why
In the example for the yarn cluster manager, I get a timeout when launching the workers. How can I find out why they are failing to launch in order to debug it? I guess i need to look at some log files somewhere.
There may be some errors logged in Yarn logs.
In a standalone installation, it is usually the $HADOOP_HOME/logs folder.
While launching a lot of workers, the error are usually:
- Not enough resources. Yarn can not fulfill the request. Yarn would log these errors.
- Default time-out is too less. Specify a higher
launch_timeoutwhile creating the cluster manager.