raydp
Resource availability issues
I am running the PyTorch_nyctaxi example on a node with 80 CPUs, each with 20 physical cores. If I set the number of executors to more than 16 and the number of workers in TorchEstimator to more than 20, the code won't run and I get resource-unavailable errors such as these:
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet) what(): Resource temporarily unavailable
2021-02-26 11:42:09,982 WARNING worker.py:1090 -- The node with node id e7d998b5c895816e6446685482cb0392ad76245a8dab474562b232d9 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Obviously, I am utilizing only a small portion of the computing resources available on the node. I do not specify num_cpus in ray.init(), so it should automatically detect all 80.
Any idea why this is happening?
Hi @yanivg10, you mean you have a cluster with 20 workers and each worker has 80 cores? RayDP will occupy num_executors * executor_cores CPUs on the cluster after you call raydp.init_spark.
So you want to release the Spark cluster resources when training the model, right?
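The CPU accounting described above can be sketched as follows. This is only the arithmetic, not the Spark-on-Ray internals; the values are taken from the configuration quoted later in the thread:

```python
# Sketch of the CPU accounting behind raydp.init_spark (assumption: the
# real internals are more involved; this only shows the resource math).
num_executors = 16        # values from the configuration in this thread
cores_per_executor = 1

# RayDP asks Ray for one CPU per executor core, so init_spark claims:
cpus_claimed = num_executors * cores_per_executor
print(cpus_claimed)  # 16

# The actual call needs a running Ray cluster, e.g.:
# import ray, raydp
# ray.init()
# spark = raydp.init_spark("nyc_taxi", num_executors,
#                          cores_per_executor, "1GB")
```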
My cluster is composed of one node with 80 CPUs, each with 20 cores. But RayDP does not allow me to use more than the following settings (I get resource-availability errors when increasing num_executors):
num_executors = 16
cores_per_executor = 1
memory_per_executor = "1GB"
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)
As for your question: yes, how do I free the Spark cluster resources in order to use those CPUs for training?
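For the release question, RayDP provides raydp.stop_spark(), which tears down the Spark executors and returns their CPUs to Ray. A hedged sketch of the ETL-then-train pattern (parameter values are illustrative and this requires a live Ray cluster):

```python
import ray
import raydp

ray.init()  # detects the node's CPUs automatically

# ETL phase: the Spark executors hold num_executors * executor_cores CPUs.
spark = raydp.init_spark("nyc_taxi", num_executors=16,
                         executor_cores=1, executor_memory="1GB")
# ... run the Spark preprocessing and collect/convert the results ...

# Release the executors so their CPUs are free for the training workers.
raydp.stop_spark()
```

After stop_spark() returns, a TorchEstimator with many workers can be scheduled without competing against the (now released) executor actors.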
@yanivg10, did you check the Ray dashboard? How many cores did Ray detect after running ray.init()? RayDP will create actors based on the parameters you pass to raydp.init_spark. It sounds like Ray doesn't have enough resources to meet the requirement.
Thanks. I was able to find a configuration that uses my resources correctly without throwing errors.
Now I want to run RayDP on a cluster of multiple nodes. Do you have instructions on configuring RayDP on multiple nodes?
You can refer to https://docs.ray.io/en/master/cluster/quickstart.html
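In short, per that quickstart you start Ray on every node first and then connect your driver to the head node. A sketch (replace <head-ip> with your head node's address; ports and flags follow the linked docs):

```shell
# On the head node (6379 is Ray's default port):
ray start --head --port=6379

# On each worker node, join the cluster:
ray start --address='<head-ip>:6379'
```

After that, calling ray.init(address="auto") in your driver script connects to the whole cluster, and raydp.init_spark can request executors across all nodes.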
Closing as stale.