Ray init problem with GPUs
What happened + What you expected to happen
Ray crashes when calling ray.init(num_gpus=2).
error info: core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
More info: it works when only CPUs are specified, i.e. ray.init(num_cpus=2). rocm-smi shows that the machine has AMD GPUs.
Questions: Does Ray have a problem detecting the GPUs? What is the basic mechanism Ray uses to detect GPUs? Is there another way to pin down what the problem is? Is there a workaround or a fix?
Versions / Dependencies
Ray version: 2.9.3
Reproduction script
import ray
ray.init(num_gpus=2)
Issue Severity
High: It blocks me from completing my task.
raylet_out_2024-05-20-16-54-06.txt Here is our raylet.out log.
Update: it runs when both num_cpus and num_gpus are specified. But why does it fail when only num_gpus is set? How does Ray detect num_cpus by default, and how should num_cpus be set properly?
We don't have AMD GPU environments. If you can provide us an environment to reproduce, please ping us on Slack. https://ray-distributed.slack.com/team/U055TQCDAAY
If you don't set num_cpus or num_gpus, Ray auto-detects them. Have you tried leaving num_gpus unset to see whether it detects the CPU and GPU counts?
Yes, we tried that, but with no arguments at all (just ray.init()) it still crashes. Sorry that we cannot provide an AMD environment right now. I think the problem is on the CPU side: Ray cannot detect the CPUs automatically, so we need to set num_cpus manually. Can you tell me how Ray detects CPUs? And is there any way to find out more about the problem, for example by checking whether the CPU threads are working properly?
We use this code https://github.com/ray-project/ray/blob/e75689e85552cc7b7dc0b4724ff7329496064435/python/ray/_private/utils.py#L544 to detect CPUs. It reads from cgroup files or from the multiprocessing package.
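To compare what those two sources report on the affected machine, a small standalone check along these lines may help. This is only a sketch, not Ray's actual implementation: the helper names are made up for illustration, and it assumes a Linux host with cgroup v2 mounted at /sys/fs/cgroup (on cgroup v1 the quota lives in different files, which this sketch does not read).

```python
import multiprocessing
import os


def parse_cgroup_v2_cpu_max(text):
    """Parse cgroup v2 cpu.max contents ("<quota> <period>" or "max <period>").

    Returns the CPU limit as a float, or None when no limit is set.
    """
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None
    quota, period = int(parts[0]), int(parts[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def read_cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Read the cgroup v2 CPU limit; None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return parse_cgroup_v2_cpu_max(f.read())
    except (OSError, ValueError):
        return None


def report_cpu_counts():
    """Collect CPU counts from several sources so they can be compared."""
    counts = {"multiprocessing": multiprocessing.cpu_count()}
    # sched_getaffinity exists only on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        counts["affinity"] = len(os.sched_getaffinity(0))
    counts["cgroup_v2_limit"] = read_cgroup_cpu_limit()
    return counts
```

Printing report_cpu_counts() inside the same container or job environment where ray.init() fails shows whether the cgroup limit or the CPU affinity mask is much smaller than the 192 CPUs the machine has.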
Update: multiprocessing.cpu_count() returns the CPU count successfully. However, we cannot set num_cpus too high. The machine has 192 CPUs, but num_cpus only works up to 10. With num_cpus=20, initialization fails. Here is the log for num_cpus=20: ray-num-cpu-20-log.txt
@guangzlu,
could you tell us which hardware you are using so we can try to reproduce this on our side?
Also, could you try raising the open-file limit with ulimit -n 65536 before starting python?
And could you try the latest Ray to see whether the issue still exists?
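The open-file limit mentioned above can also be inspected, and raised up to the hard limit, from inside Python before calling ray.init(). The sketch below uses only the standard-library resource module (Unix only). The function name is hypothetical, and whether this limit is the actual cause of the failure is an assumption that still needs to be verified.

```python
import resource


def raise_soft_nofile(target=65536):
    """Try to raise the soft open-file limit to `target`, capped by the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise the limit, never lower it.
    if soft != resource.RLIM_INFINITY and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some systems reject the new value; keep the current limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_soft_nofile())
```

The change applies to the current process and to child processes started afterwards, so it would have to run before ray.init() in the same script. A soft limit that was low before the call would be consistent with failures appearing only at larger num_cpus values, although it does not prove that this is the cause.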