
[<Ray component: Core|RLlib|etc...>] Ray init problem with GPUs

Open guangzlu opened this issue 1 year ago • 2 comments

What happened + What you expected to happen

Ray crashes when calling ray.init(num_gpus=2).

error info: core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

More info: it works fine with CPUs only, that is, ray.init(num_cpus=2). rocm-smi shows that we have AMD GPUs.

Questions: Is this a problem with how Ray detects GPUs? What is the basic mechanism Ray uses to detect GPUs? Is there any other way to pin down what the problem is? Is there a workaround, or a way to solve it?

Versions / Dependencies

Ray version: 2.9.3

Reproduction script

import ray
ray.init(num_gpus=2)

Issue Severity

High: It blocks me from completing my task.

guangzlu, May 20 '24 07:05

raylet_out_2024-05-20-16-54-06.txt Here is our raylet.out log.

guangzlu, May 20 '24 09:05

Update: it runs when both num_cpus and num_gpus are specified. But why doesn't it work when only num_gpus is set? How does Ray detect num_cpus by default, and how should num_cpus be set properly?

guangzlu, May 20 '24 09:05

We don't have AMD GPU environments. If you can provide us an environment to reproduce, please ping us on Slack. https://ray-distributed.slack.com/team/U055TQCDAAY

rynewang, May 20 '24 22:05

If you don't set num_cpus or num_gpus, Ray will auto detect. Have you tried to not set num_gpus and see if it can detect the CPU and GPU counts?

rynewang, May 20 '24 22:05

> If you don't set num_cpus or num_gpus, Ray will auto detect. Have you tried to not set num_gpus and see if it can detect the CPU and GPU counts?

Yes, we tried that, but with no arguments at all, just ray.init(), it still crashes. Sorry, we cannot provide an AMD environment right now. But I think the problem is on the CPU side: because Ray cannot detect the CPUs automatically, we need to set num_cpus manually. Can you tell me how Ray detects CPUs? And is there any way to find out more about the problem, for example by checking whether the CPU threads are working properly?

guangzlu, May 21 '24 02:05

We use this code https://github.com/ray-project/ray/blob/e75689e85552cc7b7dc0b4724ff7329496064435/python/ray/_private/utils.py#L544 to detect CPUs. It reads from cgroup files or from the multiprocessing package.
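
For readers who want to check this kind of detection on their own machine, here is a rough sketch of the idea. It is not Ray's actual code: the function names, the default paths (the usual cgroup v2 cpu.max and cgroup v1 cpu.cfs_quota_us / cpu.cfs_period_us locations), and the fallback order are all assumptions made for illustration.

```python
import multiprocessing
from typing import Optional


def read_cgroup_cpu_limit(
    v2_path: str = "/sys/fs/cgroup/cpu.max",
    v1_quota_path: str = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
    v1_period_path: str = "/sys/fs/cgroup/cpu/cpu.cfs_period_us",
) -> Optional[float]:
    """Return the cgroup CPU limit in cores, or None if no limit is readable.

    Hypothetical helper for illustration; the paths are assumed defaults.
    """
    try:
        # cgroup v2: a single file holding "<quota> <period>" or "max <period>".
        with open(v2_path) as f:
            quota, period = f.read().split()
        if quota == "max":
            return None  # no CPU limit configured
        return int(quota) / int(period)
    except (OSError, ValueError, ZeroDivisionError):
        pass  # file missing or malformed: try the cgroup v1 layout

    try:
        # cgroup v1: quota and period live in separate files; quota is -1 if unset.
        with open(v1_quota_path) as f:
            quota_us = int(f.read())
        with open(v1_period_path) as f:
            period_us = int(f.read())
        if quota_us > 0 and period_us > 0:
            return quota_us / period_us
    except (OSError, ValueError):
        pass
    return None


def detect_num_cpus(cgroup_limit: Optional[float]) -> int:
    """Combine the host CPU count with an optional cgroup limit."""
    host_cpus = multiprocessing.cpu_count()
    if cgroup_limit is None:
        return host_cpus
    # Never report fewer than one CPU or more than the host has.
    return max(1, min(host_cpus, int(cgroup_limit)))


print("host CPUs:", multiprocessing.cpu_count())
print("cgroup limit:", read_cgroup_cpu_limit())
print("effective CPUs:", detect_num_cpus(read_cgroup_cpu_limit()))
```

Comparing the three printed values shows whether a container or cgroup limit is hiding CPUs that the host actually has.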

rynewang, May 21 '24 04:05

Update: multiprocessing.cpu_count() returns the CPU count successfully. But we cannot set num_cpus too high. The machine has 192 CPUs, but num_cpus only works up to 10; if we set it to 20, Ray fails. Here is the log for num_cpus=20: ray-num-cpu-20-log.txt

guangzlu, May 21 '24 07:05

@guangzlu,

Could you tell us what hardware you are using so we can try to reproduce this on our side?

Also, could you try raising the open-file limit with ulimit -n 65536 python?

Also, could you try the latest Ray release and see whether the issue still exists?
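
As a way to check the ulimit suggestion from inside Python, the sketch below reads the current open-file limit and tries to raise the soft limit before Ray is started. It uses only the standard-library resource module (Unix only). The helper names are made up for illustration, and the thread does not establish that this limit is the actual cause of the failure.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target=65536):
    """Try to raise the soft open-file limit to `target`; return the resulting soft limit.

    Hypothetical helper: it never lowers the limit and never exceeds the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if soft != resource.RLIM_INFINITY and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the platform refused the new value; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print("before:", nofile_limits())
print("soft limit now:", raise_soft_nofile(65536))
```

Resource limits are inherited by child processes, so calling raise_soft_nofile() before ray.init() in the same script should have an effect similar to running ulimit -n in the shell that launches Python.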

jjyao, May 28 '24 21:05