Issue with raylet error
Hi, I'm using vLLM to run LLaMA-13B on two V100-16GB GPUs. I deployed vLLM with the API server. However, when the context is long, the server returns:
[2023-08-09 22:39:16,002 E 209 223] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2023-08-09_22-27-32_558284_37 is over 95% full, available space: 60427313152; capacity: 1599538507776. Object creation will fail if spilling is required.
and the model gets stuck and cannot return anything. Is this because the GPU memory is too small, or are there other approaches to resolving this issue? Thanks!
@ZihanWang314 I got the same warning, but the model is still running. It seems there is not enough disk space; just use df -h to check the disk space.
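If it's easier to script, here's a stdlib-only Python sketch that reports the same numbers df -h shows for the filesystem holding /tmp/ray (the default Ray temp dir):

import shutil

# shutil.disk_usage reports usage of the whole filesystem containing the path,
# which is what raylet's file_system_monitor checks against the 95% threshold.
usage = shutil.disk_usage("/tmp/ray")
percent_used = 100 * (usage.total - usage.free) / usage.total
print(f"total={usage.total:,} B, free={usage.free:,} B, used={percent_used:.1f}%")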
You can point /tmp/ray at a directory with free space; for example, use ln -s space_free_dir /tmp/ray
I'm curious because I hit the same problem: the disk space used by Ray's spilling keeps growing until an out-of-disk error occurs.
I got the same error, and ln -s space_free_dir /tmp/ray does not work for me.
How do I tell Ray not to use /tmp/ray?
Did anyone resolve this issue? I'm struggling with the same problem.
I had this issue when using a Docker container. I was able to work around it by mounting an empty host directory to /tmp/ray. I hope this solution helps someone.
For example:
mkdir ./tmp_local
docker run -v "$(pwd)/tmp_local":/tmp/ray ...
(Note that docker run -v needs an absolute host path for a bind mount, hence the $(pwd).)
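If you want to verify from inside the container that /tmp/ray actually landed on the mounted volume (and not on the container's overlay filesystem), a small stdlib-only sketch:

import os, shutil

# A bind mount puts /tmp/ray on a different device than the container root,
# so comparing st_dev tells you whether the mount took effect.
print("mounted separately:", os.stat("/tmp/ray").st_dev != os.stat("/").st_dev)
print("free bytes on /tmp/ray:", shutil.disk_usage("/tmp/ray").free)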
Clean up the disk and keep it under 95% usage; that should fix the issue.
Same error here: raylet is taking up space in /tmp.
Is there a way to tell raylet to use another folder for temporary objects directly from vLLM's options?
This is a problem when using managed services such as Vertex AI or SageMaker that run the model in a container for you: the container is started with args the user has no control over, so you can't mount /tmp over the host's volume to get more storage.
If you're running the model in a container, allocate enough shared memory via the --shm-size arg; then, within your container:
ray.init(_temp_dir="/dev/shm/tmp_or_whatever", num_gpus=NUM_GPUS, ...)
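For reference, a minimal sketch of this approach (the temp dir and model name are placeholders, and it assumes vLLM reuses the already-initialized Ray session, as it did at the time of this thread):

import ray
from vllm import LLM

# Initialize Ray first so its session/temp dir (and any object spilling)
# lives in shared memory instead of /tmp/ray.
ray.init(_temp_dir="/dev/shm/ray_tmp", num_gpus=2)

# vLLM attaches to the existing Ray runtime for tensor parallelism.
llm = LLM(model="huggyllama/llama-13b", tensor_parallel_size=2)

Keep in mind /dev/shm is RAM-backed, so anything spilled there consumes memory; size --shm-size accordingly.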
Cleaning up the disk and keeping it under 95% usage is the proper solution. When your disk reaches 100%, things will hang regardless. It is likely that the Hugging Face cache is full of downloaded model weights (I've experienced this before).
Some potential solutions:
- Clean /tmp/ray
- Clean other dirs with high disk usage (likely the Hugging Face cache)
- Use a bigger volume
- Use a different spilling dir on a disk with more space, via the config described at https://docs.ray.io/en/master/ray-core/objects/object-spilling.html#cluster-mode (see the sketch below)
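For the last item, a sketch based on the linked Ray docs (the spill directory is a placeholder; pick one on a large volume):

import json
import ray

# Redirect Ray's object spilling away from /tmp/ray to a bigger disk.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/big_disk/ray_spill"}}
        )
    }
)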
If you cannot control your temp dir, another option is to disable this check, but note that this can cause a hang. You can do it by setting RAY_local_fs_capacity_threshold=1 when you start Ray, i.e., RAY_local_fs_capacity_threshold=1 ray start ...
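If you start Ray from Python instead of ray start, the equivalent (a sketch, assuming the variable is inherited by the raylet processes that ray.init spawns) is:

import os
import ray

# A threshold of 1 (i.e. 100%) effectively disables the 95%-full check.
# Caveat from above: once the disk truly fills, things may hang instead.
os.environ["RAY_local_fs_capacity_threshold"] = "1"
ray.init()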
First: rm -rf /tmp/ray. Then: mkdir an empty dir on a volume with plenty of free space. Then: ln -s new_empty_dir /tmp/ray. Finally: check that the symlink took effect: df /tmp/ray
I don't quite understand this thread. What is the actual issue here? I am facing it as well with a Docker container using Ray (not actually using vLLM, but the issue is the same, I suppose). I have enough hard-drive storage, but it somehow calculates the 'available space' wrong. So I read that mounting a random empty folder from the host machine to the tmp folder helps here? Why? How? Does it make things slower?
I am in your shoes, but I don't understand the mounting solution. Why does this work? How is the available space calculated? Is /tmp limited in size? It's so weird.