[Core] (raylet) local_resource_manager.cc:287: "Object store memory is not idle." floods raylet.out
What happened + What you expected to happen
I have a four-node Ray cluster running in Docker containers. The raylet.out on all four nodes is flooded with "local_resource_manager.cc:287: Object store memory is not idle."
raylet.out has grown to about 2 GB:
/tmp/ray/session_latest/logs# find . -type f -size +5M -exec ls -lh {} +
-rw-r--r-- 1 root root 2.1G Aug 26 21:46 ./raylet.out
...
/tmp/ray/session_latest/logs/raylet.out
root@53b46bbb4b42:/tmp/ray/session_latest/logs# tail -f raylet.out
[2024-08-26 21:30:10,113 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,218 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,322 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,425 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,529 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,633 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,937 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,140 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,243 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,347 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,756 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,860 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,965 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,075 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,184 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,387 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
.....
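To gauge how fast this message accumulates, the timestamps in an excerpt like the one above can be turned into a message rate. The following is only a rough sketch: `parse_timestamp` and `message_rate` are hypothetical helper names, and the regular expression assumes the timestamp layout shown in the log lines above.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "[YYYY-MM-DD HH:MM:SS,mmm " of a raylet log line
# (layout assumed from the excerpt in this report).
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def parse_timestamp(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def message_rate(lines, needle="Object store memory is not idle"):
    """Return matching messages per second over the span of the excerpt."""
    stamps = [parse_timestamp(line) for line in lines if needle in line]
    stamps = [s for s in stamps if s is not None]
    if len(stamps) < 2:
        return 0.0
    span = (stamps[-1] - stamps[0]).total_seconds()
    # N timestamps delimit N - 1 intervals.
    return (len(stamps) - 1) / span if span > 0 else 0.0
```

Fed the 16 lines of the excerpt above (21:30:10,113 through 21:30:12,387), this comes out at roughly 6 to 7 messages per second, which is enough to reach gigabytes on a long-running node.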
Versions / Dependencies
Ray version: 2.10.0
Docker version: 23.0.1, build a5ee5b1
Python: 3.8.18
VM OS:
uname -m && cat /etc/*release
x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Reproduction script
version: '3'
services:
  ray_head:
    image: ray:2.10.0
    environment:
      - RAY_MODE=header
      # - RAY_LOG_TO_STDERR=1
      - RAY_COLOR_PREFIX=1
      - RAY_memory_monitor_refresh_ms=0
      # - NODE_IP_ADDRESS=ray_head
      # - RAY_DASHBOARD_RPC_PORT=7001
    volumes:
      - /etc/hosts:/etc/hosts
      - /data/ronds/ray:/tmp/ray
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 16384000000 # (this means 16GB)
    ports:
      - '6379:6379'
      - '8265:8265'
      - '55518:55518'
      - '44217:44217'
      - '44227:44227'
      - '10001:10001'
    deploy:
      replicas: 1
      # resources:
      #   limits:
      #     memory: 16G
      #   reservations:
      #     memory: 2G
      placement:
        constraints:
          - node.role == manager
          # - node.hostname == ray01
  ray_worker:
    image: ray:2.10.0
    environment:
      - RAY_MODE=worker
      # - RAY_LOG_TO_STDERR=1
      - RAY_COLOR_PREFIX=1
      # - RAY_memory_monitor_refresh_ms=0
      - HEAD_ADDRESS=ray_head:6379
    volumes:
      - /etc/hosts:/etc/hosts
      - /data/ronds/ray:/tmp/ray
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 16384000000 # (this means 16GB)
    deploy:
      mode: global
      # replicas: 1
      # resources:
      #   limits:
      #     memory: 16G
      #   reservations:
      #     memory: 2G
      placement:
        constraints:
          - node.role == worker
          # - node.hostname == ray02
      restart_policy:
        condition: on-failure
        delay: 5s
        window: 120s
    depends_on:
      - ray_head
networks:
  default:
    name: ray_network
    external: true
    attachable: true
Issue Severity
Medium: It is a significant difficulty but I can work around it.
I think we can make this a debug-level log; let me create a PR. I will follow up on why the message is printed in this environment.
New log output:
....
[2024-08-27 10:08:28,193 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,297 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,400 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,478 I 115 148] (raylet) store.cc:564: ========== Plasma store: =================
Current usage: 0.0831483 / 14.7456 GB
- num bytes created total: 845281967869
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 2
- bytes in use: 320270
- objects evictable: 186
- bytes evictable: 82827991
- objects created by worker: 9
- bytes created by worker: 1563696
- objects restored: 0
- bytes restored: 0
- objects received: 179
- bytes received: 81584565
- objects errored: 0
- bytes errored: 0
[2024-08-27 10:08:28,502 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,706 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
...
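The Plasma store dump above is internally consistent: "bytes in use" plus "bytes evictable" adds up to the reported "Current usage", so the store holds about 83 MB even though only two objects are in use. Whether non-zero usage is the exact condition behind the "not idle" message is an assumption, not something confirmed in this thread. A small sketch of the check follows; `parse_plasma_counters` is a hypothetical helper that reads only the "- key: value" lines of such a dump.

```python
def parse_plasma_counters(dump):
    """Collect the integer '- key: value' counters from a Plasma store dump."""
    counters = {}
    for raw in dump.splitlines():
        line = raw.strip()
        if not line.startswith("- "):
            continue  # skip the header and free-form lines
        key, _, value = line[2:].partition(":")
        value = value.strip()
        if value.isdigit():
            counters[key.strip()] = int(value)
    return counters


# Values copied from the dump above.
dump = """
Current usage: 0.0831483 / 14.7456 GB
- objects in use: 2
- bytes in use: 320270
- objects evictable: 186
- bytes evictable: 82827991
"""

counters = parse_plasma_counters(dump)
used_bytes = counters["bytes in use"] + counters["bytes evictable"]
used_gb = used_bytes / 1e9  # matches "Current usage" when rounded to 7 places
```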
@rkooo567 I have a long-running Ray job, and raylet.out and job-driver-[submission_id].log grow constantly; raylet.out, for example, reaches 2 GB. When the logs get too large, the Dashboard becomes very slow, memory and CPU usage grow, and it can even break down. I have already configured 'RAY_ROTATION_MAX_BYTES'. How can I make raylet.out a rotating log? Thank you.
The general recommendation is to use a tool like https://linux.die.net/man/8/logrotate. That's how we rotate logs on the Anyscale platform as well.
I think the following should work if raylet supports logrotate-style rotation:
mv raylet.out raylet.out.1
touch raylet.out
kill -HUP $(pgrep raylet)
raylet received the SIGHUP (kill -HUP) and should have reopened raylet.out, but it did not:
ls -lh | grep raylet
-rw-r--r-- 1 root root 0 Nov 7 10:02 raylet.err
-rw-r--r-- 1 root root 0 Nov 7 11:40 raylet.out
-rw-r--r-- 1 root root 11M Nov 7 11:56 raylet.out.1
It still writes to raylet.out.1. Please help, @rkooo567.
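This matches how renames work on Linux: mv changes only the directory entry, so a process that keeps its file descriptor open goes on writing to the same inode, now named raylet.out.1. The listing above suggests raylet does not reopen its output on SIGHUP. A common alternative for such writers is copy-then-truncate rotation, which is what logrotate's copytruncate directive does. The sketch below illustrates the technique in Python; the function name `copytruncate` and the `keep` parameter are hypothetical, and it assumes the writer opened the file in append mode, which has not been verified for raylet here.

```python
import os
import shutil


def copytruncate(log_path, keep=3):
    """Rotate log_path in place: shift old copies, copy, then truncate."""
    # Shift existing rotations: log.2 -> log.3, log.1 -> log.2, ...
    for i in range(keep - 1, 0, -1):
        src = f"{log_path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{log_path}.{i + 1}")
    # Copy the current contents; the original inode is left untouched.
    shutil.copyfile(log_path, f"{log_path}.1")
    # Truncate the same inode that the writer still has open.
    # Lines written between the copy and the truncate are lost.
    os.truncate(log_path, 0)
```

A call such as `copytruncate("/tmp/ray/session_latest/logs/raylet.out")` could be run periodically, for example from cron. If the writer did not open the file in append mode, it would keep writing at its old offset after the truncate and leave a sparse file, so this is worth checking on a test node first.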