[Core] (raylet) local_resource_manager.cc:287: "Object store memory is not idle" floods the logs

yunlou11 opened this issue 1 year ago · 3 comments

What happened + What you expected to happen

I have a four-node Ray cluster running in Docker containers. The raylet.out on all four nodes is being flooded with "local_resource_manager.cc:287: Object store memory is not idle", and the file has grown to over 2 GB:

/tmp/ray/session_latest/logs# find . -type f -size +5M -exec ls -lh {} +
-rw-r--r-- 1 root root 2.1G Aug 26 21:46 ./raylet.out
...

/tmp/ray/session_latest/logs/raylet.out

root@53b46bbb4b42:/tmp/ray/session_latest/logs# tail -f raylet.out
[2024-08-26 21:30:10,113 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,218 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,322 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,425 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,529 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,633 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:10,937 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,140 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,243 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,347 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,756 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,860 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:11,965 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,075 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,184 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-26 21:30:12,387 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
.....
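For reference, the rate of these messages can be estimated by grouping the log lines on their timestamp prefix. A minimal sketch using awk; the sample lines below stand in for raylet.out, so in real usage you would pipe the actual file instead:

```shell
# Count "not idle" messages per second by grouping on the timestamp.
# Sample lines stand in for raylet.out; replace the printf with
# `grep "is not idle" /tmp/ray/session_latest/logs/raylet.out`.
counts=$(printf '%s\n' \
  '[2024-08-26 21:30:10,113 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.' \
  '[2024-08-26 21:30:10,218 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.' \
  '[2024-08-26 21:30:11,140 I 393 393] (raylet) local_resource_manager.cc:287: Object store memory is not idle.' \
  | awk -F'[][ ]' '{print $2" "substr($3,1,8)}' | sort | uniq -c)
echo "$counts"
```

At roughly ten messages per second, each around 110 bytes, this one line alone accounts for close to 100 MB of raylet.out per day.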

Versions / Dependencies

Ray version: 2.10.0
Docker version: 23.0.1, build a5ee5b1
Python: 3.8.18
VM OS:

uname -m && cat /etc/*release
x86_64
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Reproduction script

version: '3'

services:
  ray_head:
    image: ray:2.10.0
    environment:
      - RAY_MODE=header
      # - RAY_LOG_TO_STDERR=1
      - RAY_COLOR_PREFIX=1
      - RAY_memory_monitor_refresh_ms=0
      # - NODE_IP_ADDRESS=ray_head
      # - RAY_DASHBOARD_RPC_PORT=7001
    volumes:
      - /etc/hosts:/etc/hosts
      - /data/ronds/ray:/tmp/ray
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 16384000000 # (this means 16GB)
    ports:
      - '6379:6379'
      - '8265:8265'
      - '55518:55518'
      - '44217:44217'
      - '44227:44227'
      - '10001:10001'
    deploy:
      replicas: 1
      # resources:
      #   limits:
      #     memory: 16G
      #   reservations:
      #     memory: 2G
      placement:
        constraints:
          - node.role == manager
          # - node.hostname == ray01
  ray_worker:
    image: ray:2.10.0
    environment:
      - RAY_MODE=worker
      # - RAY_LOG_TO_STDERR=1
      - RAY_COLOR_PREFIX=1
      # - RAY_memory_monitor_refresh_ms=0
      - HEAD_ADDRESS=ray_head:6379
    volumes:
      - /etc/hosts:/etc/hosts
      - /data/ronds/ray:/tmp/ray
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 16384000000 # (this means 16GB)
    deploy:
      mode: global
      # replicas: 1
      # resources:
      #   limits:
      #     memory: 16G
      #   reservations:
      #     memory: 2G
      placement:
        constraints:
          - node.role == worker
          # - node.hostname == ray02
      restart_policy:
        condition: on-failure
        delay: 5s
        window: 120s
    depends_on:
      - ray_head
networks:
  default:
    name: ray_network
    external: true
    attachable: true

Issue Severity

Medium: It is a significant difficulty but I can work around it.

yunlou11 avatar Aug 26 '24 14:08 yunlou11

I think we can make it a debug log; let me create a PR. Regarding why the logs are printed in this environment, I will follow up.

rkooo567 avatar Aug 26 '24 21:08 rkooo567

New raylet.out content:

....
[2024-08-27 10:08:28,193 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,297 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,400 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,478 I 115 148] (raylet) store.cc:564: ========== Plasma store: =================
Current usage: 0.0831483 / 14.7456 GB
- num bytes created total: 845281967869
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 2
- bytes in use: 320270
- objects evictable: 186
- bytes evictable: 82827991

- objects created by worker: 9
- bytes created by worker: 1563696
- objects restored: 0
- bytes restored: 0
- objects received: 179
- bytes received: 81584565
- objects errored: 0
- bytes errored: 0

[2024-08-27 10:08:28,502 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
[2024-08-27 10:08:28,706 I 115 115] (raylet) local_resource_manager.cc:287: Object store memory is not idle.
...

yunlou11 avatar Aug 27 '24 02:08 yunlou11

> I think we can make it a debug log. let me create a PR. Regarding why the logs are printed for this environment, I will follow up

@rkooo567 I have a long-running Ray job. I find that raylet.out and job-driver-[submission_id].log grow constantly; raylet.out, for example, has reached 2 GB. When these logs get too large, the Dashboard becomes very slow, its memory and CPU usage grow, and it eventually breaks down. I have already configured RAY_ROTATION_MAX_BYTES. How can I make raylet.out a rotating log? Thank you.

yunlou11 avatar Aug 27 '24 12:08 yunlou11

The general recommendation is to use a tool like https://linux.die.net/man/8/logrotate. That's how we rotate logs in the Anyscale platform as well.

rkooo567 avatar Aug 30 '24 05:08 rkooo567
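The logrotate suggestion above could look roughly like the following config. This is a sketch, not an official recommendation: the path, size threshold, and rotation count are assumptions for this particular cluster, and copytruncate is used because the raylet keeps its log file descriptor open rather than reopening the file on a signal (see the SIGHUP experiment below in this thread):

```text
# /etc/logrotate.d/ray  (hypothetical; adjust paths and limits)
/tmp/ray/session_latest/logs/raylet.out {
    size 100M        # rotate once the file exceeds 100 MB
    rotate 5         # keep at most 5 rotated copies
    compress
    missingok        # don't error if the file is absent
    copytruncate     # copy then truncate in place, since the
                     # raylet does not reopen its log file
}
```

Note that copytruncate can lose a few log lines written between the copy and the truncate; for a log this noisy that is usually an acceptable trade-off.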

> The general recommendation is to use a tool like https://linux.die.net/man/8/logrotate. That's how we rotate logs in anyscale platform as well

I thought the following would work if the raylet supported logrotate-style log reopening:

mv raylet.out raylet.out.1
touch raylet.out
kill -HUP $(pgrep raylet)

The raylet receives the SIGHUP and should reopen raylet.out, but it does not:

ls -lh | grep raylet
-rw-r--r-- 1 root root     0 Nov  7 10:02 raylet.err
-rw-r--r-- 1 root root     0 Nov  7 11:40 raylet.out
-rw-r--r-- 1 root root   11M Nov  7 11:56 raylet.out.1

It still writes to raylet.out.1. Please help, @rkooo567.

yunlou11 avatar Nov 07 '24 03:11 yunlou11
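The behavior reported above is expected for any process that does not install a SIGHUP handler to reopen its log: an open file descriptor follows the inode, not the filename, so renaming the file does not redirect output. A small self-contained shell demonstration of the same mv/touch sequence (temporary paths, no raylet involved):

```shell
# A writer holding an open fd keeps writing to the renamed file;
# the freshly touched replacement stays empty. This mirrors
# 'mv raylet.out raylet.out.1; touch raylet.out'.
tmpdir=$(mktemp -d)
{
  echo "line before rotation"
  mv "$tmpdir/app.log" "$tmpdir/app.log.1"  # rename while fd is open
  touch "$tmpdir/app.log"                   # new empty file, new inode
  echo "line after rotation"                # still goes to app.log.1
} > "$tmpdir/app.log"
cat "$tmpdir/app.log.1"
```

This is exactly why logrotate's copytruncate option (which copies the contents aside and truncates the original file in place, leaving the fd valid) is the usual workaround for daemons that never reopen their logs.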