
[Dashboard | Core] Some Ray subprocesses exited unexpectedly: dashboard [exit code=-11]

Open hebiao064 opened this issue 9 months ago • 5 comments

What happened + What you expected to happen

ray start --head --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --disable-usage-stats --verbose fails in my latest Docker image build.

Error log in terminal:

Some Ray subprocesses exited unexpectedly:
  dashboard [exit code=-11]
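For context, a negative exit code from a Ray subprocess is the number of the POSIX signal that killed it, so -11 means the dashboard process died from signal 11 (SIGSEGV). A quick illustrative way to decode it (the exit_code value here is just taken from the message above):

```python
import signal

exit_code = -11  # as reported for the dashboard subprocess
if exit_code < 0:
    # Negative subprocess exit codes encode the terminating signal number.
    sig = signal.Signals(-exit_code)
    print(f"dashboard was terminated by {sig.name}")  # SIGSEGV
```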

cat dashboard_agent.log

2024-04-29 18:24:44,393 INFO agent.py:138 -- Dashboard agent grpc address: 0.0.0.0:63923
2024-04-29 18:24:44,394 INFO utils.py:112 -- Get all modules by type: DashboardAgentModule
2024-04-29 18:24:44,842 INFO utils.py:145 -- Available modules: [<class 'ray.dashboard.modules.event.event_agent.EventAgent'>, <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>, <class 'ray.dashboard.modules.job.job_agent.JobAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>, <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>]
2024-04-29 18:24:44,842 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2024-04-29 18:24:44,842 INFO event_agent.py:38 -- Event agent cache buffer size: 10240
2024-04-29 18:24:44,843 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>
2024-04-29 18:24:44,843 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.job.job_agent.JobAgent'>
2024-04-29 18:24:44,843 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2024-04-29 18:24:44,843 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>
2024-04-29 18:24:44,843 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2024-04-29 18:24:44,845 INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>
2024-04-29 18:24:44,845 INFO agent.py:159 -- Loaded 7 modules.
2024-04-29 18:24:44,847 INFO http_server_agent.py:71 -- Dashboard agent http address: 0.0.0.0:52365
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/local_raylet_healthz> -> <function HealthzAgent.health_check at 0x72a0d87d7250>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/local_raylet_healthz> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [POST] <PlainResource  /api/job_agent/jobs/> -> <function JobAgent.submit_job at 0x72a0d8617e20>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/job_agent/jobs/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [POST] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <function JobAgent.stop_job at 0x72a0d8644040>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,847 INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <function JobAgent.delete_job at 0x72a0d86441f0>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <function JobAgent.get_job_logs at 0x72a0d86443a0>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <function JobAgent.tail_job_logs at 0x72a0d8644550>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/ray/version> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_version at 0x72a0bdf87520>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/ray/version> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_serve_instance_details at 0x72a0bdf875b0>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.delete_serve_applications at 0x72a0bdf877f0>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [PUT] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.put_all_applications at 0x72a0bdf879a0>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/applications/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-04-29_18-24-41_106247_55/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-04-29_18-24-41_106247_55/logs')>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-04-29_18-24-41_106247_55/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x72a0d81a2650>>
2024-04-29 18:24:44,848 INFO http_server_agent.py:79 -- Registered 20 routes.
2024-04-29 18:24:44,849 INFO process_watcher.py:46 -- raylet pid is 182
2024-04-29 18:24:44,850 INFO process_watcher.py:66 -- check_parent_via_pipe
2024-04-29 18:24:44,851 INFO event_agent.py:56 -- Report events to 172.17.0.8:43209
2024-04-29 18:24:44,852 INFO event_utils.py:132 -- Monitor events logs modified after 1714413284.5424106 on /tmp/ray/session_2024-04-29_18-24-41_106247_55/logs/events, the source types are all.
2024-04-29 18:25:46,273 INFO agent.py:220 -- Terminated Raylet: ip=172.17.0.8, node_id=dfd7a71b22571decfab7691850aba504768a24031b3bc32f75ae14b0. _check_parent_via_pipe: The parent is dead.
2024-04-29 18:25:46,273 ERROR process_watcher.py:116 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 216.676 us, total = 216.676 us, Queueing time: mean = 14.018 us, max = 14.018 us, min = 14.018 us, total = 14.018 us
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 7.143 us, total = 7.143 us, Queueing time: mean = 12.443 us, max = 12.443 us, min = 12.443 us, total = 12.443 us
    [state-dump]        MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 387.533 us, total = 387.533 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.013 s, total = 1.013 s, Queueing time: mean = 16.459 us, max = 16.459 us, min = 16.459 us, total = 16.459 us
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 438.650 us, total = 438.650 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 997.997 us, total = 997.997 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2024-04-29 18:24:44,086 I 182 182] (raylet) accessor.cc:627: Received notification for node id = dfd7a71b22571decfab7691850aba504768a24031b3bc32f75ae14b0, IsAlive = 1

2024-04-29 18:25:47,477 ERROR thread.py:58 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2024-04-29 18:26:46,288 ERROR utils.py:211 -- Failed to publish error: Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 216.676 us, total = 216.676 us, Queueing time: mean = 14.018 us, max = 14.018 us, min = 14.018 us, total = 14.018 us
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 7.143 us, total = 7.143 us, Queueing time: mean = 12.443 us, max = 12.443 us, min = 12.443 us, total = 12.443 us
    [state-dump]        MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 387.533 us, total = 387.533 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.013 s, total = 1.013 s, Queueing time: mean = 16.459 us, max = 16.459 us, min = 16.459 us, total = 16.459 us
    [state-dump]        NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 438.650 us, total = 438.650 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump]        NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 997.997 us, total = 997.997 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 
    [2024-04-29 18:24:44,086 I 182 182] (raylet) accessor.cc:627: Received notification for node id = dfd7a71b22571decfab7691850aba504768a24031b3bc32f75ae14b0, IsAlive = 1
 [type raylet_died]
Traceback (most recent call last):
  File "/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/_private/utils.py", line 207, in publish_error_to_driver
    gcs_publisher.publish_error(
  File "python/ray/_raylet.pyx", line 3036, in ray._raylet.GcsPublisher.publish_error
  File "python/ray/_raylet.pyx", line 573, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Failed to publish after retries: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.8:6379: Failed to connect to remote host: Connection refused
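The traceback ends with a refused connection to the GCS address (172.17.0.8:6379), so the GCS server itself was down by that point. A minimal, hypothetical probe (the helper name is made up; host and port are taken from the error above) to confirm from inside the container whether the GCS port is reachable:

```python
import socket

def gcs_port_open(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the given GCS address succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the failing container this would print False, matching the
# "Connection refused" in the traceback above.
print(gcs_port_open("172.17.0.8"))
```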

Versions / Dependencies

Python 3.10.2

cat /etc/os-release
NAME="Common Base Linux Mariner"
VERSION="2.0.20240223"
ID=mariner
VERSION_ID="2.0"
PRETTY_NAME="CBL-Mariner/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://aka.ms/cbl-mariner"
BUG_REPORT_URL="https://aka.ms/cbl-mariner"
SUPPORT_URL="https://aka.ms/cbl-mariner"

ray 2.11.0

Reproduction script

The image is built with internal company dependencies, so I cannot share repro steps externally. I understand that it's hard to investigate without a repro; I'm sharing this in case the Ray team has seen similar issues before, or can tell me how to debug the dashboard failure.

Issue Severity

High: It blocks me from completing my task.

hebiao064 avatar Apr 29 '24 18:04 hebiao064

I noticed a similar issue: https://github.com/ray-project/ray/issues/31261, but we are not building ray from source.

hebiao064 avatar Apr 29 '24 18:04 hebiao064

cc @alanwguo

kevin85421 avatar Apr 29 '24 19:04 kevin85421

Looks like GCS is dead. Can you check whether there is a running gcs_server process? Can you also look for a gcs_server.out log?
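For anyone following along, those files live under the Ray session log directory (/tmp/ray/session_latest/logs is the standard Ray layout; the helper name below is made up). A small sketch for locating them:

```python
from pathlib import Path

def find_gcs_logs(session_logs: str = "/tmp/ray/session_latest/logs") -> list[str]:
    """List gcs_server log files (gcs_server.out / gcs_server.err) if present."""
    logs = Path(session_logs)
    return sorted(p.name for p in logs.glob("gcs_server*")) if logs.is_dir() else []

for name in find_gcs_logs():
    print(name)
```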

alanwguo avatar Apr 29 '24 20:04 alanwguo

@alanwguo Here are the logs, thanks!

jobuser [ /tmp/ray/session_latest/logs ]$ cat gcs_server.err
jobuser [ /tmp/ray/session_latest/logs ]$ cat gcs_server.out
[2024-04-29 21:39:47,762 I 57 57] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-04-29 21:39:47,763 I 57 57] (gcs_server) event.cc:234: Set ray event level to warning
[2024-04-29 21:39:47,763 I 57 57] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-04-29 21:39:47,764 I 57 57] (gcs_server) gcs_server.cc:162: No existing server cluster ID found. Generating new ID: 6f68eb153a4988eeedd86a2d23f63be32aa3d4d94cdb167746589496
[2024-04-29 21:39:47,765 I 57 57] (gcs_server) gcs_server.cc:648: Autoscaler V2 enabled: 0
[2024-04-29 21:39:47,766 I 57 57] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 6379.
[2024-04-29 21:39:47,779 I 57 57] (gcs_server) gcs_server.cc:250: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager:
- GetAllAvailableResources request count0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2024-04-29 21:39:47,779 I 57 57] (gcs_server) gcs_server.cc:844: Event stats:


Global stats: 21 total (10 active)
Queueing time: mean = 2.096 ms, max = 14.874 ms, min = 1.302 us, total = 44.026 ms
Execution time:  mean = 710.558 us, total = 14.922 ms
Event stats:
	InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 5.073 us, total = 25.367 us, Queueing time: mean = 22.788 us, max = 23.974 us, min = 21.430 us, total = 113.942 us
	PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 821.500 ns, total = 3.286 us, Queueing time: mean = 7.431 ms, max = 14.874 ms, min = 14.849 ms, total = 29.723 ms
	GcsInMemoryStore.Put - 3 total (0 active), Execution time: mean = 4.962 ms, total = 14.886 ms, Queueing time: mean = 4.729 ms, max = 14.184 ms, min = 1.302 us, total = 14.188 ms
	ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 7.244 us, total = 7.244 us, Queueing time: mean = 1.843 us, max = 1.843 us, min = 1.843 us, total = 1.843 us


[2024-04-29 21:39:47,779 I 57 57] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:


Global stats: 2 total (1 active)
Queueing time: mean = 153.376 us, max = 306.752 us, min = 306.752 us, total = 306.752 us
Execution time:  mean = 3.762 us, total = 7.524 us
Event stats:
	GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 7.524 us, total = 7.524 us, Queueing time: mean = 306.752 us, max = 306.752 us, min = 306.752 us, total = 306.752 us


[2024-04-29 21:39:49,621 I 57 57] (gcs_server) gcs_node_manager.cc:55: Registering node info, node id = a4b60b14d7d6f3bc891c43795fcb9fe6bd440a6219d537b8821ce87c, address = 172.17.0.8, node name = 172.17.0.8
[2024-04-29 21:39:49,621 I 57 57] (gcs_server) gcs_node_manager.cc:61: Finished registering node info, node id = a4b60b14d7d6f3bc891c43795fcb9fe6bd440a6219d537b8821ce87c, address = 172.17.0.8, node name = 172.17.0.8
[2024-04-29 21:39:49,621 I 57 57] (gcs_server) gcs_placement_group_manager.cc:797: A new node: a4b60b14d7d6f3bc891c43795fcb9fe6bd440a6219d537b8821ce87c registered, will try to reschedule all the infeasible placement groups.

hebiao064 avatar Apr 29 '24 21:04 hebiao064

Thanks for reporting the issue. I can definitely help you debug it further.

Unfortunately, I couldn't determine the root cause of the failure from the logs. It seems that dashboard_agent couldn't connect to anything, or the raylet was terminated unexpectedly.

If you can share all of the head node's logs, that would be great. If you can provide a repro script without your private dependencies, that would be highly appreciated!

hongchaodeng avatar Apr 30 '24 18:04 hongchaodeng

@hongchaodeng

Hi Hongchao, thanks for the reply. Here are the logs:

jobuser [ /tmp/ray/session_latest/logs ]$ for f in *.{log,err,out}; do echo "********************** $f: *********************"; cat "$f"; echo; done
********************** dashboard.log: *********************
2024-05-03 18:47:22,929	INFO head.py:150 -- Dashboard head grpc address: 0.0.0.0:44065
2024-05-03 18:47:22,938	INFO head.py:254 -- Starting dashboard metrics server on port 44227
2024-05-03 18:47:22,940	INFO utils.py:112 -- Get all modules by type: DashboardHeadModule
2024-05-03 18:47:23,411	INFO utils.py:145 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>, <class 'ray.dashboard.modules.data.data_head.DataHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.state.state_head.StateHead'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2024-05-03 18:47:23,412	INFO head.py:223 -- Modules to load: {'APIHead', 'JobHead', 'HealthzHead', 'ActorHead', 'MetricsHead', 'EventHead', 'ServeRestApiImpl', 'ReportHead', 'DataHead', 'NodeHead', 'UsageStatsHead', 'StateHead'}
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.data.data_head.DataHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>
2024-05-03 18:47:23,412	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>
2024-05-03 18:47:23,413	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>
2024-05-03 18:47:23,413	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.state.state_head.StateHead'>
2024-05-03 18:47:23,413	INFO head.py:226 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>
2024-05-03 18:47:23,413	INFO head.py:239 -- Loaded 12 modules. [<ray.dashboard.modules.actor.actor_head.ActorHead object at 0x7ddffe12b100>, <ray.dashboard.modules.metrics.metrics_head.MetricsHead object at 0x7ddffe12b190>, <ray.dashboard.modules.data.data_head.DataHead object at 0x7ddfeeb56ce0>, <ray.dashboard.modules.event.event_head.EventHead object at 0x7ddfeeb54190>, <ray.dashboard.modules.healthz.healthz_head.HealthzHead object at 0x7ddfeeb54100>, <ray.dashboard.modules.job.job_head.JobHead object at 0x7ddfeeb57fa0>, <ray.dashboard.modules.node.node_head.NodeHead object at 0x7ddfeeb56bc0>, <ray.dashboard.modules.reporter.reporter_head.ReportHead object at 0x7ddfeeb56ad0>, <ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl object at 0x7ddfeeb56a40>, <ray.dashboard.modules.snapshot.snapshot_head.APIHead object at 0x7ddfeeb54a90>, <ray.dashboard.modules.state.state_head.StateHead object at 0x7ddfeeb57ac0>, <ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead object at 0x7ddfeeb57790>]
2024-05-03 18:47:23,413	INFO head.py:329 -- Initialize the http server.
2024-05-03 18:47:23,413	INFO http_server_head.py:81 -- Setup static dir for dashboard: /home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/dashboard/client/build
2024-05-03 18:47:23,417	INFO http_server_head.py:249 -- Dashboard head http address: 172.17.0.8:8265
2024-05-03 18:47:23,417	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /logical/actors> -> <function ActorHead.get_all_actors[cache ttl=2, max_size=128] at 0x7ddff85be8c0>
2024-05-03 18:47:23,417	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /logical/actors/{actor_id}> -> <function ActorHead.get_actor[cache ttl=2, max_size=128] at 0x7ddff85bea70>
2024-05-03 18:47:23,417	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/grafana_health> -> <function MetricsHead.grafana_health at 0x7ddff69b5900>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/prometheus_health> -> <function MetricsHead.prometheus_health at 0x7ddff69b5a20>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/data/datasets/{job_id}> -> <function DataHead.get_datasets at 0x7ddff69b5ea0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /events> -> <function EventHead.get_event[cache ttl=2, max_size=128] at 0x7ddff69b7400>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/gcs_healthz> -> <function HealthzHead.health_check at 0x7ddff69b7c70>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/version> -> <function JobHead.get_version at 0x7ddff5b25990>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.get_package at 0x7ddff5b25ab0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [PUT] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.upload_package at 0x7ddff5b25bd0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [POST] <PlainResource  /api/jobs/> -> <function JobHead.submit_job at 0x7ddff5b25cf0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [POST] <DynamicResource  /api/jobs/{job_or_submission_id}/stop> -> <function JobHead.stop_job at 0x7ddff5b25e10>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [DELETE] <DynamicResource  /api/jobs/{job_or_submission_id}> -> <function JobHead.delete_job at 0x7ddff5b25f30>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_or_submission_id}> -> <function JobHead.get_job_info at 0x7ddff5b26050>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/jobs/> -> <function JobHead.list_jobs at 0x7ddff5b26170>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_or_submission_id}/logs> -> <function JobHead.get_job_logs at 0x7ddff5b26290>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_or_submission_id}/logs/tail> -> <function JobHead.tail_job_logs at 0x7ddff5b263b0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /internal/node_module> -> <function NodeHead.get_node_module_internal_state at 0x7ddff4564c10>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /nodes> -> <function NodeHead.get_all_nodes[cache ttl=2, max_size=128] at 0x7ddff4564dc0>
2024-05-03 18:47:23,418	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /nodes/{node_id}> -> <function NodeHead.get_node[cache ttl=2, max_size=128] at 0x7ddff4564f70>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/cluster_metadata> -> <function ReportHead.get_cluster_metadata at 0x7ddfed566c20>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x7ddfed566d40>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /task/traceback> -> <function ReportHead.get_task_traceback at 0x7ddfed566f80>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /task/cpu_profile> -> <function ReportHead.get_task_cpu_profile at 0x7ddfed5670a0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /worker/traceback> -> <function ReportHead.get_traceback at 0x7ddfed5671c0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /worker/cpu_profile> -> <function ReportHead.cpu_profile at 0x7ddfed5672e0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /memory_profile> -> <function ReportHead.memory_profile at 0x7ddfed567400>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/ray/version> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_version at 0x7ddfed5785e0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_serve_instance_details at 0x7ddfed578670>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.delete_serve_applications at 0x7ddfed5788b0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [PUT] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.put_all_applications at 0x7ddfed578a60>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/actors/kill> -> <function APIHead.kill_actor_gcs at 0x7ddfed579630>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/component_activities> -> <function APIHead.get_component_activities at 0x7ddfed579750>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/actors> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a320>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/jobs> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a4d0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/nodes> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a680>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/placement_groups> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a830>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/workers> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddff27f1ea0>
2024-05-03 18:47:23,419	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/tasks> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a170>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/objects> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57a9e0>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/runtime_envs> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57ab90>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/cluster_events> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57ad40>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/logs> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57aef0>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/v0/logs/{media_type}> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57b0a0>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/tasks/summarize> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57b2e0>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/actors/summarize> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57b490>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/objects/summarize> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57b640>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /api/v0/tasks/timeline> -> <function RateLimitedModule.enforce_max_concurrent_calls.<locals>.async_wrapper at 0x7ddfed57b7f0>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <DynamicResource  /api/v0/delay/{delay_s}> -> <function StateHead.delayed_response at 0x7ddfed57b910>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /usage_stats_enabled> -> <function UsageStatsHead.get_usage_stats_enabled at 0x7ddfed57bc70>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /> -> <function HttpServerDashboardHead.get_index at 0x7ddfeeb50700>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <PlainResource  /favicon.ico> -> <function HttpServerDashboardHead.get_favicon at 0x7ddfeeb50820>
2024-05-03 18:47:23,420	INFO http_server_head.py:255 -- <ResourceRoute [GET] <StaticResource  /static -> PosixPath('/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource  /static -> PosixPath('/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/dashboard/client/build/static')>>
2024-05-03 18:47:23,420	INFO http_server_head.py:256 -- Registered 53 routes.
2024-05-03 18:47:23,420	INFO head.py:332 -- http server initialized at 172.17.0.8:8265
2024-05-03 18:47:23,448	INFO event_utils.py:132 -- Monitor events logs modified after 1714760243.0728657 on /tmp/ray/session_2024-05-03_18-47-21_547848_55/logs/events, the source types are all.
2024-05-03 18:47:23,450	INFO usage_stats_head.py:191 -- Usage reporting is disabled.
2024-05-03 18:47:23,451	INFO actor_head.py:101 -- Getting all actor info from GCS.
2024-05-03 18:47:23,456	INFO actor_head.py:123 -- Received 0 actor info from GCS.

********************** dashboard_agent.log: *********************
2024-05-03 18:47:24,879	INFO agent.py:138 -- Dashboard agent grpc address: 0.0.0.0:57408
2024-05-03 18:47:24,881	INFO utils.py:112 -- Get all modules by type: DashboardAgentModule
2024-05-03 18:47:25,361	INFO utils.py:145 -- Available modules: [<class 'ray.dashboard.modules.event.event_agent.EventAgent'>, <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>, <class 'ray.dashboard.modules.job.job_agent.JobAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>, <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>]
2024-05-03 18:47:25,361	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2024-05-03 18:47:25,362	INFO event_agent.py:38 -- Event agent cache buffer size: 10240
2024-05-03 18:47:25,362	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>
2024-05-03 18:47:25,362	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.job.job_agent.JobAgent'>
2024-05-03 18:47:25,362	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2024-05-03 18:47:25,362	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>
2024-05-03 18:47:25,362	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2024-05-03 18:47:25,364	INFO agent.py:154 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>
2024-05-03 18:47:25,364	INFO agent.py:159 -- Loaded 7 modules.
2024-05-03 18:47:25,366	INFO http_server_agent.py:71 -- Dashboard agent http address: 0.0.0.0:52365
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/local_raylet_healthz> -> <function HealthzAgent.health_check at 0x7e87642cf250>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/local_raylet_healthz> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [POST] <PlainResource  /api/job_agent/jobs/> -> <function JobAgent.submit_job at 0x7e8764107e20>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/job_agent/jobs/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [POST] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <function JobAgent.stop_job at 0x7e876413c040>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <function JobAgent.delete_job at 0x7e876413c1f0>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <function JobAgent.get_job_logs at 0x7e876413c3a0>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <function JobAgent.tail_job_logs at 0x7e876413c550>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/ray/version> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_version at 0x7e872c77f400>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/ray/version> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,367	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_serve_instance_details at 0x7e872c77f490>
2024-05-03 18:47:25,368	INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.delete_serve_applications at 0x7e872c77f6d0>
2024-05-03 18:47:25,368	INFO http_server_agent.py:78 -- <ResourceRoute [PUT] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.put_all_applications at 0x7e872c77f880>
2024-05-03 18:47:25,368	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/applications/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,368	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-05-03_18-47-21_547848_55/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-05-03_18-47-21_547848_55/logs')>>
2024-05-03 18:47:25,368	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2024-05-03_18-47-21_547848_55/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7e8744156830>>
2024-05-03 18:47:25,368	INFO http_server_agent.py:79 -- Registered 20 routes.
2024-05-03 18:47:25,369	INFO process_watcher.py:46 -- raylet pid is 182
2024-05-03 18:47:25,369	INFO process_watcher.py:66 -- check_parent_via_pipe
2024-05-03 18:47:25,370	INFO event_agent.py:56 -- Report events to 172.17.0.8:44065
2024-05-03 18:47:25,371	INFO event_utils.py:132 -- Monitor events logs modified after 1714760245.060608 on /tmp/ray/session_2024-05-03_18-47-21_547848_55/logs/events, the source types are all.
2024-05-03 18:48:26,722	INFO agent.py:220 -- Terminated Raylet: ip=172.17.0.8, node_id=7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6. _check_parent_via_pipe: The parent is dead.
2024-05-03 18:48:26,722	ERROR process_watcher.py:116 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 12.661 us, max = 12.661 us, min = 12.661 us, total = 12.661 us
    [state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.015 ms, total = 1.015 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 183.694 us, total = 183.694 us, Queueing time: mean = 14.684 us, max = 14.684 us, min = 14.684 us, total = 14.684 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 871.803 us, total = 871.803 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.223 us, total = 1.223 us, Queueing time: mean = 12.640 us, max = 12.640 us, min = 12.640 us, total = 12.640 us
    [state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 973.904 us, total = 973.904 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2024-05-03 18:47:24,576 I 182 182] (raylet) accessor.cc:627: Received notification for node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, IsAlive = 1

2024-05-03 18:48:27,997	ERROR thread.py:58 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2024-05-03 18:49:26,737	ERROR utils.py:211 -- Failed to publish error: Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 12.661 us, max = 12.661 us, min = 12.661 us, total = 12.661 us
    [state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.015 ms, total = 1.015 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 183.694 us, total = 183.694 us, Queueing time: mean = 14.684 us, max = 14.684 us, min = 14.684 us, total = 14.684 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 871.803 us, total = 871.803 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.223 us, total = 1.223 us, Queueing time: mean = 12.640 us, max = 12.640 us, min = 12.640 us, total = 12.640 us
    [state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 973.904 us, total = 973.904 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2024-05-03 18:47:24,576 I 182 182] (raylet) accessor.cc:627: Received notification for node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, IsAlive = 1
 [type raylet_died]
Traceback (most recent call last):
  File "/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/_private/utils.py", line 207, in publish_error_to_driver
    gcs_publisher.publish_error(
  File "python/ray/_raylet.pyx", line 3036, in ray._raylet.GcsPublisher.publish_error
  File "python/ray/_raylet.pyx", line 573, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Failed to publish after retries: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.8:6379: Failed to connect to remote host: Connection refused

********************** log_monitor.log: *********************
2024-05-03 18:47:23,857	INFO log_monitor.py:155 -- Starting log monitor with [max open files=200], [is_autoscaler_v2=False]
2024-05-03 18:47:23,858	INFO log_monitor.py:273 -- Beginning to track file raylet.err
2024-05-03 18:47:23,858	INFO log_monitor.py:273 -- Beginning to track file monitor.log
2024-05-03 18:47:23,858	INFO log_monitor.py:273 -- Beginning to track file gcs_server.err

********************** monitor.log: *********************
2024-05-03 18:47:22,920	INFO monitor.py:688 -- Starting monitor using ray installation: /home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/__init__.py
2024-05-03 18:47:22,920	INFO monitor.py:689 -- Ray version: 2.11.0
2024-05-03 18:47:22,920	INFO monitor.py:690 -- Ray commit: 2eb4a8119f903f79b78c01eaa9db06c6c390051c
2024-05-03 18:47:22,920	INFO monitor.py:691 -- Monitor started with command: ['/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2024-05-03_18-47-21_547848_55/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=172.17.0.8:6379', '--monitor-ip=172.17.0.8']
2024-05-03 18:47:22,926	INFO monitor.py:159 -- session_name: session_2024-05-03_18-47-21_547848_55
2024-05-03 18:47:22,927	INFO monitor.py:191 -- Starting autoscaler metrics server on port 44217
2024-05-03 18:47:22,928	INFO monitor.py:216 -- Monitor: Started
2024-05-03 18:47:22,934	INFO autoscaler.py:280 -- disable_node_updaters:False
2024-05-03 18:47:22,935	INFO autoscaler.py:288 -- disable_launch_config_check:True
2024-05-03 18:47:22,935	INFO autoscaler.py:300 -- foreground_node_launch:False
2024-05-03 18:47:22,935	INFO autoscaler.py:310 -- worker_liveness_check:True
2024-05-03 18:47:22,935	INFO autoscaler.py:318 -- worker_rpc_drain:True
2024-05-03 18:47:22,936	INFO autoscaler.py:368 -- StandardAutoscaler: {'cluster_name': 'default', 'max_workers': 0, 'upscaling_speed': 1.0, 'docker': {}, 'idle_timeout_minutes': 0, 'provider': {'type': 'readonly', 'use_node_id_as_ip': True, 'disable_launch_config_check': True}, 'auth': {}, 'available_node_types': {'ray.head.default': {'resources': {}, 'node_config': {}, 'max_workers': 0}}, 'head_node_type': 'ray.head.default', 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': [], 'rsync_filter': [], 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': []}
2024-05-03 18:47:22,937	INFO monitor.py:383 -- Autoscaler has not yet received load metrics. Waiting.

********************** runtime_env_agent.log: *********************
2024-05-03 18:47:24,956	INFO runtime_env_agent.py:241 -- Starting runtime env agent at pid 236
2024-05-03 18:47:24,956	INFO runtime_env_agent.py:242 -- Parent raylet pid is 182
2024-05-03 18:47:24,956	INFO runtime_env_agent.py:243 -- Listening to address 172.17.0.8, port 51763
2024-05-03 18:47:26,661	INFO main.py:198 -- Raylet is dead! Exiting Runtime Env Agent. addr: 172.17.0.8, port: 51763
_check_parent_via_pipe: The parent is dead.
2024-05-03 18:48:26,679	INFO main.py:220 -- SystemExit! 0

********************** agent-424238335.err: *********************

********************** dashboard.err: *********************

********************** gcs_server.err: *********************

********************** log_monitor.err: *********************

********************** monitor.err: *********************

********************** ray_client_server.err: *********************
2024-05-03 18:47:23,178	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='172.17.0.8:6379', redis_password=None, runtime_env_agent_address='http://172.17.0.8:51763')

********************** raylet.err: *********************

********************** runtime_env_agent.err: *********************
Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 12.661 us, max = 12.661 us, min = 12.661 us, total = 12.661 us
    [state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.015 ms, total = 1.015 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 183.694 us, total = 183.694 us, Queueing time: mean = 14.684 us, max = 14.684 us, min = 14.684 us, total = 14.684 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 871.803 us, total = 871.803 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.223 us, total = 1.223 us, Queueing time: mean = 12.640 us, max = 12.640 us, min = 12.640 us, total = 12.640 us
    [state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 973.904 us, total = 973.904 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2024-05-03 18:47:24,576 I 182 182] (raylet) accessor.cc:627: Received notification for node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, IsAlive = 1

Failed to publish error: Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 12.661 us, max = 12.661 us, min = 12.661 us, total = 12.661 us
    [state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.015 ms, total = 1.015 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 183.694 us, total = 183.694 us, Queueing time: mean = 14.684 us, max = 14.684 us, min = 14.684 us, total = 14.684 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 871.803 us, total = 871.803 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.223 us, total = 1.223 us, Queueing time: mean = 12.640 us, max = 12.640 us, min = 12.640 us, total = 12.640 us
    [state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 973.904 us, total = 973.904 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2024-05-03 18:47:24,576 I 182 182] (raylet) accessor.cc:627: Received notification for node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, IsAlive = 1
 [type raylet_died]
Traceback (most recent call last):
  File "/home/jobuser/build/cate-estimation/environments/satellites/python/lib/python3.10/site-packages/ray/_private/utils.py", line 207, in publish_error_to_driver
    gcs_publisher.publish_error(
  File "python/ray/_raylet.pyx", line 3036, in ray._raylet.GcsPublisher.publish_error
  File "python/ray/_raylet.pyx", line 573, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: Failed to publish after retries: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.17.0.8:6379: Failed to connect to remote host: Connection refused

********************** agent-424238335.out: *********************

********************** gcs_server.out: *********************
[2024-05-03 18:47:21,584 I 57 57] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-05-03 18:47:21,594 I 57 57] (gcs_server) event.cc:234: Set ray event level to warning
[2024-05-03 18:47:21,594 I 57 57] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2024-05-03 18:47:21,594 I 57 57] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-05-03 18:47:21,597 I 57 57] (gcs_server) gcs_server.cc:162: No existing server cluster ID found. Generating new ID: 34786fe3933425dbe7e0ba581406f5931c764d82aaa0ded56f348768
[2024-05-03 18:47:21,599 I 57 57] (gcs_server) gcs_server.cc:648: Autoscaler V2 enabled: 0
[2024-05-03 18:47:21,600 I 57 57] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 6379.
[2024-05-03 18:47:21,618 I 57 57] (gcs_server) gcs_server.cc:250: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager:
- GetAllAvailableResources request count0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0


[2024-05-03 18:47:21,618 I 57 57] (gcs_server) gcs_server.cc:844: Event stats:


Global stats: 21 total (10 active)
Queueing time: mean = 2.818 ms, max = 20.543 ms, min = 1.423 us, total = 59.169 ms
Execution time:  mean = 1.005 ms, total = 21.097 ms
Event stats:
	InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 5.051 us, total = 25.257 us, Queueing time: mean = 33.022 us, max = 50.214 us, min = 21.080 us, total = 165.109 us
	PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 1.290 us, total = 5.160 us, Queueing time: mean = 10.113 ms, max = 20.543 ms, min = 19.908 ms, total = 40.451 ms
	GcsInMemoryStore.Put - 3 total (0 active), Execution time: mean = 7.020 ms, total = 21.059 ms, Queueing time: mean = 6.184 ms, max = 18.547 ms, min = 1.423 us, total = 18.551 ms
	RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 7.564 us, total = 7.564 us, Queueing time: mean = 1.733 us, max = 1.733 us, min = 1.733 us, total = 1.733 us


[2024-05-03 18:47:21,618 I 57 57] (gcs_server) gcs_server.cc:845: GcsTaskManager Event stats:


Global stats: 2 total (1 active)
Queueing time: mean = 158.688 us, max = 317.376 us, min = 317.376 us, total = 317.376 us
Execution time:  mean = 3.426 us, total = 6.853 us
Event stats:
	GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
	PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 6.853 us, total = 6.853 us, Queueing time: mean = 317.376 us, max = 317.376 us, min = 317.376 us, total = 317.376 us


[2024-05-03 18:47:24,573 I 57 57] (gcs_server) gcs_node_manager.cc:55: Registering node info, node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, address = 172.17.0.8, node name = 172.17.0.8
[2024-05-03 18:47:24,573 I 57 57] (gcs_server) gcs_node_manager.cc:61: Finished registering node info, node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, address = 172.17.0.8, node name = 172.17.0.8
[2024-05-03 18:47:24,573 I 57 57] (gcs_server) gcs_placement_group_manager.cc:797: A new node: 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6 registered, will try to reschedule all the infeasible placement groups.

********************** monitor.out: *********************

********************** ray_client_server.out: *********************

********************** raylet.out: *********************
[2024-05-03 18:47:23,542 I 182 182] (raylet) main.cc:176: Setting cluster ID to: 34786fe3933425dbe7e0ba581406f5931c764d82aaa0ded56f348768
[2024-05-03 18:47:23,555 I 182 182] (raylet) main.cc:239: Raylet is not set to kill unknown children.
[2024-05-03 18:47:23,556 I 182 182] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-05-03 18:47:23,559 I 182 182] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 9.78537GB of memory.
[2024-05-03 18:47:23,559 I 182 182] (raylet) store_runner.cc:48: Starting object store with directory /tmp, fallback /tmp/ray, and huge page support disabled
[2024-05-03 18:47:23,560 I 182 206] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(9785442312, /tmp/plasmaXXXXXX)
[2024-05-03 18:47:23,562 I 182 206] (raylet) store.cc:564: ========== Plasma store: =================
Current usage: 0 / 9.78537 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

[2024-05-03 18:47:24,562 I 182 182] (raylet) grpc_server.cc:129: ObjectManager server started, listening on port 42643.
[2024-05-03 18:47:24,563 I 182 182] (raylet) worker_killing_policy.cc:101: Running GroupByOwner policy.
[2024-05-03 18:47:24,564 I 182 182] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 31665160192 bytes (0.95 system memory), total system memory bytes: 33331748864
[2024-05-03 18:47:24,564 I 182 182] (raylet) node_manager.cc:282: Initializing NodeManager with ID 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6
[2024-05-03 18:47:24,565 I 182 182] (raylet) grpc_server.cc:129: NodeManager server started, listening on port 36095.
[2024-05-03 18:47:24,570 I 182 235] (raylet) agent_manager.cc:78: Monitor agent process with name dashboard_agent/424238335
[2024-05-03 18:47:24,571 I 182 237] (raylet) agent_manager.cc:78: Monitor agent process with name runtime_env_agent
[2024-05-03 18:47:24,572 I 182 182] (raylet) event.cc:234: Set ray event level to warning
[2024-05-03 18:47:24,572 I 182 182] (raylet) event.cc:342: Ray Event initialized for RAYLET
[2024-05-03 18:47:24,573 I 182 182] (raylet) raylet.cc:128: Raylet of id, 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6 started. Raylet consists of node_manager and object_manager. node_manager address: 172.17.0.8:36095 object_manager address: 172.17.0.8:42643 hostname: ba82c9c153df
[2024-05-03 18:47:24,576 I 182 182] (raylet) node_manager.cc:502: [state-dump] NodeManager:
[state-dump] Node ID: 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6
[state-dump] Node name: 172.17.0.8
[state-dump] InitialConfigResources: {node:__internal_head__: 10000, object_store_memory: 97853657080000, CPU: 80000, memory: 195707314190000, node:172.17.0.8: 10000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: -8079360113863020394 Local resources: {"total":{memory: [195707314190000], CPU: [80000], node:__internal_head__: [10000], node:172.17.0.8: [10000], object_store_memory: [97853657080000]}}, "available": {memory: [195707314190000], CPU: [80000], node:__internal_head__: [10000], node:172.17.0.8: [10000], object_store_memory: [97853657080000]}}, "labels":{"ray.io/node_id":"7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6",} is_draining: 0 is_idle: 1 Cluster resources: node id: -8079360113863020394{"total":{node:__internal_head__: 10000, node:172.17.0.8: 10000, object_store_memory: 97853657080000, memory: 195707314190000, CPU: 80000}}, "available": {node:__internal_head__: 10000, node:172.17.0.8: 10000, object_store_memory: 97853657080000, memory: 195707314190000, CPU: 80000}}, "labels":{"ray.io/node_id":"7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time:  mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 9785365708
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 28 total (13 active)
[state-dump] Queueing time: mean = 1.228 ms, max = 8.544 ms, min = 12.640 us, total = 34.382 ms
[state-dump] Execution time:  mean = 36.544 ms, total = 1.023 s
[state-dump] Event stats:
[state-dump] 	PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 215.439 us, total = 2.370 ms, Queueing time: mean = 3.122 ms, max = 8.544 ms, min = 19.777 us, total = 34.342 ms
[state-dump] 	ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.018 s, total = 1.018 s, Queueing time: mean = 12.661 us, max = 12.661 us, min = 12.661 us, total = 12.661 us
[state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.015 ms, total = 1.015 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 183.694 us, total = 183.694 us, Queueing time: mean = 14.684 us, max = 14.684 us, min = 14.684 us, total = 14.684 us
[state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 871.803 us, total = 871.803 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 1.223 us, total = 1.223 us, Queueing time: mean = 12.640 us, max = 12.640 us, min = 12.640 us, total = 12.640 us
[state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 973.904 us, total = 973.904 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
[2024-05-03 18:47:24,576 I 182 182] (raylet) accessor.cc:627: Received notification for node id = 7af1edee4d23c34e57a88f6a76f124a5539eed534942994fa99c94c6, IsAlive = 1

********************** runtime_env_agent.out: *********************
======== Running on http://0.0.0.0:51763 ========
(Press CTRL+C to quit)

jobuser [ /tmp/ray/session_latest/logs/events ]$ for f in *.{log,err,out}; do echo "********************** $f: *********************"; cat "$f"; echo; done
********************** event_AUTOSCALER.log: *********************

********************** event_GCS.log: *********************

********************** event_RAYLET.log: *********************

********************** *.err: *********************
cat: '*.err': No such file or directory

********************** *.out: *********************
cat: '*.out': No such file or directory

I will try to provide a repro script. I also noticed that we are using aiohttp 3.9.3, while Ray requires >=3.7 and <3.9. Not sure whether that's the root cause; I'll try pinning it down and then rerun the ray command: ray start --head --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --disable-usage-stats --verbose

pip list | grep aio
aiobotocore                            2.7.0
aiodns                                 2.0.0
aiohttp                                3.9.3
aiohttp-cors                           0.7.0
aioitertools                           0.11.0
aiosignal                              1.3.1
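As a quick sanity check, that version comparison can be scripted. This is just an illustrative sketch; the `>=3.7,<3.9` bound is taken from the comment above and should be verified against the installed Ray's own dependency metadata:

```python
def parse_minor(ver: str) -> tuple:
    """Return the (major, minor) pair of a dotted version string."""
    parts = ver.split(".")
    return (int(parts[0]), int(parts[1]))

def in_range(ver: str, lo=(3, 7), hi=(3, 9)) -> bool:
    """Check whether a version satisfies a half-open [lo, hi) constraint,
    e.g. the '>=3.7,<3.9' aiohttp bound mentioned above."""
    return lo <= parse_minor(ver) < hi

# The installed aiohttp 3.9.3 falls outside the assumed range:
print(in_range("3.9.3"))  # False
print(in_range("3.8.6"))  # True
```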

hebiao064 · May 03 '24 19:05

OK, I tried uninstalling aiohttp and reinstalling 3.8; it still fails:

   pip uninstall aiohttp
   pip install aiohttp==3.8
   ray start --head  --metrics-export-port=8080  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --disable-usage-stats --verbose

Some Ray subprocesses exited unexpectedly:
  dashboard [exit code=-11]

Remaining processes will be killed.
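For context, a negative exit code from a Ray subprocess is the POSIX signal number that terminated it, so `-11` means the dashboard process received SIGSEGV (a segfault, consistent with reason (2) in the raylet termination message above). A small sketch for decoding such codes, following the CPython `subprocess.returncode` convention:

```python
import signal

def describe_exit(code: int) -> str:
    """Negative return codes mean the process was killed by a signal
    (CPython's subprocess convention); non-negative codes are normal exits."""
    if code < 0:
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by signal {-code}"
    return f"exited with status {code}"

print(describe_exit(-11))  # killed by SIGSEGV
```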

hebiao064 · May 03 '24 19:05

The GCS seems to be running, and I couldn't spot any obvious error at first glance. Could you share a repro script that we can use to reproduce the issue?

hongchaodeng · May 05 '24 05:05

We uninstalled many internal packages and the image is working now. Though we still don't know the root cause, it's no longer blocking us.

@hongchaodeng thanks for the help; we can close the issue.

hebiao064 · May 10 '24 17:05