
[core] Add 1s timeout in RPC to CoreWorkerService.NumPendingTasks in GcsJobManager::HandleGetAllJobInfo

Open rynewang opened this issue 7 months ago • 1 comment

The critical Dashboard API GET /api/jobs sends an RPC to JobInfoGcsService.GetAllJobInfo, where the GCS, for each job, sends an RPC to CoreWorkerService.NumPendingTasks to ask "how many running tasks do you have right now?". That information is not mission critical: in Ray and in Product nobody reads that field, other than in tests. But the Dashboard API itself is mission critical, so we set a 1s timeout on the inner RPC, and if it times out or fails, we just set is_running_tasks to false.
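To make the proposed behavior concrete, here is a minimal Python sketch of the "timeout, then fall back to false" pattern. It is not the GCS implementation (that lives in C++ in GcsJobManager::HandleGetAllJobInfo). The function names, the callable standing in for the NumPendingTasks RPC, and the rule that is_running_tasks means "pending count greater than zero" are all assumptions made for illustration.

```python
import asyncio


async def is_running_tasks_or_false(num_pending_tasks_rpc, timeout_s=1.0):
    """Hypothetical helper: call the inner RPC with a deadline.

    `num_pending_tasks_rpc` is an assumed zero-argument coroutine function
    standing in for CoreWorkerService.NumPendingTasks. On timeout or any
    RPC failure, report False instead of failing the outer request.
    """
    try:
        num_pending = await asyncio.wait_for(num_pending_tasks_rpc(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The worker did not answer within the deadline.
        return False
    except Exception:
        # The RPC failed for another reason (e.g. connection error).
        return False
    # Assumption: "running tasks" means a positive pending-task count.
    return num_pending > 0


async def collect_is_running_tasks(rpcs_by_job, timeout_s=1.0):
    """Query all jobs concurrently so one slow worker cannot stall the rest."""
    job_ids = list(rpcs_by_job)
    results = await asyncio.gather(
        *(is_running_tasks_or_false(rpcs_by_job[j], timeout_s) for j in job_ids)
    )
    return dict(zip(job_ids, results))
```

Because every inner call is bounded by the same deadline and the calls run concurrently, the outer request in this sketch takes roughly one timeout period in the worst case, regardless of how many workers are unresponsive.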

rynewang, Jun 28 '24 21:06