ray icon indicating copy to clipboard operation
ray copied to clipboard

Revert "Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo …"

Open Catch-Bull opened this issue 2 years ago • 4 comments

Reverts ray-project/ray#27613

Catch-Bull avatar Aug 09 '22 04:08 Catch-Bull

@rkooo567

  • windows://python/ray/tests:test_reference_counting: It seems to be related. After this PR, this test was slowed by about 80s, it cause timeout, I increased the timeout and it seems to work:(The failed one is python/ray/tests/test_gcs_pubsub.py::test_two_subscribers I will talk about it later) image
  • windows://python/ray/tests:test_queue: It seems not related: image
  • test_trainer and Test tune restore: I'm not sure, I think this PR is not the root case, but it will increase the ratio of failed.
  • python/ray/tests/test_gcs_pubsub.py::test_two_subscribers: I found that this test also has a low probability of failure. The root cause is that the monitoring process will initialize the StandardAutoscaler, and some logs will be printed here, which will eventually be published by GcsPublisher. Before this PR, ray.init will be one second faster, this test can be done before publishing this log, I tried to receive one more message and the result is as follows: image

Catch-Bull avatar Aug 09 '22 15:08 Catch-Bull

@rkooo567 I compared the windows tests before and after this PR, and it seems that the time-consuming of many tests has increased significantly, as well as these several reverts, I found that this PR will increase the probability of many test failures.

I have a new proposal:

  1. The sending time of the RPC request RegisterSelf is the same as before, so the node ready time is the same as before.
  2. I add a new RPC request RegisterAgentInfoToGcs, it will sending to GCS on method AgentManager::HandleRegisterAgent and write the agent info to GcsNodeInfo, I will write AgentManger::reported_agent_info_ on the reply callback of RegisterAgentInfoToGcs, This ensures that there will only be two results after RayConfig::instance().agent_register_timeout_ms() seconds, the AgentInfo is written to GCSNodeInfo, or the AgentInfo registration times out and the raylet exits.

Compared with the previous implementation, we need to process this case that the raylet registration is completed but there is no AgentInfo on the GCSNodeInfo, and the waiting time for the AgentInfo is at most A seconds.

Do you think we should fix those flaky tests or the new implementation?

Catch-Bull avatar Aug 09 '22 15:08 Catch-Bull

Do you think we should fix those flaky tests or the new implementation?

Yeah I think this solution seems reasonable. Agreed many tests are broken due to 1 second add_node delay (it is technically not this PR's fault, and the problem is that those tests are written badly... but we have no choice unfortunately)

but there is no AgentInfo on the GCSNodeInfo, and the waiting time for the AgentInfo is at most A seconds.

This is technically the same as before when we use internal KV, right?

rkooo567 avatar Aug 10 '22 05:08 rkooo567

@rkooo567 The difference is that both AgentInfo and NodeInfo exist on GCSNodeInfo, which is better than internalKV:

  1. InternalKV data will leak, there is no place to delete it after Raylet dies.
  2. node_head does not need to access internalKV when updating data, so the efficiency will be better

But timing issues should be the same as before.

Catch-Bull avatar Aug 10 '22 06:08 Catch-Bull

But timing issues should be the same as before.

Yeah this is what I meant! No brainer it's better including this information to node table instead of internal kv.

rkooo567 avatar Aug 17 '22 12:08 rkooo567

What's the status for this PR?

rkooo567 avatar Aug 26 '22 17:08 rkooo567

cc @Catch-Bull

rkooo567 avatar Sep 01 '22 16:09 rkooo567

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Oct 01 '22 17:10 stale[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Nov 07 '22 06:11 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Nov 22 '22 21:11 stale[bot]