ray
ray copied to clipboard
Revert "Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo …"
Reverts ray-project/ray#27613
@rkooo567
- windows://python/ray/tests:test_reference_counting: It seems to be related. After this PR, this test was slowed by about 80s, it cause timeout, I increased the timeout and it seems to work:(The failed one is
python/ray/tests/test_gcs_pubsub.py::test_two_subscribers
I will talk about it later) - windows://python/ray/tests:test_queue: It seems not related:
- test_trainer and Test tune restore: I'm not sure, I think this PR is not the root case, but it will increase the ratio of failed.
- python/ray/tests/test_gcs_pubsub.py::test_two_subscribers: I found that this test also has a low probability of failure. The root cause is that the monitoring process will initialize the
StandardAutoscaler
, and some logs will be printed here, which will eventually be published byGcsPublisher
. Before this PR,ray.init
will be one second faster, this test can be done before publishing this log, I tried to receive one more message and the result is as follows:
@rkooo567 I compared the windows tests before and after this PR, and it seems that the time-consuming of many tests has increased significantly, as well as these several reverts, I found that this PR will increase the probability of many test failures.
I have a new proposal:
- The sending time of the RPC request RegisterSelf is the same as before, so the node ready time is the same as before.
- I add a new RPC request
RegisterAgentInfoToGcs
, it will sending to GCS on methodAgentManager::HandleRegisterAgent
and write the agent info toGcsNodeInfo
, I will writeAgentManger::reported_agent_info_
on the reply callback ofRegisterAgentInfoToGcs
, This ensures that there will only be two results afterRayConfig::instance().agent_register_timeout_ms()
seconds, theAgentInfo
is written toGCSNodeInfo
, or the AgentInfo registration times out and the raylet exits.
Compared with the previous implementation, we need to process this case that the raylet registration is completed but there is no AgentInfo
on the GCSNodeInfo
, and the waiting time for the AgentInfo
is at most A seconds.
Do you think we should fix those flaky tests or the new implementation?
Do you think we should fix those flaky tests or the new implementation?
Yeah I think this solution seems reasonable. Agreed many tests are broken due to 1 second add_node delay (it is technically not this PR's fault, and the problem is that those tests are written badly... but we have no choice unfortunately)
but there is no AgentInfo on the GCSNodeInfo, and the waiting time for the AgentInfo is at most A seconds.
This is technically the same as before when we use internal KV, right?
@rkooo567
The difference is that both AgentInfo
and NodeInfo
exist on GCSNodeInfo
, which is better than internalKV
:
- InternalKV data will leak, there is no place to delete it after Raylet dies.
-
node_head
does not need to access internalKV when updating data, so the efficiency will be better
But timing issues should be the same as before.
But timing issues should be the same as before.
Yeah this is what I meant! No brainer it's better including this information to node table instead of internal kv.
What's the status for this PR?
cc @Catch-Bull
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!