[Core] GCS FT with redis sentinel
Why are these changes needed?
We want to use redis sentinel to support Ray GCS FT. I opened a ticket here: https://github.com/ray-project/ray/issues/46983. Redis sentinel should provide high availability without worrying too much about redis cluster operations.
Related issue number
Closes #46983
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [ ] Unit tests
- [x] Release tests
- [ ] This PR is not tested :(
@jjyao @rkooo567 would you be able to help review this?
hey @rkooo567 gentle ping on this PR.
Hey, gentle ping on this. Could I get a review please? Thank you!
@kanwang per Ray Slack; aiming to get to this early next week
hey @anyscalesam Any movement here? Sorry to keep bugging you guys!
Hi @kanwang sorry, the team is a bit busy right now. We will review it ASAP.
Hey team, any chance that we can get this reviewed this week or early next week? Thank you! cc @jjyao
hey @jjyao gentle ping on this. thank you!
hey @jjyao checking on this again since all tests passed now.
Hi @kanwang I'll take another review this week.
Thanks! Addressed the feedback.
Have you tested with real Redis Sentinel?
Yes, I did test it with Redis Sentinel locally (against a remote Redis Sentinel instance).
When we connect to the primary, IsRedisSentinel will return false now?
That is correct. Sentinel is basically a special Redis instance that runs in front of a Redis primary and its replicas. It knows the topology and failover state of the primary and its replicas. Once you've queried the primary address from Sentinel, you connect to the Redis instance directly.
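To illustrate the flow described above: discovery is a single command sent to the Sentinel node, and once the two-element `[host, port]` reply is parsed, all further traffic goes to the returned address rather than to Sentinel. A minimal sketch of the wire encoding (the master-set name `mymaster` and the reply values are placeholders, not from this PR):

```python
def encode_resp(*args: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for a in args:
        raw = a.encode()
        parts.append(b"$" + str(len(raw)).encode() + b"\r\n" + raw + b"\r\n")
    return b"".join(parts)

# The one-shot discovery query sent to the Sentinel node.
# "mymaster" is a placeholder master-set name.
query = encode_resp("SENTINEL", "get-master-addr-by-name", "mymaster")

# Sentinel replies with a two-element array: [host, port]. After this,
# the client talks to that address directly, not to Sentinel.
host, port = "127.0.0.1", "6379"  # example parsed reply values
primary_addr = (host, int(port))
```

This also matches the observation that `IsRedisSentinel` returns false after the switch: the post-discovery connection is an ordinary Redis connection.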
One thing that is still missing is reconnecting. When Redis fails over, the primary/replica information is updated in Sentinel, but Ray won't reconnect until the head restarts. That's acceptable for us for the moment. I opened an issue about this before (https://github.com/ray-project/ray/issues/46982), and I saw there is a more recent issue being worked on (https://github.com/ray-project/ray/issues/47419). They are not exactly the same issue (vs. Sentinel reconnect), but in general, if the head tries to reconnect to Redis instead of failing, I think all cases will be solved. I'll see if there are more updates there, or I can try to contribute that too.
Let me know when it's ready for another review.
Thank you for approving! Looks like all tests passed. Can you help with merging?
Hi @kanwang, does this change already fit your needs? I am a little concerned about the implementation because it apparently does not fully follow the protocol of Redis Sentinel and Redis Cluster.
Hey @rueian It does solve our problem at the moment. Based on my understanding this is how a client should be connecting to Redis Sentinel. It is not handling reconnection, which probably requires a bigger change in Ray. What particular part of the protocol is it not following for Redis Sentinel? Let me know - if it's critical/breaking I can send a fix too.
Hi @kanwang, good to know your problem has been solved.
However, the reconnection issue is exactly what I am concerned about. Both Redis Sentinel and Redis Cluster require clients to monitor their topology proactively and connect to the right nodes accordingly, not just at the beginning but throughout the application's lifetime. I know this kind of active monitoring is extremely difficult to implement. But if we don't do that, we probably cannot say publicly that GCS FT supports Sentinel and Cluster, because it still fails on Redis topology changes. An easier alternative is querying the master before sending a command every time. Also, the way of querying the master should be `SENTINEL get-master-addr-by-name <master-name>`, not `SENTINEL MASTERS`, because there could be multiple master sets in a Sentinel cluster.
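To make the multiple-master-set point concrete: `SENTINEL MASTERS` returns one entry per monitored master set, so a client that just takes the first entry can connect to the wrong primary. A sketch of name-keyed selection over a parsed `SENTINEL MASTERS` reply (the master-set names and addresses below are made up; Sentinel's reply entries do include `name`, `ip`, and `port` fields):

```python
def pick_master_addr(masters_reply, master_name):
    """Given a parsed `SENTINEL MASTERS` reply (a list of field/value dicts,
    one per monitored master set), return (host, port) of the named master,
    or None if that master set is not monitored by this Sentinel."""
    for m in masters_reply:
        if m.get("name") == master_name:
            return m["ip"], int(m["port"])
    return None

# A Sentinel monitoring two master sets: filtering by name is required.
reply = [
    {"name": "cache", "ip": "10.0.0.5", "port": "6379"},
    {"name": "gcs", "ip": "10.0.0.9", "port": "6380"},
]
gcs_primary = pick_master_addr(reply, "gcs")
```

Using `SENTINEL get-master-addr-by-name <master-name>` performs this selection server-side and is the simpler choice.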
In terms of Redis Cluster, the `DEL DUMMY` approach is brilliant, but there are a few issues:
- The dummy command may need to pass authentication first.
- The master of `DUMMY` is likely not the master of our real command.
- Actually, instead of pre-flighting a dummy command, we should just handle the `MOVED` error every time, since the topology may change at any time.
Implementing full support for Redis Sentinel and Redis Cluster is extremely hard. Not to mention that even with the aggressive reconnection approaches above, we still need to keep discovering the latest topology of Sentinel and Cluster; otherwise, GCS FT will still fail if all preconfigured Redis addresses are down.
@rueian
Our current solution is that when the topology changes, GCS will exit and be restarted, and during restart it will find the new master. It's not ideal; we want GCS to automatically find the new master without exiting, which is what @MortalHappiness is working on.
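The exit-and-restart recovery described above amounts to a loop: each (re)start re-queries Sentinel for the current primary and connects to whatever it returns. A simulation sketch, where `discover_master` and `connect` are hypothetical callables standing in for the Sentinel query and the Redis connection (all addresses below are made up):

```python
def connect_with_failover(discover_master, connect, max_restarts=3):
    """Sketch of restart-style recovery: each attempt re-discovers the
    current primary from Sentinel and connects to it, so a failover is
    healed on the next (re)start rather than by live reconnection."""
    for _ in range(max_restarts):
        host, port = discover_master()
        try:
            return connect(host, port)
        except ConnectionError:
            continue  # primary changed under us; re-discover on next attempt
    raise RuntimeError("no reachable Redis primary after failover retries")

# Simulated failover: the first discovered primary is down, the second works.
addrs = iter([("10.0.0.1", 6379), ("10.0.0.2", 6379)])
def fake_discover():
    return next(addrs)
def fake_connect(host, port):
    if host == "10.0.0.1":
        raise ConnectionError("old primary is down")
    return f"connected to {host}:{port}"

result = connect_with_failover(fake_discover, fake_connect)
```

Live reconnection (re-running discovery inside the connection layer instead of exiting) would follow the same loop without the process restart.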