Introduce Shard IDs to logically group nodes in cluster mode

Open PingXie opened this issue 2 years ago • 38 comments

  1. Added a new "shard_id" field to "cluster nodes" output and nodes.conf after "hostname"
  2. Added a new entry "shard_id" to "cluster shards" at the beginning of every shard
  3. Added a new PING extension to propagate "shard_id" (see the sketch after this list)
  4. Handled upgrade from pre-7.2 releases automatically
  5. Refactored PING extension assembling/parsing logic
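
For context on item 3: cluster PING extensions are variable-length records appended to the gossip message. Below is a minimal sketch of what a shard-id extension could look like; the struct and field names are illustrative, not the exact definitions in cluster.h.

#include <stdint.h>

/* Simplified sketch of a TLV-style PING extension carrying a shard id.
 * Names are illustrative; see cluster.h for the real definitions. */
typedef struct {
    uint16_t type;      /* extension type tag, e.g. "shard id" */
    uint16_t unused;    /* reserved/padding */
    uint32_t length;    /* total length of this extension record */
    char shard_id[40];  /* 40-char hex shard id, same shape as node ids */
} pingExtShardId;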

Behavior of Shard IDs:

  1. Replicas will always follow the shard of their reported primary: if a primary updates its shard ID, the replica will follow. (This need not hold for cluster v2.) A primary changing its shard ID is not an expected use case.

Release notes:
Introduce Shard IDs to logically group nodes in cluster mode based on replication. Shard IDs are automatically assigned.
Extend the `CLUSTER SHARDS` command to include the `SHARD ID` for all shards.

PingXie avatar Apr 05 '22 19:04 PingXie

@PingXie Given that we are past the code cut-off for Redis 7, we probably won't include this now; we will probably merge it after 7 goes GA.

madolson avatar Apr 05 '22 23:04 madolson

@madolson I see strong value in getting this change rolled out together with CLUSTER SHARDS, as it helps complete the shard scenario end to end. Additionally, it has externally visible impact in areas like CLUSTER NODES, CLUSTER SHARDS, and nodes.conf, which are already updated in 7.0, so it would be great if we could piggyback on the same release (as opposed to churning these areas release after release). I will close the remaining two issues today (the missing tests and a way to return a node's own shard_id). I am happy to prioritize this fix for 7.0 GA on my end.

@oranagra FYI

PingXie avatar Apr 05 '22 23:04 PingXie

@PingXie I see a lot of your points, and I agree with most of them. I have two concerns. Let's focus on the implementation; if it's ready, we can of course make the decision to merge it earlier.

  1. The replica should follow the shard ID of the primary, and reconcile itself automatically.
  2. You should be able to enforce a shard id, since I don't think shard ids should rotate if all nodes in a shard die.

madolson avatar Apr 06 '22 04:04 madolson

@PingXie I see a lot of your points, and I agree with most of them. I have two concerns. Let's focus on the implementation; if it's ready, we can of course make the decision to merge it earlier.

Sounds good @madolson.

  1. The replica should follow the shard ID of the primary, and reconcile itself automatically.

Yes, replicas will pick up their primary's shard id upon joining the shard via CLUSTER REPLICATE.

  1. You should be able to enforce a shard id, since I don't think shard ids should rotate if all nodes in a shard die.

Shard ids are persisted in nodes.conf. Nodes should be able to retrieve their previous shard ids on restart, assuming their nodes.conf files are not corrupted. In the case where replicas have lost their nodes.conf, they can still recover their shard ids via the gossip message from their primary, if the primary's nodes.conf is still good. If, for whatever reason, all nodes.conf files in the same shard are lost, we will rotate to a new shard id.
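
A minimal sketch of that recovery precedence, with hypothetical helper names (clusterNode and every function below are placeholders, not the actual implementation):

/* Hypothetical sketch of the shard-id recovery order described above;
 * all names are made up for illustration. */
void resolveShardId(clusterNode *myself) {
    /* 1. Prefer the id persisted in nodes.conf. */
    if (loadShardIdFromNodesConf(myself)) return;
    /* 2. A replica can recover the id from its primary's gossip. */
    if (nodeIsReplica(myself) && copyShardIdFromPrimary(myself)) return;
    /* 3. Otherwise rotate: generate a fresh 40-char hex shard id. */
    generateRandomShardId(myself->shard_id);
}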

I added a new command, CLUSTER MYSHARDID, to expose a node's shard id, similar to CLUSTER MYID. The reason for a new command is to reduce backcompat risk, but let me know if you think otherwise.
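
For illustration, the intended flow from a hypothetical replica's point of view (node and shard ids reused from the examples in this thread):

127.0.0.1:30005> CLUSTER REPLICATE 208c9c887b8d3c148ca3b19f6585748816be54cc
OK
127.0.0.1:30005> CLUSTER MYSHARDID
"5ec50a13c34fc1d4db518770c34c04d364b19593"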

I can update the documentation after we close this PR.

PingXie avatar Apr 06 '22 05:04 PingXie

@madolson I pushed another commit to make CLUSTER SHARDS work better with failed primaries, and this should make the shard id support complete.

Below are the externally visible changes (hence my preference to include them along with CLUSTER SHARDS in a major release).

  • nodes.conf - added a new field in the endpoint column before hostname
208c9c887b8d3c148ca3b19f6585748816be54cc 127.0.0.1:30002@40002,5ec50a13c34fc1d4db518770c34c04d364b19593,hostname-1 master - 0 1649868449278 2 connected 5461-10922
  1. This is not the most logical place, but it seems to me to be the only extensible column with minimal backcompat risk.
  2. With this change, since shard-id is always on, I moved it ahead of hostname. I am assuming this incompatible change is OK as we are still in the RC phase, but let me know if you have concerns.
  • CLUSTER SHARDS has a new shard-id row
1) "shard-id"
   2) "5ec50a13c34fc1d4db518770c34c04d364b19593"
  • CLUSTER SHARDS now groups failed primaries with their old replicas/new primary in the same shard; previously they were separated after a failover
2) 1) "shard-id"
   2) "9eac95730a6d80fdaac65bfbea09b74647677171"
   3) "slots"
   4) 1) "0"
      2) "5460"
   5) "nodes"
   6) 1)  1) "id"
          2) "f8e798e4ea98bb2339c1b07237aab93121b25664"
          3) "port"
          4) (integer) 30001
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "hostname"
         10) ""
         11) "role"
         12) "master"
         13) "replication-offset"
         14) (integer) 602
         15) "health"
         16) "fail"
      2)  1) "id"
          2) "d0595e25f43a6242482f7380b8039af5a88f4adb"
          3) "port"
          4) (integer) 30005
          5) "ip"
          6) "127.0.0.1"
          7) "endpoint"
          8) "127.0.0.1"
          9) "hostname"
         10) ""
         11) "role"
         12) "master"
         13) "replication-offset"
         14) (integer) 602
         15) "health"
         16) "online"

@zuiderkwast in case you are interested - this is to improve the native shard support started with CLUSTER SHARDS

PingXie avatar Apr 13 '22 17:04 PingXie

Regarding the form ip:port@cport,shard,hostname (which is already a pretty strange format without shard), I have two incompatible suggestions:

  1. Since every part so far uses a different separator (:@,), we could pick yet another separator for shard-id, for example $ (looks like S for shard). Then we would be able to parse these strings from 7.0 release candidates too (with hostname, without shard), so the format would be ip:port@cport[$shard][,hostname].
  2. Now might also be a good time to make this format future-proof in some way, so that more fields can be added without breaking old nodes and clients. For example, add a URI query string at the end, ?shard=s98s98dus98du&something=foo&bla=bla, where unknown future keys can be safely ignored.

WDYT?

Great point on future-proofing the format now, especially your 2nd point. It is going to be super hard to introduce new fields in the future if we don't come up with a systematic solution in 7.0, even with just the hostname field. Mulling over your suggestion now ...

PingXie avatar Apr 14 '22 17:04 PingXie

@zuiderkwast I think JSON would be the ideal long-term solution. A potential migration/upgrade strategy would be to attempt to load nodes.json first and, if not found, fall back to loading nodes.conf and immediately write out nodes.json. This is IMO better than hacking nodes.conf further. So one option could be to stick with ip:port@cport,shard[,hostname] in 7.0 and then make the JSON switch in 7.2. Thoughts?

Here is an example of nodes.json for illustration:

{
    "nodes": {
        "node": {
            "id": "fbc2329fc9e9f2c9f9ed9e6cdcee656df65d191c",
            "ip": "127.0.0.1",
            "port": "30001",
            "cluster-port": "40001",
            "hostname": "hostname-1",
            "role": "master",
            "shard-id": "11829057fd6af31adeb6db3ce43464e51defe0fd",
            "slots": [{
                    "start": 0,
                    "end": 10
                },
                {
                    "start": 10,
                    "end": 20
                },
                {
                    "start": 30,
                    "end": 40
                }
            ],
            "importing": [{
                    "slot": 9,
                    "from": "e6977cc6c45110a05f3e862717b1120b30589498"
                },
                {
                    "slot": 15,
                    "from": "45110a05f3e862717b1120b30589498e6977cc6c"
                }
            ],
            "migrating": [{
                    "slot": 12,
                    "to": "e6977cc6c45110a05f3e862717b1120b30589498"
                },
                {
                    "slot": 34,
                    "to": "45110a05f3e862717b1120b30589498e6977cc6c"
                }
            ]
        }
    }
}

Update - I think an in-place update scheme that reuses the same file name nodes.conf for the JSON content would work fine too.
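
A sketch of how such an in-place scheme could pick the loader, purely illustrative (this was a proposal only; loadJsonConfig and loadLegacyConfig are hypothetical):

#include <stdio.h>

int loadJsonConfig(FILE *fp);    /* hypothetical JSON loader */
int loadLegacyConfig(FILE *fp);  /* existing line-based loader */

/* Sniff the first byte: JSON content starts with '{', while the
 * legacy format starts with a 40-char hex node id. */
int clusterLoadConfigAuto(const char *path) {
    FILE *fp = fopen(path, "r");
    if (!fp) return 0;
    int first = fgetc(fp);
    rewind(fp);
    int ok = (first == '{') ? loadJsonConfig(fp) : loadLegacyConfig(fp);
    fclose(fp);
    return ok;
}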

PingXie avatar Apr 15 '22 01:04 PingXie

@madolson this change is ready. I moved shard-id behind hostname so it should be compatible with 7.0 GA. Can you please take a look?

Pre 7.0: ip:port[@cport]
7.0: ip:port[@cport] or ip:port[@cport],hostname
7.0 + shard-id: ip:port[@cport],,shard_id or ip:port[@cport],hostname,shard_id
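
A sketch of how a parser could stay compatible across all three forms; the function and its signature are illustrative only:

#include <string.h>

/* Illustrative parser for the trailing part of the address column.
 * Splitting on ',' means a missing hostname shows up as an empty
 * string ("...,,shard_id") and an absent field is simply NULL. */
void parseAddressColumn(char *col, char **hostname, char **shard_id) {
    *hostname = *shard_id = NULL;
    char *p = strchr(col, ',');
    if (!p) return;              /* pre-7.0: ip:port[@cport] only */
    *p++ = '\0';
    *hostname = p;               /* may be empty: ",,shard_id" */
    p = strchr(p, ',');
    if (!p) return;              /* 7.0: hostname but no shard id */
    *p++ = '\0';
    *shard_id = p;               /* shard id after (possibly empty) hostname */
}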

PingXie avatar Apr 28 '22 05:04 PingXie

FYI @yossigo @oranagra - in case @madolson's bandwidth is limited at the moment. This PR brings native shard-id to 7.0.0 in a fully compatible way. It is ready for code review. Please feel free to let me know if there is anything I can do to help make progress on this PR.

PingXie avatar May 02 '22 06:05 PingXie

@madolson can you please review this change when you get a chance? Thanks!

PingXie avatar May 18 '22 05:05 PingXie

@PingXie Ok!

I don't know about JSON; we don't have a JSON parser in Redis today, and I don't think we really want to take a dependency on one. I think there are easier and more extensible ways to improve nodes.conf without moving to JSON, though. In another PR I proposed having key/value fields alongside the slots data.

madolson avatar Jun 04 '22 00:06 madolson

I don't know about JSON; we don't have a JSON parser in Redis today, and I don't think we really want to take a dependency on one. I think there are easier and more extensible ways to improve nodes.conf without moving to JSON, though. In another PR I proposed having key/value fields alongside the slots data.

My primary point is to show that there is a way forward. I don't have a strong opinion that we should use JSON. Can you point me to your other PR where key/value fields were proposed? The idea sounds reasonable to me.

PingXie avatar Jun 09 '22 05:06 PingXie

My primary point is to show that there is a way forward. I don't have a strong opinion that we should use JSON. Can you point me to your other PR where key/value fields were proposed? The idea sounds reasonable to me.

It was discussed here: https://github.com/redis/redis/pull/9564. It still uses the older convention, though; one of the two PRs can implement it and we can cross-port it.

madolson avatar Jun 20 '22 22:06 madolson

@ushachar Will you take a look at this, since we presumably want the "shard ID" to be the same between cluster v2 and cluster v1?

madolson avatar Jun 21 '22 04:06 madolson

@ushachar Will you take a look at this, since we presumably want the "shard ID" to be the same between cluster v2 and cluster v1?

IDs in Flotilla are monotonically increasing numbers (padded in order to preserve compatibility with Cluster v1), but the general concept of this PR does seem to align well with our plans there.

ushachar avatar Jun 23 '22 22:06 ushachar

@ushachar One of the assumptions being discussed here is that ShardIDs may not necessarily be monotonically increasing, and may be externally imposed.

madolson avatar Jun 28 '22 03:06 madolson

@ushachar One of the assumptions being discussed here is that ShardIDs may not necessarily be monotonically increasing, and may be externally imposed.

I commented on the "monotonically increasing" property on @ushachar's cluster V2 spec.

PingXie avatar Jul 03 '22 20:07 PingXie

@madolson / @PingXie - Externally imposed is fine - the IDs allocated by Flotilla will still be monotonically increasing (internally, we maintain a counter and verify that the value is indeed unused before returning it)

ushachar avatar Jul 04 '22 13:07 ushachar

On the other thread we decided to go with the aux format:

823ca5eb86404530a2cd1f6beee7ed9c00e786fb 127.0.0.1:30001@40001,host-name master aux1:val1 aux2:val2 aux3:val3 ... - 0 1656441238093 1 connected 0-5460
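
A sketch of how a reader could consume such key:value aux fields; the function name and shape are illustrative (the real parsing lives in cluster.c):

#include <string.h>

/* Split one whitespace-separated token as a "key:value" aux field.
 * Unknown keys can simply be skipped by the caller, which is what
 * makes the format forward-compatible. The address column (which
 * also contains ':') is positional and never goes through this. */
int splitAuxField(char *tok, char **key, char **val) {
    char *sep = strchr(tok, ':');
    if (!sep) return 0;   /* not a key:value aux field */
    *sep = '\0';
    *key = tok;
    *val = sep + 1;
    return 1;
}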

madolson avatar Aug 23 '22 03:08 madolson

The only other change is that unit tests can now be written for cluster features so they run in the CI, in tests/unit/cluster/*, so all of your changes need to be moved to those files now. Other than that, I think we should be able to wrap this up quickly and merge.

madolson avatar Aug 23 '22 04:08 madolson

The only other change is that unit tests can now be written for cluster features so they run in the CI, in tests/unit/cluster/*, so all of your changes need to be moved to those files now. Other than that, I think we should be able to wrap this up quickly and merge.

Sounds good. Will move the unit tests to tests/unit/cluster/* next.

PingXie avatar Sep 06 '22 03:09 PingXie

@hwware @zuiderkwast the shard_id change is ready for your review. This PR includes the auxiliary field support we agreed upon in PR https://github.com/redis/redis/pull/9564 (see https://github.com/redis/redis/pull/9564#discussion_r956177431). I also refactored the logic that builds various PING extensions, including both HOSTNAME and FORGOTTEN_NODE. Please ignore PR #11239 (same payload but wrong branch).

PingXie avatar Sep 06 '22 03:09 PingXie

The only other change is that unit tests can now be written for cluster features so they run in the CI, in tests/unit/cluster/*, so all of your changes need to be moved to those files now. Other than that, I think we should be able to wrap this up quickly and merge.

Sounds good. Will move the unit tests to tests/unit/cluster/* next.

This test conversion is harder than I thought. @madolson I don't want to block your code review on the test code refactoring. I can take your feedback on other parts of the changes while working through the test issues.

PingXie avatar Sep 09 '22 06:09 PingXie

This test conversion is harder than I thought. @madolson I don't want to block your code review on the test code refactoring. I can take your feedback on other parts of the changes while working through the test issues.

Sorry :). Once we get all the tests converted life will be easier, but for now it's a bit of a pain since the two frameworks don't have all the same functions.

madolson avatar Sep 11 '22 00:09 madolson

…making sure replicas follow primaries correctly during slot migrations. We'll also be able to support the "empty" primary with replicas use case.

Thanks @madolson! I have follow-up PRs to address both callouts (replicas following primaries and empty primaries) in the pipeline.

PS, the feature gap between the old and new test infra is indeed the blocker. I am still trying to find a way to create a new Redis node and control how it is introduced to the cluster. I will continue the refactoring work, but I am also open to moving the test refactoring to another PR and unblocking other PRs that depend on the aux field change.

PingXie avatar Sep 14 '22 01:09 PingXie

@PingXie I'm fine punting it out of this PR to unblock, as you mentioned. Let me know if I can help, since I am the main one pushing for the refactor.

madolson avatar Sep 14 '22 01:09 madolson

Core group approved, but waiting on @yossigo to take a look before merging.

madolson avatar Sep 22 '22 15:09 madolson

I think CLUSTER SHARDS is a client-oriented command (as opposed to an administrative command), and I don't see the value to clients of getting a shard-id. It may not do much harm, but I think it's an unnecessary long-term commitment to an implementation detail.

I am certainly biased here, but here are a few thoughts:

  1. I think CLUSTER SHARDS is a very useful admin command as well, and more so than CLUSTER NODES, thanks to its native RESP format;
  2. Identity is an integral part of any "object", so having a CLUSTER SHARDS command that outputs a list of shards without their identities leaves a cognitive hole IMO;
  3. Just as we spec'ed out the shard id format and its properties in cluster V2, the V1 shard id, along with its format choice (40 bytes), is not an implementation detail IMHO;
  4. In the future, we could introduce commands that take shard ids as a parameter, such as CLUSTER MIGRATE SLOT <n> SHARD <shard-id>. So being able to retrieve the shard-id easily using a RESP parser is valuable;
  5. This is just a minor point, but we currently print out the node id in CLUSTER SHARDS too.

@zuiderkwast curious to hear your thoughts on whether to include shard-id in the CLUSTER SHARDS output.

PingXie avatar Sep 23 '22 23:09 PingXie

I agree with Ping here. Good points.

zuiderkwast avatar Sep 24 '22 20:09 zuiderkwast

@PingXie Some counterpoints:

  1. I actually think CLUSTER SHARDS is a bad admin command. Why? It doesn't expose all of the internal state of cluster nodes. We ideally should also include additional metadata such as ongoing data migrations. The intention, or at least the one I had, for CLUSTER SHARDS was that it would be a better replacement for CLUSTER SLOTS. While talking with Yossi, we thought that maybe we should introduce a new admin-focused command to replace CLUSTER NODES. I will advocate for that here.
  2. Kind of, but going back to the idea that this is primarily for clients: what will "clients" do with this mapping? Nothing; they don't actually care.
  3. Sure? I'm fine stamping it as part of the spec here anyway, since I'm not arguing against exposing it in CLUSTER MYSHARDID.
  4. See recommendation on 1.
  5. Clients should use node ID to understand continuity. When they see a node_id in a new shard, they should understand that it was moved to that shard and tear down all relevant connections to the old node_id.

Most of this presumes that we should build a new command, CLUSTER ADMININFO or something: an administration command that shows administration info. EDIT: I also want to provide a pretty crisp boundary between the "admin" command and the "client" commands, since it's usually two different groups accessing these commands.

madolson avatar Sep 25 '22 13:09 madolson