
Fix node replace with tablets for RF=N

Open tgrabiec opened this issue 1 year ago • 26 comments

This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated.

The solution taken here is not to drain tablet replicas from the replaced node during the topology request, but to let that happen later, after the replaced node is in the left state and the replacing node is in the normal state.

On boot, the replacing node waits for this draining to complete before it is considered booted.
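A minimal sketch of that boot-time barrier, using hypothetical stand-in types (`host_id`, `tablet_metadata`, a polling reload callback) rather than the actual seastar-based code:

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real Scylla types.
using host_id = unsigned long long;

struct tablet_metadata {
    std::vector<std::vector<host_id>> replica_sets; // one replica set per tablet

    bool references(host_id h) const {
        for (const auto& rs : replica_sets) {
            for (host_id r : rs) {
                if (r == h) {
                    return true;
                }
            }
        }
        return false;
    }
};

// Boot barrier: keep re-reading the (hypothetical) tablet metadata until no
// tablet still lists the replaced host as a replica, i.e. until the
// scheduler-driven rebuild has drained it. Only then does boot complete.
void wait_for_replaced_node_drained(host_id replaced,
                                    const std::function<tablet_metadata()>& reload) {
    while (reload().references(replaced)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```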

Fixes https://github.com/scylladb/scylladb/issues/17025

Nodes in the left state will be kept in tablet replica sets for a while after the node replace is done, until the new replica is rebuilt. So we need to know those nodes' location (dc, rack) for two reasons:

  1. algorithms which work with replica sets filter nodes based on their location. For example, the materialized views code, which pairs base replicas with view replicas, filters by datacenter first.

  2. tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in the token ring, and are not considered members of the ring (the datacenter endpoints exclude them).

In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets.

Currently, left nodes are never removed from topology, so they will accumulate in memory. We could garbage-collect them from the topology coordinator once a left node is absent from every replica set. That means we need a new state - left_for_real.

tgrabiec avatar Feb 19 '24 00:02 tgrabiec

This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated.

There's something basic I don't understand. With tablets, why do we even have an operation called "replace"? You could, instead:

  1. Add an N+1th node, which starts without any tablets.
  2. Now tablets start to move from other nodes to the N+1th node. If the leaving node wants to leave, it will send all its tablets away; probably many will reach the new N+1th node, which is the least loaded.
  3. Finally, the leaving node has no remaining tablets and can be removed from the cluster. In this case, there is no point at which we would declare the replace to have "failed" or ever require a "rollback". If the leaving node decides not to leave, it simply stops sending its tablets away. If the new N+1th node decides it doesn't really want to join, it starts moving its tablets back to other nodes and then it can leave.

In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial and don't actually move tablets, which are later moved tablet by tablet. It sounds like here you're going in the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy?

nyh avatar Feb 19 '24 18:02 nyh

Tablets can't migrate away from a node that is down.

Replace should be equivalent to adding a new node and rebuilding tablets on the missing node, so indeed it should decompose into two simpler independent operations.

avikivity avatar Feb 19 '24 20:02 avikivity

Looks good except that I don't like the new coordinator column.

avikivity avatar Feb 19 '24 20:02 avikivity

In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial, and don't actually move tablets which are later moved tablet by tablet. It sounds here you're going the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy?

We could easily drop the wait during the replacing node's boot and let it run in the background, but there is a problem with it. The admin currently expects that once replace is done, availability is restored, so he can proceed to do a rolling restart safely, for example. If he did that before tablets are rebuilt, then some tablets could end up with two replicas down. So the wait is there to preserve this expectation.

tgrabiec avatar Feb 19 '24 23:02 tgrabiec

The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state.

The replacing node waits for this draining to be complete on boot before the node is considered booted.

TBH this does not look right to me. What happens if the new node fails after it becomes normal? From the topology coordinator's POV the transition has already completed and the topology is stable.

I do not understand why the generic topology code should even care about the situation you described; the solution should be entirely tablets-internal: when a node fails in the RF=N situation, we can mark all affected tablets as "degraded" (the flag should be per DC) and wait for any topology change to complete. On topology change completion, check if some degraded tablets can be fixed and start fixing them if possible.

We can wait on boot for all degraded tablets to disappear, but IMO it is better to amend the rolling restart procedure to wait externally (we should provide an API to query it, of course).

gleb-cloudius avatar Feb 20 '24 09:02 gleb-cloudius

The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted.

TBH this does not look right to me. What happens if the new node failed after it become normal? From topology coordinator POV the transition completed already and the topology is stable.

Rebuild will continue in the background, the replacing node is normal.

I do not understand why the generic topology code should even care about the situation you described and the solution should be entirely tablets internal: When node fails in the RF=N situation we can mark all affected tablets as "degraded" (the flag should be per DC) and wait for any topology change to complete. On topology change completion check if some degraded tablets can be fixed and start fixing them if possible.

Having a replica in the "left" state is like marking the tablet as "degraded". What do you mean by "generic topology code should even care"?

We can wait on boot for all degraded tables to disappear, but IMO it is better to amend rolling restart procedure to wait externally (we should provide an API to query it of course)

The long term plan is to provide such an API, which waits for availability to be restored. The short term plan is to not change current procedures.

tgrabiec avatar Feb 20 '24 09:02 tgrabiec

The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted.

TBH this does not look right to me. What happens if the new node failed after it become normal? From topology coordinator POV the transition completed already and the topology is stable.

Rebuild will continue in the background, the replacing node is normal.

What if the new replacing node also dies after it becomes normal but before the rebuild completes?

I do not understand why the generic topology code should even care about the situation you described and the solution should be entirely tablets internal: When node fails in the RF=N situation we can mark all affected tablets as "degraded" (the flag should be per DC) and wait for any topology change to complete. On topology change completion check if some degraded tablets can be fixed and start fixing them if possible.

Having replica in the "left" state is like marking the tablet as "degraded".

But it is not enough. You also need to preserve the replication state of the node, and then you propose to introduce a new state to get rid of it. Why not preserve the information you need for tablet repair in tables and get rid of it the moment it is no longer needed?

You also say that

" algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first."

Why do those algorithms care about a non-existing node?

What do you mean by "generic topology code should even care"?

The non-tablets topology coordinator is not supposed to do anything special about tablets that happen to not have enough replicas.

We can wait on boot for all degraded tables to disappear, but IMO it is better to amend rolling restart procedure to wait externally (we should provide an API to query it of course)

The long term plan is to provide such an API, which waits for availability to be restored. The short term plan is to not change current procedures.

If we mark tablets as degraded, it is trivial to provide an API that checks for the existence of such tablets.

gleb-cloudius avatar Feb 20 '24 11:02 gleb-cloudius

The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted.

TBH this does not look right to me. What happens if the new node failed after it become normal? From topology coordinator POV the transition completed already and the topology is stable.

Rebuild will continue in the background, the replacing node is normal.

What if the new replacing node also died after it became normal but before rebuild completed.

Rebuild is not coordinated by the replacing node, but by the central tablet scheduler. Replacing node dying is no different from any other node dying. The new replica for rebuild is chosen among all normal nodes.
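For illustration only, a rough sketch of that kind of selection with hypothetical simplified types; restricting the choice to the same DC and ignoring load, shard counts and rack constraints are assumptions of this sketch, not the real scheduler's logic:

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified types.
using host_id = unsigned long long;

struct node_info {
    host_id host;
    std::string dc;
    std::string rack;
    bool is_normal; // false for nodes in the "left" state
};

// Pick a destination for rebuilding the replica that used to live on a node
// now in the "left" state: any normal node in the same DC that does not
// already hold a replica of this tablet.
std::optional<host_id> choose_rebuild_destination(
        const std::vector<node_info>& topology,
        const std::vector<host_id>& tablet_replicas,
        const node_info& left_replica) {
    for (const auto& n : topology) {
        bool already_replica = std::find(tablet_replicas.begin(), tablet_replicas.end(),
                                         n.host) != tablet_replicas.end();
        if (n.is_normal && n.dc == left_replica.dc && !already_replica) {
            return n.host;
        }
    }
    return std::nullopt; // with RF=N and no spare node, no viable destination
}
```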

I do not understand why the generic topology code should even care about the situation you described and the solution should be entirely tablets internal: When node fails in the RF=N situation we can mark all affected tablets as "degraded" (the flag should be per DC) and wait for any topology change to complete. On topology change completion check if some degraded tablets can be fixed and start fixing them if possible.

Having replica in the "left" state is like marking the tablet as "degraded".

But it is not enough. You also need to preserve the replication state of the node and then you prose to introduce a new state to get read of it. Why not preserve the information you need for tablet repair in tables and get rid of it the moment it is not needed any longer?

You also say that

" algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first."

Why those algorithm cares about non existing node?

They care about replica pairing between base and view tables. Example:

base: {n1, n2, n3} view: {n4, n5, n6}

n2 is paired with n5, and that's implicit in the replica order.

If we replace n5, and just remove the node from the replica set, we'd have:

base: {n1, n2, n3} view: {n4, n6} [degraded]

and n2 would now incorrectly pair with n6.

By keeping the left node until the tablet is rebuilt we preserve pairing. The replacement will jump in place of the left node, e.g.:

base: {n1, n2, n3} view: {n4, n7, n6}

So we also need to remember the position.
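To make the positional-pairing point concrete, here is a toy version of the pairing logic (plain strings stand in for the real replica types; this is not the actual view-pairing code):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

using host_id = std::string;

// Base and view replicas are paired implicitly by position in the replica set.
host_id paired_view_replica(const std::vector<host_id>& base,
                            const std::vector<host_id>& view,
                            const host_id& base_replica) {
    for (size_t i = 0; i < base.size(); ++i) {
        if (base[i] == base_replica) {
            return view[i];
        }
    }
    throw std::runtime_error("not a base replica");
}

int main() {
    std::vector<host_id> base = {"n1", "n2", "n3"};
    std::vector<host_id> view = {"n4", "n5", "n6"};
    assert(paired_view_replica(base, view, "n2") == "n5");

    // If n5 were simply dropped from the view replica set, n2 would silently
    // pair with n6, which is wrong. Keeping the left node in place (and later
    // putting the rebuilt replica n7 in the same position) preserves pairing.
    std::vector<host_id> view_after_rebuild = {"n4", "n7", "n6"};
    assert(paired_view_replica(base, view_after_rebuild, "n2") == "n7");
}
```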

Just keeping the node still there is simpler than recording this information aside, since algorithms can stay symmetric without special-casing for it.

We could keep it aside in a new data structure, but that would make things more complicated, since this information would have to be integrated back with the main replica set. Why invent a new data structure if we already have one (system.topology and system.tablets)?

What do you mean by "generic topology code should even care"?

The non tables topology coordinator does not suppose to do anything special about tables that happen to have not enough replicas.

Which part are you referring to? Loading the nodes in the left state? The main algorithm is unchanged.

We can wait on boot for all degraded tables to disappear, but IMO it is better to amend rolling restart procedure to wait externally (we should provide an API to query it of course)

The long term plan is to provide such an API, which waits for availability to be restored. The short term plan is to not change current procedures.

If we mark tables as degraded it is trivial to provide an api that checks for existence of such tables.

Yes, but it can also be computed from state like in this PR.

tgrabiec avatar Feb 20 '24 16:02 tgrabiec

They care about replica pairing between base and view tables. Example:

base: {n1, n2, n3} view: {n4, n5, n6}

n2 is paired with n5, and that's implicit in the replica order.

If we replace n5, and just remove the node from the replica set, we'd have:

base: {n1, n2, n3} view: {n4, n6} [degraded]

Make it view: {n4, degraded, n6}

and n2 would now incorrectly pair with n6.

By keeping the left node until the tablet is rebuilt we preserve pairing. The replacement will jump in place of the left node, e.g.:

base: {n1, n2, n3} view: {n4, n7, n6}

So we also need to remember the position.

Just keeping the node still there is simpler than recording this information aside, since algorithms can stay symmetric without special-casing for it.

We could keep it aside in a new data structure, but that would make things more complicated, since this information would have to be integrated back with the main replica set. Why invent a new data structure if we already have one (system.topology and system.tablets)?

Because we cannot pin that information forever like the patch is doing.

Which part are you referring to? Loading the nodes in the left state? The main algorithm is unchanged.

Yes. Left nodes are left. We should not keep their information forever in memory. After a node is left, its host id is used only for banning. In fact, we can free even the list of ids that we have now in the topology sm after we ban them. So what tablets should do is record all the information they need to do the rebuild and drop it the moment it is no longer needed.

Yes, but it can also be computed from state like in this PR.

It should be calculated once during removenode, not each time it is needed because this pins the information that is used for the calculation forever.

gleb-cloudius avatar Feb 20 '24 16:02 gleb-cloudius

They care about replica pairing between base and view tables. Example:

base: {n1, n2, n3} view: {n4, n5, n6}

n2 is paired with n5, and that's implicit in the replica order.

If we replace n5, and just remove the node from the replica set, we'd have:

base: {n1, n2, n3} view: {n4, n6} [degraded]

Make it view: {n4, degraded, n6}

That would make the get_replicas() API inconvenient. Instead of returning a vector<host_id> it would return vector<variant<host_id, degraded>>, and everyone using it would have to handle the case of a degraded replica.

It also means that we cannot store this information in system.tablets, which has replicas as a list. And we'd have to integrate it back whenever replicas are queried, which adds runtime overhead.

We wanted to have dense tablet metadata, and store node* in the replica set in memory.
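A quick sketch of the two API shapes being contrasted; `degraded_t` and these function bodies are hypothetical placeholders, not the actual erm interface:

```cpp
#include <cstdio>
#include <variant>
#include <vector>

using host_id = unsigned long long;

// The dense shape: the replica set stays a plain list; a replica that was on a
// replaced node is simply the host_id of a node now in the "left" state.
std::vector<host_id> get_replicas_plain() {
    return {1, 2, 3};
}

// The alternative discussed above: a placeholder per missing replica. Every
// caller now has to branch on the variant, and the replicas column in
// system.tablets could no longer be a plain list of host ids.
struct degraded_t {};
using maybe_replica = std::variant<host_id, degraded_t>;

std::vector<maybe_replica> get_replicas_with_degraded() {
    return {host_id{1}, degraded_t{}, host_id{3}};
}

int main() {
    for (const auto& r : get_replicas_with_degraded()) {
        if (auto* h = std::get_if<host_id>(&r)) {
            std::printf("replica on host %llu\n", *h);
        } else {
            std::printf("degraded slot, caller must special-case it\n");
        }
    }
}
```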

and n2 would now incorrectly pair with n6.

By keeping the left node until the tablet is rebuilt we preserve pairing. The replacement will jump in place of the left node, e.g.:

base: {n1, n2, n3} view: {n4, n7, n6}

So we also need to remember the position.

Just keeping the node still there is simpler than recording this information aside, since algorithms can stay symmetric without special-casing for it.

We could keep it aside in a new data structure, but that would make things more complicated, since this information would have to be integrated back with the main replica set. Why invent a new data structure if we already have one (system.topology and system.tablets)?

Because we cannot pin that information forever like the patch is doing.

If this is the main problem, it can be easily fixed by adding garbage collection to tablet scheduling which drops those nodes once no tablet references them.

Which part are you referring to? Loading the nodes in the left state? The main algorithm is unchanged.

Yes. Left nodes are left. We should not keep their information forever in memory. After a node is left its host id is used only for banning. In fact we can free even the list of ids that we have now in the topology sm after we ban them. So what tablets should do it to record all the information they need to do rebuild and drop it the moment it is not needed any longer.

But is there a problem with keeping them there? system.topology rules can be adjusted to serve tablet needs, this is part of the same project.

Yes, but it can also be computed from state like in this PR.

It should be calculated once during removenode, not each time it is needed because this pins the information that is used for the calculation forever.

Once all tablets are rebuilt, we can drop this information. If we didn't keep the information in system.topology, we'd have to keep it in a new table, and have exactly the same problem.

tgrabiec avatar Feb 20 '24 17:02 tgrabiec

That would make the get_replicas() API inconvenient. Instead of returning a vector<host_id> it would return vector<variant<host_id, degraded>>, and everyone using it would have to handle the case of a degraded replica.

It also means that we cannot store this information in system.tablets, which has replicas as a list. And we'd have to integrate it back whenever replicas are queried, which adds runtime overhead.

Can we have a special UUID that will indicate that no mapping is available for the MV right now?

Once all tablets are rebuilt, we can drop this information. If we didn't keep the information in system.topology, we'd have to keep it in a new table, and have exactly the same problem.

Drop how? We reload the information on each state change. Unless we introduce a new node state like you proposed, I do not see how exactly it can be done. If we move the information into a special place though (not sure a new table is needed, maybe an existing one can be extended), we can simply remove it from there the moment migration is completed.

If this is the main problem, it can be easily fixed by adding garbage collection to tablet scheduling which drops those nodes once no tablet references them.

Again, drop from where? Unless we put them in a special place from where they can be garbage collected, I do not see what this garbage collector would do. Remove left nodes from the topology table? But we explicitly do not want to do that.

But is there a problem with keeping them there? system.topology rules can be adjusted to serve tablet needs, this is part of the same project.

Of course. But the adjustment proposed in this series is "let's keep all the data about all left nodes in memory forever" and I do not think that is reasonable.

What about the tablet code loading the information about left node on demand when needed and dropping it after use?

gleb-cloudius avatar Feb 20 '24 17:02 gleb-cloudius

That would make the get_replicas() API inconvenient. Instead of returning a vector<host_id> it would return vector<variant<host_id, degraded>>, and everyone using it would have to handle the case of a degraded replica. It also means that we cannot store this information in system.tablets, which has replicas as a list. And we'd have to integrate it back whenever replicas are queried, which adds runtime overhead.

Can we have special UUID that will indicate that no mapping is available for the MV right now?

I asked for it (see above), but we also need to know what rack and datacenter the replica was on. A special uuid won't remember that.

avikivity avatar Feb 20 '24 17:02 avikivity

In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial, and don't actually move tablets which are later moved tablet by tablet. It sounds here you're going the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy?

We could easily drop the wait during replacing node boot and let it run in the background, but there is a problem with it. Admin currently expects that once replace is done, availability is restored. So he can proceed to do rolling restart safely, for example. If he did it before tablets are rebuilt, then some tablets could end up with two replicas down. So the wait is to preserve this expectation.

I'm suggesting that there won't be an operation called a "replace". There will be an "add a new node" (without data) and a "remove a live node" (which sends all its data out, probably a lot of it to the relatively vacant new node). The administrator can wait for the "remove a live node" operation to finish if he wants. Why do we need an operation called "replace" - just because we have a nodetool command with that name?

Note that while "replace" always replaces one node, with tablets we can add and remove multiple nodes together. It will be more powerful than what we can do with "nodetool replace".

nyh avatar Feb 20 '24 22:02 nyh

That would make the get_replicas() API inconvenient. Instead of returning a vector<host_id> it would return vector<variant<host_id, degraded>>, and everyone using it would have to handle the case of a degraded replica.

It also means that we cannot store this information in system.tablets, which has replicas as a list. And we'd have to integrate it back whenever replicas are queried, which adds runtime overhead.

Can we have special UUID that will indicate that no mapping is available for the MV right now?

Once all tablets are rebuilt, we can drop this information. If we didn't keep the information in system.topology, we'd have to keep it in a new table, and have exactly the same problem.

Drop how? We reload the information on each state change. Unless we introduce new node state like you proposed I do not see how exactly it can be done. If we move the information into a special place though (not sure new table is needed, May be existing one can be extended), we can simply remove it from there the moment migration is completed.

When the tablet scheduler sees that there is a left node in topology (loaded) which is no longer referenced by any tablet, it will emit a group0 update to mark those nodes as on-disk only. It doesn't have to be a node state, it can be another flag on the row. It will not be loaded to memory anymore.
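A rough sketch of that garbage-collection rule, with hypothetical types; the real version would emit a group0 mutation rather than flip a flag in memory:

```cpp
#include <unordered_set>
#include <vector>

using host_id = unsigned long long;

enum class node_state { normal, left };

struct topology_node {
    host_id host;
    node_state state;
    bool on_disk_only = false; // the proposed flag: skip loading into memory
};

// Mark every "left" node that no tablet references any more, so later reloads
// can skip it. Referenced left nodes stay loaded until their tablets are rebuilt.
void gc_left_nodes(std::vector<topology_node>& topology,
                   const std::vector<std::vector<host_id>>& tablet_replica_sets) {
    std::unordered_set<host_id> referenced;
    for (const auto& rs : tablet_replica_sets) {
        referenced.insert(rs.begin(), rs.end());
    }
    for (auto& n : topology) {
        if (n.state == node_state::left && !referenced.count(n.host)) {
            n.on_disk_only = true; // in reality: a group0 update to system.topology
        }
    }
}
```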

If this is the main problem, it can be easily fixed by adding garbage collection to tablet scheduling which drops those nodes once no tablet references them.

Again, drop from where? Unless we put them in a special place from where they can be garbage collected, I do not see what this garbage collector would do. Remove left nodes from the topology table? But we explicitly do not want to do that.

But is there a problem with keeping them there? system.topology rules can be adjusted to serve tablet needs, this is part of the same project.

Of course. But the adjustment proposed in this series is "lets keep all the data about all left nodes in memory forever" and I do not think it is reasonable.

I wonder how much of a problem it actually is. We keep the ids in memory now so it's O(left nodes) already.

What about the tablet code loading the information about left node on demand when needed and dropping it after use?

It's more tricky. To know which nodes to load, you need to read tablet metadata. But to build tablet metadata, you may want to have those nodes in topology already to store node*. We'd have to split the process into two steps.

tgrabiec avatar Feb 21 '24 09:02 tgrabiec

When the tablet scheduler sees that there is a left node in topology (loaded) which is no longer referenced by any tablet, it will emit a group0 update to mark those nodes are on-disk only. It doesn't have to be a node state, it can be another flag on the row. It will not be loaded to memory anymore.

Can we get the info from the tablet metadata instead? What bothers me is that we already have this info. Tablets know that they have a replica on a certain node.

I wonder how much of a problem it actually is. We keep the ids in memory now so it's O(left nodes) already.

This list is used for banning only, which is done during state load. It can be dropped from memory easily without any change to the on-disk format. It is also possible to ban nodes while reading from the table and make the whole process O(1). We should not make it mandatory to always have it in memory.

It's more tricky. To know which nodes to load, you need to read tablet metadata.

You have it in memory, no?

But to build tablet metadata, you may want to have those nodes in topology already to store node*. We'd have to split the process into two steps.

So do you also want to have left nodes in token_metadata as well? Then making sure that only relevant nodes are loaded is even more important.

Also, is it a good idea to have left nodes in token_metadata? We are breaking invariants for a very rare use case. Maybe @nyh is right and we can simply refuse to replace a node in such a situation and require add/remove instead.

gleb-cloudius avatar Feb 21 '24 10:02 gleb-cloudius

In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial, and don't actually move tablets which are later moved tablet by tablet. It sounds here you're going the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy?

We could easily drop the wait during replacing node boot and let it run in the background, but there is a problem with it. Admin currently expects that once replace is done, availability is restored. So he can proceed to do rolling restart safely, for example. If he did it before tablets are rebuilt, then some tablets could end up with two replicas down. So the wait is to preserve this expectation.

I'm suggesting that there won't be an operation called a "replace". There will be a "add a new node" (without data) and "remove a live node" (which sends all its data out, probably a lot of it to the relatively vacant new node). The administrator can wait for the "remove a live node" operation to finish if he wants. Why do we need an operation called "replace" - just because we have a nodetool command with that name?

Yes, we have "nodetool replace" which we want to support.

Note that there could be both vnode-based keyspaces and tablet-based keyspaces coexisting, and if one runs "replace" for the sake of vnodes, tablets need to be handled somehow too.

Note that while "replace" always replaces one node, with tablets we can add and remove multiple nodes together. It will be more powerful than what we can do with "nodetool replace".

Yes, but that's a longer term project.

tgrabiec avatar Feb 21 '24 12:02 tgrabiec

When the tablet scheduler sees that there is a left node in topology (loaded) which is no longer referenced by any tablet, it will emit a group0 update to mark those nodes are on-disk only. It doesn't have to be a node state, it can be another flag on the row. It will not be loaded to memory anymore.

Can we get the info from the tablet metadata instead? What bothers me is that we already have this info. Tablets know that they have replica on a certain node.

We have the info about host_id, but we also need topology info about that node (rack, dc, shard count). This info is not in tablet metadata, it's per-node.

I wonder how much of a problem it actually is. We keep the ids in memory now so it's O(left nodes) already.

This list is used for banning only which is done during state load. It can be dropped from memory easily without any change to on disk format. It is also possible to ban node while reading from the table and make the while process O(1). We should not make it mandatory to always have it in memory.

Ok.

It's more tricky. To know which nodes to load, you need to read tablet metadata.

You have it in memory, no?

No, this is at metadata load time from group0 state on disk. We cannot use stale metadata in memory, it may no longer be the latest.

But to build tablet metadata, you may want to have those nodes in topology already to store node*. We'd have to split the process into two steps.

So do you also want to have left nodes in token_metadata as well? Then making sure that only relevant nodes are loaded is even more important.

Only in locator::topology. They will not be considered members of token_metadata.

Also is it a good idea to have left nodes in token_metadata. We are breaking invariants for a very rare use case. May be @nyh https://github.com/nyh is right and we can simply refuse to replace node in such situation and require to do add/remove instead.

tgrabiec avatar Feb 21 '24 12:02 tgrabiec

We have the info about host_id, but we also need topology info about that node (rack, dc, shard count). This info is not in tablet metadata, it's per-node.

Yes, I understand that. What I am saying is that tablets should know which nodes they need topology info for. We can load only those nodes into a special map. We do not need a special persistent flag per node in the topology table for that.

No, this is at metadata load time from group0 state on disk. We cannot use stale metadata in memory, it may no longer be the latest.

What metadata can't we use? Why can't we check with the tablets whether they think that some node in the left state holds data for them?

Only in locator::topology. they will not be considered members of token_metadata.

Still I do not think we want to have all past and present nodes to be there forever.

BTW github works hard to keep the email integration as bad as humanly possible. It is not easy to reply to messages sent by email since they need to be reformatted manually.

gleb-cloudius avatar Feb 21 '24 13:02 gleb-cloudius

We have the info about host_id, but we also need topology info about that node (rack, dc, shard count). This info is not in tablet metadata, it's per-node.

Yes, I understand that. What I am saying is that tables should know what nodes they need topology info for. We can load only those nodes into a special map. We do not need special persistent flag per node in the topology table for that.

But note that this map/info is not only needed by tablet internals. Those hosts are returned by erm::get_replicas(), and are interpreted via locator::topology outside tablets, in generic code. So the solution needs to integrate with the whole system. That's why I think they should be loaded into locator::topology, since that's where the whole system gets the location info from.

If erm::get_replicas() returned vector<node*> then it would be even easier, since the caller wouldn't have to know where the node* is stored. But the way it is now, callers expect to lookup in locator::topology by host_id.

No, this is at metadata load time from group0 state on disk. We cannot use stale metadata in memory, it may no longer be the latest.

What metadata we cannot use?

Tablet metadata in memory.

Why can't we check with tables if they think that some node that is in the left state hold data for them?

The topology state load sequence is this:

  1. load raft topology and populate locator::topology
  2. load tablet metadata

(2) depends on (1) if we want to keep node* in it.

If we wanted to decide which nodes to load in step (1) based on tablet metadata loaded in step (2), there is a chicken-and-egg problem.

With the flag approach to indicate whether to load left nodes there is no issue.

Alternatively, we could do something like this:

  1. scan tablet metadata to compute ref = unordered_set<host_id>
  2. load raft topology and populate locator::topology, loading left nodes if they are in "ref"
  3. load tablet metadata
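A minimal sketch of this alternative order with hypothetical helper types; the point is only the ordering (scan the on-disk tablet state first, then populate topology including the referenced left nodes, then build tablet metadata):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

using host_id = unsigned long long;

struct node_row { host_id host; std::string state; std::string dc; std::string rack; };
struct topology_map { std::vector<node_row> nodes; };
struct tablet_metadata { std::vector<std::vector<host_id>> replica_sets; };

// Step 1: cheap scan of the on-disk tablet state for every referenced host.
std::unordered_set<host_id> scan_referenced_hosts(const tablet_metadata& on_disk) {
    std::unordered_set<host_id> ref;
    for (const auto& rs : on_disk.replica_sets) {
        ref.insert(rs.begin(), rs.end());
    }
    return ref;
}

// Step 2: populate locator::topology-like state, loading "left" nodes only if
// some tablet still references them. Step 3 (building tablet metadata with
// node pointers) can then resolve every replica it finds.
topology_map load_topology(const std::vector<node_row>& system_topology,
                           const std::unordered_set<host_id>& referenced) {
    topology_map t;
    for (const auto& row : system_topology) {
        if (row.state != "left" || referenced.count(row.host)) {
            t.nodes.push_back(row);
        }
    }
    return t;
}
```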

Only in locator::topology. they will not be considered members of token_metadata.

Still I do not think we want to have all past and present nodes to be there forever.

Agree, we can do GC when tablets no longer reference the node. That can be done wherever we keep them.

tgrabiec avatar Feb 21 '24 22:02 tgrabiec

Can you please write replies using the web? The integration simply does not work. I am all for abandoning the crappy github PR interface and moving to the mailing list. So if you like emails as much as I do, let's complain to the decision makers (not that they care).

Yes, I understand that. What I am saying is that tables should know what nodes they need topology info for. We can load only those nodes into a special map. We do not need special persistent flag per node in the topology table for that.

But note that this map/info is not only needed by tablet internals.

We will update topology from this special map. But this special map will contain only nodes that tablets really need, not all of them.

The topology state load sequence is this:

  1. load raft topology and populate locator::topology
  2. load tablet metadata

If we wanted to decide which nodes to load in step (1) based on tablet metadata loaded in step (2), there is a chicken-and-egg problem

The in-memory tablet metadata cannot be older than the previous version since, as you say, we load them one after another, and I claim that when a replaced node moves to the left state, the previous in-memory tablet metadata is good enough to figure out which left nodes need to be loaded. The worst case that may happen is that we load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is not loading something that should be loaded, and I do not see how that can happen.

Alternatively, we could do something like this:

  1. scan tablet metadata to compute ref = unordered_set<host_id>
  2. load raft topology and populate locator::topology, loading left nodes if they are in "ref"
  3. load tablet metadata

Yes, but I think for (1) the previous version of the tablet metadata in memory is good enough.

gleb-cloudius avatar Feb 22 '24 09:02 gleb-cloudius

Yes, I understand that. What I am saying is that tables should know what nodes they need topology info for. We can load only those nodes into a special map. We do not need special persistent flag per node in the topology table for that.

But note that this map/info is not only needed by tablet internals.

We will update topology from this special map. But this special map will contain only nodes that tablets really need, not all of them.

The topology state load sequence is this:

  1. load raft topology and populate locator::topology
  2. load tablet metadata

If we wanted to decide which nodes to load in step (1) based on tablet metadata loaded in step (2), there is a chicken-and-egg problem

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

tgrabiec avatar Feb 22 '24 09:02 tgrabiec

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not to load any of the left nodes in this case anyway? Until the node boots up to the normal state, it will reload the state many times.

gleb-cloudius avatar Feb 22 '24 09:02 gleb-cloudius

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

tgrabiec avatar Feb 22 '24 10:02 tgrabiec

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

If it allows us to simplify the code, we should reason about it and decide whether it is safe or not.

gleb-cloudius avatar Feb 22 '24 10:02 gleb-cloudius

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

If it allows us to simplify the code we should reason about it and decided whether it is safe or not.

How do you prove that, and then maintain this assumption? It requires global program analysis, since you don't really know if there will be a later event which will cause another reload before the metadata is accessed.

Another reason is that it precludes storing node* in token metadata, which needs the left nodes in locator::topology in the same reload.

tgrabiec avatar Feb 22 '24 10:02 tgrabiec

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

If it allows us to simplify the code we should reason about it and decided whether it is safe or not.

How do you prove that, and then maintain this assumption? It requires global program analysis, since you don't really if there will be a later event which will cause another reload before metadata is accessed.

We do not formally prove any of our code, so why suddenly ask for it now? But we can reason about this corner case. A node that joins and gets the first state snapshot will not initially load nodes that are left but still needed by some tablets into the topology. But what is this information used for? To calculate destination replicas for requests? This node will not start serving queries until some more state reloads. By the tablet load balancer? This node will not run it before reloading the state many more times, because it is still a non-voter at this point. Something else?

We can also add code that notices that the tablet metadata was missing from memory and runs the load one more time. But only if it is absolutely needed.

Another reason is that it precludes storing node* in token metadata, which needs the left nodes in locator::topology in the same reload.

I do not understand this. The normal state is that no left nodes are loaded. But sometimes tablets say they need some left nodes to be loaded, so we load them. But the normal state is no left nodes, so how does this preclude anything?

gleb-cloudius avatar Feb 22 '24 10:02 gleb-cloudius

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

If it allows us to simplify the code we should reason about it and decided whether it is safe or not.

How do you prove that, and then maintain this assumption? It requires global program analysis, since you don't really if there will be a later event which will cause another reload before metadata is accessed.

We do not formally prove any of our code formally, so why suddenly ask for it now?

I'm not asking for a formal proof, just a correct program. I don't know if a program is correct if it relies on assumptions which are hard to verify and hard to ensure that they hold in the future. Suppose we later accidentally optimize joining so that there are fewer reloads, or add access to tablet metadata which happens before the reload. The system will break. And all this difficulty and fragility to avoid adding a flag? We should code with invariants, and in this case the invariant is that all replicas returned by erm have an entry in locator::topology, at all times.

I see two solutions for conditionally loading left nodes:

  1. tablet scheduler decides they are no longer needed and sets a flag in system.topology which inhibits loading
  2. we decide this during group0 state load, based on system.tablets in storage

The advantage of (1) is that it doesn't require reading tablet metadata from storage. The advantage of (2) is that the computation is local and no extra state is needed, but we have to scan all tablets on every reload, which precludes optimization for partial reload.

But we can reason about this corner case. So a node that joins and gets first state snapshot will not initially load nodes that are left but still needed by some tablets into the topology. But what this information is used for? To calculate dst replicate for requests? This node will not start serving queries until some more state reloads. By tablet load balancer? This node will not run it before reloading state many more times because it is even non voter at this point. Something else?

We can also add the code that notice that token metadata was missing from memory and run load one more time. But only if it is absolutely needed.

Another reason is that it precludes storing node* in token metadata, which needs the left nodes in locator::topology in the same reload.

I do not understand this. Normal state is that no left nodes are loaded. But sometimes tablets say they need some left nodes to be loaded, so we load them. But the normal state is no left nodes, so how this precludes anything?

If we want to store node* in replica sets in tablet metadata, we need left nodes in locator::topology even on the first load, since they must be in locator::topology to resolve them into node*. They can't be resolved later on-demand, since access to replica set should be fast.

tgrabiec avatar Feb 24 '24 00:02 tgrabiec

The in memory tabled metadata cannot be older then the previous version since as you say we load them on after another and I claim that when a replaced node moves to the left state the previous in memory token metadata is good enough to figure out what left nodes needs to be loaded. The worst case that may happen that we will load a node that is no longer needed by the next version of the topology metadata, but who cares? The only case we should worry about is to not load something that should be loaded and I do not see how it can happen.

What if there is no previous tablet metadata because this is a fresh node?

Shouldn't it be safe not load any of the left nodes in this case anyway? Until the node boots up to normal state it will reload the state many time.

Maybe, or maybe not. I think we shouldn't assume this during loading.

If it allows us to simplify the code we should reason about it and decided whether it is safe or not.

How do you prove that, and then maintain this assumption? It requires global program analysis, since you don't really if there will be a later event which will cause another reload before metadata is accessed.

We do not formally prove any of our code formally, so why suddenly ask for it now?

I'm not asking for a formal prove, just a correct program. I don't know if a program is correct if it relies on assumptions which are hard to verify and hard to ensure that they hold in the future.

Why is it hard to verify? Let's not base the current code on a future that nobody knows. We may add an assert that checks that if a node moves to normal, it already has the in-memory tablet metadata.

Suppose we later accidentally optimize joining so that there are fewer reloads, or add access to tablet metadata which happens before reload. The system will break. And all this difficulty and fragility to avoid adding a flag? We should code with invariants, and in this case the invariant is that all replicas returned by erm have an entry in locator::topology, at all time.

Fewer reloads do not matter. Only no reloads matter, and I doubt we would optimize it like that, but the assert above will catch even that. If you change the code to access something earlier, you need to make sure it is initialized. That cannot be an argument to add state that is not needed. I am against the persistent flag because it duplicates state that is already present in the system, but requires separate maintenance and making sure it does not go out of sync now, not in some theoretical dystopian future. If we need to change some order in the state loading code because the info about what needs to be loaded is now distributed between the topology table and the tablets table, then that is what needs to be done, not use persistent tables to pass transient information between subsystems.

I see two solutions for conditionally loading left nodes:

1. tablet scheduler decides they are no longer needed and sets a flag in system.topology which inhibits loading

2. we decide this during group0 state load, based on system.tablets in storage

The advantage of (1) is that it doesn't require reading tablet metadata from storage. The advantage of (2) is that the computation is local and no extra state is needed, but we have to scan all tablets on every reload, which precludes optimization for partial reload.

  3. system.topology load code remains as it is, but the tablet loading code loads nodes that are left but still needed by tablets from system.topology. This is a simple and efficient query (see the sketch below).
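A hedged sketch of that option: the tablet-metadata build path fills in any missing left node on demand; `node_row` and `fetch_left_node_row` are stand-ins for the real row type and a query against system.topology, not existing code:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using host_id = unsigned long long;

struct node_row { std::string dc; std::string rack; unsigned shard_count; };

// In-memory topology keyed by host id (left nodes normally absent).
using topology_map = std::unordered_map<host_id, node_row>;

// Stand-in for a query like "fetch dc, rack, shard_count for this host from system.topology".
using fetch_left_node_row = std::function<std::optional<node_row>(host_id)>;

// While building tablet metadata, resolve every replica; if the host is not in
// the in-memory topology (it is in the "left" state but still referenced by a
// tablet), load its row directly from system.topology and add it.
void resolve_replicas(topology_map& topo,
                      const std::vector<std::vector<host_id>>& replica_sets,
                      const fetch_left_node_row& fetch) {
    for (const auto& rs : replica_sets) {
        for (host_id h : rs) {
            if (!topo.count(h)) {
                if (auto row = fetch(h)) {
                    topo.emplace(h, *row);
                }
            }
        }
    }
}
```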

But we can reason about this corner case. So a node that joins and gets first state snapshot will not initially load nodes that are left but still needed by some tablets into the topology. But what this information is used for? To calculate dst replicate for requests? This node will not start serving queries until some more state reloads. By tablet load balancer? This node will not run it before reloading state many more times because it is even non voter at this point. Something else? We can also add the code that notice that token metadata was missing from memory and run load one more time. But only if it is absolutely needed.

Another reason is that it precludes storing node* in token metadata, which needs the left nodes in locator::topology in the same reload.

I do not understand this. Normal state is that no left nodes are loaded. But sometimes tablets say they need some left nodes to be loaded, so we load them. But the normal state is no left nodes, so how this precludes anything?

If we want to store node* in replica sets in tablet metadata, we need left nodes in locator::topology even on the first load, since they must be in locator::topology to resolve them into node*. They can't be resolved later on-demand, since access to replica set should be fast.

If we have interdependencies between topology load and tablet load (topology needs tablets to know what to load, and tablets need topology to build metadata), one way to resolve them is to split tablet state loading and metadata building into two stages. Then we can load the tablets data, load topology based on that data, and build token metadata for tablets from the tablets data + topology data. But maybe it is enough to just load left nodes from the tablets code while building tablet metadata. There is nothing wrong with accessing topology tables from tablets code directly.

gleb-cloudius avatar Feb 25 '24 09:02 gleb-cloudius

On Tue, Feb 20, 2024 at 11:31 PM nyh @.***> wrote: In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial, and don't actually move tablets which are later moved tablet by tablet. It sounds here you're going the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy? We could easily drop the wait during replacing node boot and let it run in the background, but there is a problem with it. Admin currently expects that once replace is done, availability is restored. So he can proceed to do rolling restart safely, for example. If he did it before tablets are rebuilt, then some tablets could end up with two replicas down. So the wait is to preserve this expectation. I'm suggesting that there won't be an operation called a "replace". There will be a "add a new node" (without data) and "remove a live node" (which sends all its data out, probably a lot of it to the relatively vacant new node). The administrator can wait for the "remove a live node" operation to finish if he wants. Why do we need an operation called "replace" - just because we have a nodetool command with that name? Yes, we have "nodetool replace" which we want to support. Note that There could be both vnode-based keyspaces and tablet-based keyspaces coexisting, and if one runs "replace" for the sake of vnodes, tablets need to be handled somehow too.

What nodetool replace? Replace is currently done using the replace_node_first_boot config option, plus an optional ignore_dead_nodes_for_replace option.

If we add support to ignore dead nodes on bootstrap, we could just bootstrap the replacing node given the option to ignore the replaced node (and possibly others if there are multiple dead nodes), and then simply run nodetool removenode to start the removal of the replaced node(s). The tablet scheduler should pick the newly added node for tablet rebuild as it will be emptier, and in this issue's case, it would also be the only valid option for placement, if there's only one dead node to replace in the same dc/rack.

Note that while "replace" always replaces one node, with tablets we can add and remove multiple nodes together. It will be more powerful than what we can do with "nodetool replace". Yes, but that's a longer term project.

bhalevy avatar Mar 03 '24 08:03 bhalevy

On Tue, Feb 20, 2024 at 11:31 PM nyh @.***> wrote: In general I thought one of the guiding principles of the new tablet technique is that topology changes are trivial, and don't actually move tablets which are later moved tablet by tablet. It sounds here you're going the opposite direction, with some node not even finishing its boot before all the tablets are moved. Isn't this going back to the vnode philosophy? We could easily drop the wait during replacing node boot and let it run in the background, but there is a problem with it. Admin currently expects that once replace is done, availability is restored. So he can proceed to do rolling restart safely, for example. If he did it before tablets are rebuilt, then some tablets could end up with two replicas down. So the wait is to preserve this expectation. I'm suggesting that there won't be an operation called a "replace". There will be a "add a new node" (without data) and "remove a live node" (which sends all its data out, probably a lot of it to the relatively vacant new node). The administrator can wait for the "remove a live node" operation to finish if he wants. Why do we need an operation called "replace" - just because we have a nodetool command with that name? Yes, we have "nodetool replace" which we want to support. Note that There could be both vnode-based keyspaces and tablet-based keyspaces coexisting, and if one runs "replace" for the sake of vnodes, tablets need to be handled somehow too.

What nodetool replace? replace is currently done using the replace_node_first_boot config option, plus and optional ignore_dead_nodes_for_replace option.

Yes.

If we add support to ignore dead nodes on bootstrap, we could just bootstrap the replacing node given the option to ignore the replaced node (and possibly others if there are multiple dead nodes), and then simply run nodetool removenode to start the removal of the replaced node(s). The tablet scheduler should pick the newly added node for tablet rebuild as it will be emptier, and in this issue's case, it would also be the only valid option for placement, if there's only one dead node to replace in the same dc/rack.

If we didn't have vnode-based topology, yes. But with vnodes, to support replace-with-the-same-ip, such a bootstrap is inseparable from removing the node; it is a replace operation. If we want to support such an operation, we need to handle it with tablets somehow too.

tgrabiec avatar Mar 04 '24 13:03 tgrabiec