
Online DDL: explicit target shards via ddl_strategy

Open shlomi-noach opened this issue 3 years ago • 23 comments

Description

This PR extends #7785

Only review once #7785 is merged

Run online DDL on explicit shards.

ddl_strategy now supports --shards flag. The following are valid:

  • set @@ddl_strategy='online -shards=40-80' - specifying a single shard 40-80
  • set @@ddl_strategy='gh-ost -shards=-40,40-80' - specifying two shards -40 and 40-80
  • set @@ddl_strategy='gh-ost -shards="-40,40-80"' - quotes supported
  • set @@ddl_strategy='gh-ost --shards="-40,40-80"' - two dashes also supported

The same syntax applies to vtctl ApplySchema -ddl_strategy "..."
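
For illustration, here's a minimal sketch of the intended flow from a vtgate session (the table, column, and keyspace names below are hypothetical):

  -- run this migration only on shards -40 and 40-80
  set @@ddl_strategy='online -shards=-40,40-80';
  alter table customer add column notes varchar(255);

  -- roughly the same via vtctl (keyspace name is hypothetical):
  -- vtctl ApplySchema -ddl_strategy "online -shards=-40,40-80" -sql "alter table customer add column notes varchar(255)" commerce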

Related Issue(s)

  • #6926

Checklist

  • [ ] Should this PR be backported?
  • [x] Tests were added or are not required
  • [ ] Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • [x] Query Serving
  • [ ] VReplication
  • [ ] Cluster Management
  • [ ] Build/CI
  • [ ] VTAdmin

shlomi-noach avatar Apr 12 '21 07:04 shlomi-noach

What is the use case for explicit shard-targeting? Apologies if it is already documented in the tracking issue. I scanned it but didn't see a relevant section.

deepthi avatar Apr 12 '21 23:04 deepthi

What is the use case for explicit shard-targeting?

You're right, this is not explained in writing. The use case is that #7785 begins the move away from topo. The big advantage of using topo was that when the user submits an online DDL, a request is persisted in topo; from there on, vtctld makes sure to send the request to all relevant shards, and keeps doing so until it is satisfied that all shards have received it.

Moving away from topo means that when the user submits an Online DDL request, vtgate or vtctl just sends it directly to the shards/tablets, much like a normal DDL request. If any of those tablets are down, they miss the request, and there's no mechanism to retry and ensure they eventually get it.

And so it's important to later be able to target specific shards (one or more). This does put the burden on the user, but at least it paves a way to ensure all tablets receive the online DDL.

I should note that the problem is almost moot if the user uses the -declarative approach, since with -declarative the user is free to resubmit the same online DDL again and again on all shards anyhow. But -declarative is just an optional approach.

shlomi-noach avatar Apr 13 '21 04:04 shlomi-noach

ready for review, now that #7785 is merged

shlomi-noach avatar Apr 22 '21 13:04 shlomi-noach

@shlomi-noach Would a feature like the following make sense? I'm not sure if you've already considered/implemented this or not, but based on the existence of this PR, I'm guessing it isn't something that exists yet.

Suppose you submit a migration to a (set of) table(s) in a keyspace. The migration is scheduled successfully on some shards, but unsuccessfully on others. The user (presumably) receives an error from VTGate. The user re-submits the exact same request.

During handling of the second request, VTGate determines that, within this keyspace, some of the shards have this query string already, and it is the most recently scheduled migration for that (set of) table(s), and they have the same migration UUID on all of those shards. The migration doesn't need to be re-applied on those shards, and the previous UUID can be returned at the end of the request. On the rest of the shards, the migration string is missing. In this case, the UUID can be re-used, and the migration scheduled on those shards.

This can be used for:

  • Making non-declarative migrations somewhat idempotent
  • Making -singleton -declarative migrations more idempotent (this is where I'm making an assumption - I'm assuming they aren't right now, in the edge case where the schema migration gets partially-applied)
  • Automating what a user would need to do in order to make use of -shards anyway

I assume the tricky things here would be

  • Additional pre-work the VTGate would have to do before submitting the migrations to the VTTablets, in the form of doing an all-shards query to find out if the migration exists already
  • Some potentially-tricky logic to make sure that it is actually safe to re-use the UUID (this isn't safe if the migration isn't the latest migration on those tables)
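
A rough sketch of what that all-shards pre-check might look like, assuming the per-shard migrations are tracked in a table like _vt.schema_migrations (the column names and the statement text here are assumptions):

  -- per shard: has this exact migration statement already been submitted, and with what UUID/status?
  select migration_uuid, migration_status
    from _vt.schema_migrations
    where migration_statement = 'alter table customer add column notes varchar(255)'
    order by id desc
    limit 1;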

========

Alternatively, I'm wondering if you can make use of Vitess's two-phase commit when scheduling migrations on the _vt database of the shards, even if two-phase commit isn't enabled for the main user database.

If not, could there maybe be a simpler, domain-specific implementation of two-phase commit or three-phase commit specifically for schema migration scheduling? If a user re-submits a schema migration query string, and that query string is found to be in the prepared-to-commit state on some of the shards, then the UUID is reused and the schema migration scheduling continues where it left off.

jmoldow avatar Apr 25 '21 20:04 jmoldow

I'm guessing it isn't something that exists yet.

Correct :)

The user (presumably) receives an error from VTGate.

They do.

During handling of the second request, VTGate determines that, within this keyspace, some of the shards have this query string already, and it is the most recently scheduled migration for that (set of) table(s), and they have the same migration UUID on all of those shards. The migration doesn't need to be re-applied on those shards, and the previous UUID can be returned at the end of the request. On the rest of the shards, the migration string is missing. In this case, the UUID can be re-used, and the migration scheduled on those shards. Making non-declarative migrations somewhat idempotent

This makes sense and sounds good. I think there could be edge cases. To break it down a bit:

  • If the user runs an Online DDL migration
  • VTGate determines the affected table
  • VTGate determines the target shards
  • VTGate checks on all shards: what is the most recent migration you have for that table
  • If all shards indicate some other migration, then the user's migration is new and gets submitted
  • If all shards respond, all saying "this is the exact same migration we last ran on this table", then probably there's no need to run this again at all? (what if it's an alter table t engine=innodb? does it make sense to run again?)
  • If all shards respond, some saying "yes this is the last one we ran" and others saying "we last ran a different migration", then, is it safe to say some shards never got that migration?
    • Could there be migrations running out of order across shards? Yes, if on a shard some migration failed, another could proceed.
    • So let's get more strict and have VTGate double check: "OK give me the UUID you ran for that particular query, good, now let me see if those other shards have at all executed it"
    • In which case probably the correct thing to do is to submit the user's migration only on the shards that never got the migration. This is the essence of your suggestion.
  • Now, what if some shards do not respond? I think this is the biggest problem. Did they not respond and never ran that migration in the first place? Did they not respond and did run the migration before? Are those exactly the same shards not responding as those who didn't run the migration in the first place?
    • I'm trying to think if this is actually a simple question, or do we go down the rabbit hole of uncertainties.

Making -singleton -declarative migrations more idempotent (this is where I'm making an assumption - I'm assuming they aren't right now, in the edge case where the schema migration gets partially-applied)

If a migration is -declarative, then it is idempotent, whether -singleton or not. It's not automated to eventually run -- but if the user submits it again, then it is idempotent.

Alternatively, I'm wondering if you can make use of Vitess's two-phase commit when scheduling migrations on the _vt database of the shards, even if two-phase commit isn't enabled for the main user database. If not, could there maybe be a simpler, domain-specific implementation of two-phase commit or three-phase commit specifically for schema migration scheduling? If a user re-submits a schema migration query string, and that query string is found to be in the prepared-to-commit state on some of the shards, then the UUID is reused and the schema migration scheduling continues where it left off.

Good ideas; I can look into that. See the following.

===

We did actually discuss (internally) last week (and BTW we wish to open these discussions to the greater scope of maintainers for visibility) the idea of a cross-shard coordinated DDL. It's a 2PC, basically, that will not only ensure all shards get the same migration, but may also coordinate the cut-over timing of a long-running ALTER TABLE. So this idea results in shards being in sync (up to a few seconds) for all schema migrations.

Happy to hear your thoughts.

shlomi-noach avatar Apr 26 '21 05:04 shlomi-noach

If a migration is -declarative, then it is idempotent, whether -singleton or not. It's not automated to eventually run -- but if the user submits it again, then it is idempotent.

I just realized that I poorly worded my comment before about -singleton -declarative. So I'm not certain if this answers what I actually meant to ask. Let me try again.

Suppose the user submits a -singleton -declarative migration. It is scheduled on some shards, but not on others. VTGate returns an error.

Suppose the user immediately resubmits the migration. I assume that, since some of the shards already have a migration scheduled (this one), VTGate will respond with another error. (Before returning the error, VTGate may or may not have already scheduled the migration on the shards that didn't already have it - depends on the implementation details, which I don't know.)

Yes, -singleton -declarative is idempotent, but (if my assumption in the previous paragraph is correct), it is not immediately idempotent, because it will return an error rather than a UUID. Only eventually will you be able to resubmit the migration and get a successful no-op back.

With something like what I suggested (or perhaps alternative designs), the second request, submitted immediately after the first failed request, can succeed, return the UUID, and ensure the migration is running on all shards.

jmoldow avatar Apr 26 '21 05:04 jmoldow

Suppose the user immediately resubmits the migration.

If some of the shards are still down, then again there'll be an error, just like in the first time.

But if all shards now happen to be up?

Then, given -declarative, this is actually a new migration, with a new UUID, which will be automatically a no-op on the "successful shards".

Note, you can't run an ALTER TABLE in a -declarative DDL. You may only CREATE TABLE or DROP TABLE. So the successful shards see two identical CREATE TABLE statements, one after the other. The first, they apply; the second, they realize there's "nothing to do". The failed shards only see the 2nd one, apply it, and their schema converges with the schema on the successful shards.
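
To make that flow concrete, a small sketch (table and column names are hypothetical):

  set @@ddl_strategy='online -declarative';
  -- first submission: applied by the healthy shards, missed by the shard that was down
  create table customer (id bigint primary key, notes varchar(255));
  -- second, identical submission: the healthy shards diff it as a no-op,
  -- the previously-failed shard applies it, and the schemas converge
  create table customer (id bigint primary key, notes varchar(255));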

shlomi-noach avatar Apr 26 '21 05:04 shlomi-noach

The first, they apply, the second, they realize "nothing to do".

👍

I take it that, even though you said "apply", this would also behave the same way even if the first CREATE TABLE migration is scheduled, but not yet applied? If there is an unapplied migration scheduled (or it is in-progress), the skeema tengo library will run with that scheduled CREATE TABLE as the source of the diff, rather than the CREATE TABLE schema of the live-right-now table?

jmoldow avatar Apr 26 '21 06:04 jmoldow

I take it that, even though you said "apply", this would also behave the same way even if the first CREATE TABLE migration is scheduled, but not yet applied? If there is an unapplied migration scheduled (or it is in-progress), the skeema tengo library will run with that scheduled CREATE TABLE as the source of the diff, rather than the CREATE TABLE schema of the live-right-now table?

If it's not a -singleton migration, then as you suggest, the migration may be queued (because perhaps a long running migration is still in progress).

tengo will only come into play when the migration is scheduled and executed (i.e. when previously long-running migrations are complete). It will compare the user's requested CREATE TABLE schema with the existing schema at the time of execution.

shlomi-noach avatar Apr 26 '21 06:04 shlomi-noach

During handling of the second request, VTGate determines that, within this keyspace, some of the shards have this query string already, and it is the most recently scheduled migration for that (set of) table(s), and they have the same migration UUID on all of those shards. The migration doesn't need to be re-applied on those shards, and the previous UUID can be returned at the end of the request. On the rest of the shards, the migration string is missing. In this case, the UUID can be re-used, and the migration scheduled on those shards.

I'm thinking this can/should be decided by tablets themselves. i.e., the user runs a first migration, it applies on some shards, other shards are broken. The user then resubmits an identical migration. Vitess-wise, it's a new migration. It has a new UUID. It gets sent to all shards. But then, each shard, independently, looks at the table affected, and checks to see whether the last successful migration to run on that specific table is exactly the same as the incoming migration (identical DDL text, string comparison). If so, then the tablet can say "well I've just completed that exact migration, no need for me to re-run it", and marks it as implicitly complete.

I think that approach is mostly safe. Some things to consider:

  • what if the user intentionally/accidentally directly tampered with the schema on one of the backend MySQL servers (e.g. the user dropped a table directly) and now wants to reapply a DDL (e.g. a CREATE TABLE)? Will Vitess just keep refusing to apply the migration saying "well according to my records this is already done"?
  • Should this be opt-in? Add a flag such as -ignore-duplicate-ddl or something?
  • Should this be opt-out? Add a flag such as -reapply-duplicate-ddl or something?
  • what if the user wants to periodically rebuild the table via alter table my_table engine=innodb? Should we exclude this form of query?
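
A sketch of the tablet-local check described above, again assuming a schema-migrations tracking table (table and column names are assumptions):

  -- on each tablet: fetch the last successful migration for the affected table
  select migration_statement
    from _vt.schema_migrations
    where mysql_table = 'customer' and migration_status = 'complete'
    order by completed_timestamp desc
    limit 1;
  -- if the returned statement is byte-for-byte identical to the incoming DDL,
  -- the tablet can mark the new migration as implicitly complete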

shlomi-noach avatar May 03 '21 06:05 shlomi-noach

In the case you describe (second migration is requested on the tablet, after the first identical migration has already succeeded on the tablet), we should probably always re-queue the second migration. As you said, not re-queueing in this case could cause issues.

In the case you describe, if the migration is -declarative, then in the typical case the migration will complete trivially after being queued, so there's no advantage to skipping the queue. If the migration is non--declarative, it might not be straightforward or possible to be exactly certain of when it is safe to drop the duplicate, so we should probably not attempt to be idempotent in this case (so ignoring my suggestion of making non-declarative migrations somewhat idempotent), and just do exactly what the user asked, which is to queue the migration again (and accept any errors that might be returned to the user, since most likely the repeat migration won't apply successfully). All the more reason for users to use -declarative.

====

Where I think there is still room for safe improvements, is this scenario:

  • Migration is requested
  • Some tablets do not queue the migration successfully, while others do
  • VTGate returns an error to the user
  • User requests the migration again
  • On at least one tablet, the original migration is still enqueued or in-flight (not succeeded/canceled/failed yet) when the new (identical) migration request arrives, and the original migration is also the most recent migration in the queue.

And I hypothesize that we can improve this scenario

  • When it is a non-declarative migration
  • When it is a -singleton -declarative migration

When it is a non-declarative migration, it is highly likely to not be idempotent, and therefore the second (identical) migration is likely to fail when applied on top of the successful first migration.

When it is a -singleton -declarative, we know for sure it is idempotent, and that the first migration is doing exactly the work that the second migration requested, making the second migration request redundant and trivial. If this were merely a -declarative migration, this case would be fine. But with -singleton -declarative, the second migration will be rejected (since the queue is already occupied on the tablet), which in my opinion is unnecessary, potentially a frustrating UX for a user who is trying to retry a partially-failed request, and makes the request technically not idempotent.

From what you've said, and from the implementation difficulties I pointed out, I think my idea of trying to intelligently re-use UUIDs should be rejected. However, I have a new idea now, that should also be able to handle the non-declarative and the -singleton -declarative cases.

====

Again, this is the scenario (it doesn't matter if the migration is -singleton or not, or whether it is -declarative or not):

  • Migration is requested
  • Some tablets do not queue the migration successfully, while others do
  • VTGate returns an error to the user
  • User requests the migration again
  • On at least one tablet, the original migration is still enqueued or in-flight (not succeeded/canceled/failed yet) when the new (identical) migration request arrives, and the original migration is also the most recent migration in the queue.

When the tablet detects that the migration is exactly identical to one that is already enqueued / in-flight, it allows the new one to be enqueued even if one or the other was a -singleton migration. This only applies if it is exactly identical. If it isn't, normal -singleton rules apply.

When a migration completes and is about to be marked as such, iterate through the next elements of the queue in order, until you reach the end or a non-identical migration. For each identical, adjacent migration, mark them all as successfully completed.

This only applies to identical, adjacent migrations. And only applies for successful completion (not failure, or cancellation-via-UUID).
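
A rough sketch of that completion-time sweep, with the adjacency walk kept in application code (table and column names are assumptions):

  -- application code walks the queue forward from the just-completed migration and
  -- collects the UUIDs of adjacent, textually identical migrations, then marks them:
  update _vt.schema_migrations
    set migration_status = 'complete'
    where migration_uuid in ('uuid-1', 'uuid-2');  -- the collected UUIDs (placeholders)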

This procedure would make all Online DDLs fully idempotent, up until the point that it completes on at least one tablet. For a long migration, this would make quick retries idempotent and a feasible way of dealing with a partially-queued migration.

Also, this scheme means that the second migration UUID will not be marked as fully complete until the first migration UUID is completed/cancelled/failed. So watching the second UUID is sufficient, and it doesn't matter that the user might not know the first UUID (because that VTGate request failed with an error), and doesn't matter that the UUIDs weren't merged together. The user doesn't even have to know that there was a first UUID. They can pretend that the first request fully failed, even though that might not have been the case. (Assuming their response to the failure is to retry. If they want to revert back to the original state after the failed migration request, then they should assume nothing.)

jmoldow avatar May 09 '21 06:05 jmoldow

But with -singleton -declarative, the second migration will be rejected (since the queue is already occupied on the tablet), which in my opinion is unnecessary, potentially a frustrating UX for a user who is trying to retry a partially-failed request, and makes the request technically not idempotent.

That's a very good point.

For each identical, adjacent migration, mark them all as successfully completed.

I like the idea

This procedure would make all Online DDLs fully idempotent, up until the point that it completes on at least one tablet. For a long migration, this would make quick retries idempotent and a feasible way of dealing with a partially-queued migration.

The issue with UX here is that this is going to be inconsistent. Quick migrations behave differently than long running migrations. If the migration was fast enough to complete on a single shard, then (in non -declarative) what is the path forward? What can the user do? And, how should Vitess communicate the situation to the user?

shlomi-noach avatar May 09 '21 07:05 shlomi-noach

The issue with UX here is that this is going to be inconsistent. Quick migrations behave differently than long running migrations. If the migration was fast enough to complete on a single shard, then (in non -declarative) what is the path forward? What can the user do? And, how should Vitess communicate the situation to the user?

I don't have any perfect ideas, but here's some things that could be done to improve that UX.

For all of these, let's assume that we only trigger these special behaviors if an identical migration has run to successful completion on tablet T "recently" (possibly, but not necessarily, the last migration that ran).

Also, let's assume that we will run the duplicate migration (except in the first bullet immediately below). If it succeeds, then great! Only if it fails, do we try to take one of these new actions.

  • (During the queuing by VTGate) If the identical migration is still in-flight on some other shard in the keyspace, and is the most recent migration on that shard, then synchronously mark the new migration on T as succeeded, without queuing it or executing it.

  • If the migration is a CREATE TABLE migration, and it fails because the table already exists on tablet T, then ignore the error and mark the migration as succeeded. Do something similar for an add index / add column migration.

  • If the migration is a DROP TABLE or RENAME TABLE migration, and it fails because the table with that name doesn't exist anymore on tablet T (and, in the RENAME TABLE case, a table with the new name does exist), then ignore the error and mark the migration as succeeded. Do something similar for a drop/rename index/column migration.

  • If the identical migration is exactly the previous migration that ran on T, and the previous one completed successfully, and the current migration has failed a few retries, we can consider it high probability that it is safe to fold the migration into previous one, and mark it as succeeded. Since this is a heuristic and could be wrong, you could decide to tighten when this heuristic can be used. Perhaps only if the migration was very short, or if it ran very "recently", or an opt-in flag -ignore-duplicate-ddl is provided.

  • In all of these previous ideas, maybe instead of marking the new duplicate migration as succeeded, we can mark it with some new status, duplicate-skipped. If you just want to know, at a high level, whether the migration UUID succeeded or failed, it would be succeeded-with-some-duplicates-skipped (assuming the migration succeeded on all other shards).

  • As a catchall, if we don't try to be smart, and instead we keep it as a failure, if the identical migration has succeeded "recently", we can tell the user that in the error message.

jmoldow avatar May 09 '21 17:05 jmoldow

I feel like this is exploding/branching too far with if-else scenarios, so I'm going to attempt to converge a bit:

  • if it's a create table or drop table, all the user really needs to do is run the 2nd migration with -declarative
  • rename table is not supported by Online DDL

Which leaves us with ALTER TABLE only.

Do something similar for an add index / add column migration.

This kind of analysis is risky. It means Vitess needs to interpret the ALTER statement, understand what it implies, and then consult with the database to see what exists. Why is it risky? Because maybe the user adds column c and the database has column c, but their definitions do not match. Or maybe the definitions only match implicitly (the add column c statement does not indicate a DEFAULT, but the database does indicate one). There was an earlier, unrelated opportunity where we discussed, internally, that line of action and decided against it.

Otherwise, the amount of heuristics keeps piling up.

I think the user will know better than us. If we provide a flag for the user (like you suggest, "an opt-in flag -ignore-duplicate-ddl") then that's best. We will send the DDL to all tablets. Those without identical query will attempt to run the migration. Those with identical query will do something along the lines you illustrated:

When a migration completes and is about to be marked as such, iterate through the next elements of the queue in order, until you reach the end or a non-identical migration. For each identical, adjacent migration, mark them all as successfully completed.

Or even formalize that while the tablet submits the new migration (it will associate it with the pending migration, and no need to traverse adjacent migrations).

I want to reduce overall complexity. If the user doesn't want to use -declarative, I'd rather give the user the ownership to say "yes, that's fine".

shlomi-noach avatar May 10 '21 09:05 shlomi-noach

I feel like this is exploding/branching too far with if-else scenarios [....] I want to reduce overall complexity.

Sure, that makes sense to me, as far as the "handling duplicates after the first migration has already completed" scenario is concerned. I was just listing various ideas that could potentially be explored if you found any of them worthwhile.

Or even formalize that while the tablet submits the new migration (it will associate it with the pending migration, and no need to traverse adjacent migrations).

What do you think that association might look like?

If the user doesn't want to use -declarative, I'd rather give the user the ownership to say "yes, that's fine".

Should Vitess attempt to give a more informative error message back to the user, such as something similar to what I said in my last two bullet points of https://github.com/vitessio/vitess/pull/7825#issuecomment-835850721 ?

jmoldow avatar May 14 '21 05:05 jmoldow

What do you think that association might look like?

As a fan of the relational model I'd like to have a column named identical_migration_uuid or similar, with which we can immediately find all previous or future submissions of identical migrations.
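
A minimal sketch of that association, assuming the existing schema-migrations tracking table (the new column and the UUID value are hypothetical):

  alter table _vt.schema_migrations add column identical_migration_uuid varchar(64);

  -- find every submission that duplicates a given migration
  select migration_uuid, migration_status
    from _vt.schema_migrations
    where identical_migration_uuid = '8e6b-example-uuid';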

Should Vitess attempt to give a more informative error message back to the user,

Yeah, that sounds good. IMO if either a former or later "identical" migration is complete (successful), we should mark the other one as complete, too. We could do so retroactively or going forward upon completion of any migration. Otherwise, if a migration fails, we can definitely add some error message in vitess that says "this looks to be identical to that other migration, do you want to try again with -ignore-duplicate-ddl" or similar.

Also, I am thoroughly enjoying this brainstorming ❤️

shlomi-noach avatar May 18 '21 09:05 shlomi-noach

would only run on -40 shard. Should we change this behaviour or add tests for it?

@GuptaManan100 it is tested here: https://github.com/vitessio/vitess/pull/7825/files#diff-ab7165420ce0f3d6fffdbba015747400b9621515ffec36c74adab722bc5f2defR168-R172

shlomi-noach avatar May 19 '21 05:05 shlomi-noach

I just converted to Draft because I want to keep thinking about the lengthy discussion above and all of the options/alternatives to this approach.

shlomi-noach avatar May 19 '21 05:05 shlomi-noach

@jmoldow can you please review https://github.com/vitessio/vitess/pull/8209, followup to the above discussion, and provide feedback?

shlomi-noach avatar May 31 '21 05:05 shlomi-noach

I'm still stalling on this PR because I'm still unsure we've found the best path to address the issue of mass deployments. I don't want to introduce a syntax that will turn into a "backwards compatibility" liability. Giving this more time for thought.

shlomi-noach avatar Jul 14 '21 08:07 shlomi-noach

@deepthi I'm keeping this one open because we still want to find a good solution, and the discussion here is priceless.

shlomi-noach avatar Aug 29 '21 06:08 shlomi-noach

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

github-actions[bot] avatar Jul 20 '22 01:07 github-actions[bot]

A new take: https://github.com/vitessio/vitess/pull/10915 keeps better consistency across shards.

shlomi-noach avatar Aug 03 '22 06:08 shlomi-noach

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

github-actions[bot] avatar Sep 03 '22 01:09 github-actions[bot]

This PR was closed because it has been stale for 7 days with no activity.

github-actions[bot] avatar Sep 10 '22 01:09 github-actions[bot]