Idea: Online DDL syntax to postpone execution of migrations, and per-shard execution

Status: Open · shlomi-noach opened this issue 1 year ago · 3 comments

We hit a use case where we wanted to run a migration on a multi-sharded keyspace, but first only on a single shard, to validate the results and so reduce the blast radius. Right now Vitess has no support for this, other than manually injecting entries into _vt.schema_migrations. At the time, we had this idea: https://github.com/vitessio/vitess/pull/7825, but I think it's the wrong approach, because it can lead to discrepancies between shards.

I'd like to suggest this new idea: consider the following command:

set @@ddl_strategy='online --postpone-execution';
alter table t add column i int;

We can introduce a --postpone-execution flag. With this flag, the migration is not auto-executed: it is submitted and queued, but not started. Contrast this with the existing --postpone-completion flag, where the migration starts but keeps running indefinitely, until explicitly told to complete.
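To make the difference between the two flags concrete, here is a minimal Python sketch of the proposed semantics. It assumes a hypothetical `parse_ddl_strategy` helper that splits a `@@ddl_strategy` value into a strategy name and a set of flags, and a hypothetical `scheduler_behavior` that maps those flags onto two yes/no decisions. It is an illustration of the idea only, not Vitess code (Vitess itself is written in Go and parses strategies differently).

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerBehavior:
    # Hypothetical: may the scheduler start the migration on its own?
    auto_start: bool
    # Hypothetical: may the scheduler complete (cut over) the migration on its own?
    auto_complete: bool


def parse_ddl_strategy(value: str) -> tuple[str, frozenset[str]]:
    """Split e.g. 'online --postpone-execution' into ('online', {'--postpone-execution'}).

    Hypothetical helper for illustration; it only handles bare flags.
    """
    tokens = shlex.split(value)
    if not tokens:
        raise ValueError("empty ddl_strategy")
    return tokens[0], frozenset(tokens[1:])


def scheduler_behavior(flags: frozenset[str]) -> SchedulerBehavior:
    # Proposed --postpone-execution: queued, but not started until told to.
    # Existing --postpone-completion: started, but not completed until told to.
    return SchedulerBehavior(
        auto_start="--postpone-execution" not in flags,
        auto_complete="--postpone-completion" not in flags,
    )


if __name__ == "__main__":
    _, flags = parse_ddl_strategy("online --postpone-execution")
    print(scheduler_behavior(flags))  # SchedulerBehavior(auto_start=False, auto_complete=True)
```

Modeling the two flags as independent booleans is what makes it possible to combine them: with both set, the migration neither starts nor completes on its own.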

Then, we add the following syntax and variations:

ALTER VITESS_MIGRATION '<uuid>' EXECUTE;
ALTER VITESS_MIGRATION '<uuid>' EXECUTE SHARD '<shard-name>';
ALTER VITESS_MIGRATION EXECUTE ALL;

This mirrors ALTER VITESS_MIGRATION COMPLETE|CANCEL, with the addition of a per-shard instruction (which we can later add to COMPLETE|CANCEL as well).
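The three proposed forms differ only in scope: one migration on all shards, one migration on one shard, or all postponed migrations everywhere. A minimal Python sketch of that grammar using regular expressions and a hypothetical `ExecuteCommand` record; it only illustrates the proposed syntax and is not how Vitess's own (Go-based) SQL parser would implement it.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecuteCommand:
    uuid: Optional[str]   # None means "all postponed migrations"
    shard: Optional[str]  # None means "every shard"


# ALTER VITESS_MIGRATION '<uuid>' EXECUTE [SHARD '<shard-name>']
_EXECUTE_ONE = re.compile(
    r"^ALTER\s+VITESS_MIGRATION\s+'(?P<uuid>[^']+)'\s+EXECUTE"
    r"(?:\s+SHARD\s+'(?P<shard>[^']+)')?\s*;?\s*$",
    re.IGNORECASE,
)
# ALTER VITESS_MIGRATION EXECUTE ALL
_EXECUTE_ALL = re.compile(
    r"^ALTER\s+VITESS_MIGRATION\s+EXECUTE\s+ALL\s*;?\s*$",
    re.IGNORECASE,
)


def parse_execute(statement: str) -> ExecuteCommand:
    """Parse one of the three proposed EXECUTE variants (hypothetical helper)."""
    s = statement.strip()
    m = _EXECUTE_ONE.match(s)
    if m:
        # The optional SHARD clause yields None when absent.
        return ExecuteCommand(uuid=m.group("uuid"), shard=m.group("shard"))
    if _EXECUTE_ALL.match(s):
        return ExecuteCommand(uuid=None, shard=None)
    raise ValueError(f"not an EXECUTE statement: {statement!r}")


if __name__ == "__main__":
    print(parse_execute("ALTER VITESS_MIGRATION 'aa-bb' EXECUTE SHARD '-80';"))
```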

With this new syntax, we can submit a migration that is queued on all shards, keeping the queue consistent, and then choose to start it only on a single, specific shard. When satisfied, we can proceed to execute it on all shards.
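The workflow above can be sketched as a small in-memory model. This is a hypothetical `PostponedMigration` class with made-up status names, assuming only two states per shard (queued and running); the real scheduler tracks many more states in _vt.schema_migrations.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical status names, for illustration only.
QUEUED, RUNNING = "queued", "running"


@dataclass
class PostponedMigration:
    uuid: str
    # Per-shard status; every shard gets the same queue entry on submission.
    status: dict[str, str] = field(default_factory=dict)

    @classmethod
    def submit(cls, uuid: str, shards: list[str]) -> "PostponedMigration":
        # Submitting queues the migration on all shards, so the queue stays consistent.
        return cls(uuid, {shard: QUEUED for shard in shards})

    def execute(self, shard: Optional[str] = None) -> list[str]:
        """Start the migration on one shard, or on every shard if shard is None.

        Returns the shards actually started by this call; shards that are
        already running are left alone.
        """
        targets = list(self.status) if shard is None else [shard]
        started = []
        for target in targets:
            if target not in self.status:
                raise KeyError(f"unknown shard {target!r}")
            if self.status[target] == QUEUED:
                self.status[target] = RUNNING
                started.append(target)
        return started


if __name__ == "__main__":
    m = PostponedMigration.submit("uuid-1", ["-80", "80-"])
    print(m.execute("-80"))  # ['-80']: validate on one shard first
    print(m.execute())       # ['80-']: then everywhere; '-80' already runs
```

Making "execute everywhere" skip shards that are already running is what lets the per-shard step and the final all-shards step compose without double-starting anything.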

We knowingly take on the risk of an inconsistent schema across shards; this is most useful for adding or modifying indexes.

This should be fairly straightforward to develop, and it should work well with all existing scheduling options. For example, it will be possible to combine --postpone-execution with --postpone-completion (though that combination does sound a bit over the top).

Thoughts welcome.

shlomi-noach, Aug 01 '22 06:08

This sounds great, and much safer than #7825. We'll still have to state in BOLD ALL-CAPS that people are assuming the risk of an inconsistent schema. @vitessio/query-serving, what sort of spurious bug reports might we get from this? And are there things we can do in code to head those off?

deepthi, Aug 01 '22 16:08

> And are there things we can do in code to head those off?

I'm thinking that ALTER VITESS_MIGRATION '<uuid>' EXECUTE SHARD '<shard-name>' could intentionally generate a warning, letting the user know that running on a single shard may not be safe.
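A minimal sketch of what such a warning could compute, assuming a hypothetical `per_shard_warning` helper that compares the shards a migration has been executed on against the keyspace's full shard list. The message wording is made up; it could surface as a regular SQL warning (the kind a client reads with SHOW WARNINGS), but how Vitess would attach it is not specified in this thread.

```python
from typing import Optional


def per_shard_warning(uuid: str, all_shards: list[str], executed: set[str]) -> Optional[str]:
    """Return a warning if the migration has run on some, but not all, shards.

    Hypothetical helper for illustration only.
    """
    pending = [s for s in all_shards if s not in executed]
    if not executed or not pending:
        # No divergence: either nothing has run yet, or every shard has.
        return None
    return (
        f"migration {uuid} executed only on shard(s) {', '.join(sorted(executed))}; "
        f"not yet executed on {', '.join(pending)}: schema may be inconsistent across shards"
    )


if __name__ == "__main__":
    print(per_shard_warning("uuid-1", ["-80", "80-"], {"-80"}))
```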

shlomi-noach, Aug 01 '22 16:08

Please see https://github.com/vitessio/vitess/pull/10915 for an implementation (the new term is "postpone launch", because "execute" is such an overloaded term).

shlomi-noach, Aug 03 '22 06:08