vitess icon indicating copy to clipboard operation
vitess copied to clipboard

Bug Report: Unique Constraints issues when consolidating shards with "hot" rows

Open lizztheblizz opened this issue 1 year ago • 5 comments

Overview of the Issue

When using MoveTables to consolidate shards, either by running SwitchTraffic while moving from 2 to 4 shards (where it affects the _reverse workflow), or during the normal operation of a MoveTables command moving from a 2-shard to a 4-shard keyspace, it is possible for vreplication to break because of unique keys, making it near impossible to recover.

Scenario:

  • One table, primary key + secondary unique key applied on the MySQL level
  • One secondary lookup vindex table using consistent_lookup_unique
  • Topology
    • 4-sharded keyspace (-40,40-80,80-c0,c0-) as source
    • 2-sharded keyspace (-80,80-) as target
    • Same vschema across both. We are simply consolidating shards
  • High volume write scenario on the source shards with the following pattern:
    • writer(-40)
      • insert into table1(uid) values (25); --> inserts row with values id=1, uid=25
      • ...
      • delete from table1 where id=1;
    • writer(40-80)
      • insert into table1(uid) values (25); --> inserts row with values id=2, uid=25
      • ...
      • delete from table1 where id=2;
    • writer(-40)
      • insert into table1(uid) values (25); --> inserts row with values id=3, uid=25
      • ...
      • delete from table1 where id=3;
    • writer(40-80)
      • insert into table1(uid) values (25); --> inserts row with values id=4, uid=25
      • ...
      • delete from table1 where id=4;
    • etc.

Because the vstreams for -40 and 40-80 are executing independently from eachother, when consolidating these streams back into -80, it is very easy for one of those vstreams to fall behind the other, and fail perpetually thereafter. The error we observe is a unique constraint error on the MySQL side.

A workaround for the above scenario could be to run the vstream sessions with SET unique_checks=0;. This should be a safe operation in this case, because we still have our secondary lookup vindex table enforcing uniqueness across shards at the source side. I'm unclear if it would be considered safe in all unique constraint scenarios.

Reproduction Steps

  1. Using the following vschema:
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    },
  "uid_vdx": {
      "type": "consistent_lookup_unique",
      "params": {
        "from": "uid",
        "table": "ks.uid_vdx",
        "to": "keyspace_id"
      },
      "owner": "table1"
    },
  }
  "tables": {
    "table1": {
      "column_vindexes": [
        {
          "column": "id",
          "name": "hash"
        },
        {
          "column": "uid",
          "name": "uid_vdx"
        }
      ]
    },
    "uid_vdx": {
      "columnVindexes": [
        {
          "column": "uid",
          "name": "hash"
        }
      ]
    }
  }
}
  1. Using the following schema:
CREATE TABLE table1 (
id int not null,
uid int not null,
primary key (id),
unique key uid(uid)
)

Assume a sequence table exists in an unsharded keyspace to help populate id.

Binary Version

2023-09-26.0a5a570 (v16.0.3)

Operating System and Environment details

PlanetScaleDB k8s environment

Log Fragments

The following vreplication streams exist for workflow ks2.Reshard_reverse:

id=23 on -80/gcp_uscentral1c_1-3344908643: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:30 2023.
id=24 on -80/gcp_uscentral1c_1-3344908643: Status: Error: Duplicate entry '25' for key 'table1.uid' (errno 1062) (sqlstate 23000) during query: insert into table1(id,uid) values (3,25).
id=23 on 80-/gcp_uscentral1c_1-0264185158: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:31 2023.
id=24 on 80-/gcp_uscentral1c_1-0264185158: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:31 2023.

lizztheblizz avatar Oct 17 '23 10:10 lizztheblizz