
RFC: VTTablet Schema API

Open rohit-nayak-ps opened this issue 2 years ago • 0 comments

Feature Description

VTTablet Schema API

Working Draft

TL;DR

VTTablet Schema API is a new vttablet service to watch, get, and reload MySQL schemas. It will replace the current VTGate and VReplication schema tracker functionality. The table schemas will contain the subset of table attributes required by vtgate and vreplication. The current VReplication schema tracker mechanism will be modified to provide the implementation. vtgate will continue to subscribe to vttablet's healthcheck to get the list of changed tables, but it will use the new API to fetch the schema for those tables instead of the mechanism in vttablet's health stream.

Motivation

There are two schema tracking mechanisms today: the VTGate Schema Tracker and the VReplication Schema Versioning (aka VReplication SchemaTracker/Historian), each of which stores the schema in different _vt tables. In addition, tabletserver has its own periodic and on-demand schema loading, into in-memory structures.

There are limitations faced by consumers of these schema functionalities in vttablet:

  • Online DDL is forced to reload the entire schema via the tablet SchemaManager even though it affects a single table
  • The VTGate tracker reloads the schema periodically (not when it changes) and hence it can be out-of-sync
  • VTGate doesn’t use an RPC but directly queries the underlying _vt tables, making it brittle with respect to _vt schema changes
  • The vreplication tracker stores the entire schema as a blob for any schema change, which is wasteful in clusters where there are a huge number of tables with frequent schema changes
  • The tablet schema engine reloads all tables, which can take a significant amount of time when there are many tables

Goal

  • Single schema tracking mechanism with VTTablet providing an external and internal API
  • VTGate schemas should be updated via a push rather than a pull
  • Ability to reload only changed tables rather than the entire schema

Caveat:

  • Some applications might still need to query MySQL directly for information not stored by this API (such as extended attributes).

API

  • GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error)
  • Reload(keyspace string, tables []string, onlyChanged bool) (altered, added, dropped []string, err error)

Note:

  • Empty gtid => latest
  • Empty table list => all tables
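
The two calls and the conventions in the note can be sketched as a Go interface. This is only an illustration under stated assumptions: the trimmed Schema struct, the fakeTablet type, and the error messages are hypothetical and not part of Vitess, and the fake only knows the latest schema, so a non-empty gtid is simply echoed back.

```go
package main

import (
	"fmt"
	"sort"
)

// Schema is a trimmed, hypothetical stand-in for the RFC's Schema type.
type Schema struct {
	KeyspaceName string
	Tables       []string // table names only, to keep the sketch short
	GTID         string   // position the result was resolved at (assumption)
}

// SchemaAPI mirrors the two calls listed above.
type SchemaAPI interface {
	GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error)
	Reload(keyspace string, tables []string, onlyChanged bool) (altered, added, dropped []string, err error)
}

// fakeTablet is an in-memory stand-in for a vttablet (hypothetical).
type fakeTablet struct {
	keyspace   string
	latestGTID string
	tables     map[string]bool
}

var _ SchemaAPI = (*fakeTablet)(nil) // compile-time interface check

func (f *fakeTablet) GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error) {
	if keyspace != f.keyspace {
		return nil, fmt.Errorf("unknown keyspace %q", keyspace)
	}
	if gtid == "" { // empty gtid => latest
		gtid = f.latestGTID
	}
	names := make([]string, 0, len(f.tables))
	if len(tables) == 0 { // empty table list => all tables
		for name := range f.tables {
			names = append(names, name)
		}
	}
	for _, name := range tables {
		if !f.tables[name] {
			return nil, fmt.Errorf("unknown table %q", name)
		}
		names = append(names, name)
	}
	sort.Strings(names)
	return []*Schema{{KeyspaceName: keyspace, Tables: names, GTID: gtid}}, nil
}

// Reload is a no-op in this fake; a real implementation would re-read MySQL.
func (f *fakeTablet) Reload(keyspace string, tables []string, onlyChanged bool) (altered, added, dropped []string, err error) {
	return nil, nil, nil, nil
}

func main() {
	f := &fakeTablet{keyspace: "commerce", latestGTID: "pos-2",
		tables: map[string]bool{"customer": true, "orders": true}}
	schemas, err := f.GetSchema("commerce", nil, "") // all tables, latest position
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(schemas[0].Tables, schemas[0].GTID)
}
```

Naming the error result in Reload keeps the signature valid Go, since named and unnamed results cannot be mixed in one result list.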

Schema

  • KeyspaceName string
  • TableSchema string
  • Tables []SchemaTable // points to latest version
  • UpdatedAt datetime

SchemaTable

  • TableName string
  • Columns []Column
  • PKs []int // denormalizing pks since vreplication needs it always
  • CreateDDL string // allows consumers to parse the DDL for collation/charset info, indexes, etc.
  • GTID string // or Position?
  • Version

Column //subset of attributes

  • ColumnName string
  • Ordinal int
  • DataType string
  • Collation string // defer to SchemaTable.CreateDDL?
  • Charset string // defer to SchemaTable.CreateDDL?
  • ColumnKey string // defer to SchemaTable.CreateDDL?

SchemaVersion

  • Version bigint
  • Ddl string // null for the initial load, otherwise the DDL that led to the new version
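
As a rough sketch, the data model above could map to Go structs like the following. Several details are assumptions made for illustration: UpdatedAt is a time.Time, the untyped Version field on SchemaTable is an int64 to match SchemaVersion.Version, and PKs holds zero-based indexes into Columns. PKColumnNames is a hypothetical helper, not an existing Vitess function.

```go
package main

import (
	"fmt"
	"time"
)

// Column carries the subset of column attributes listed above.
type Column struct {
	ColumnName string
	Ordinal    int
	DataType   string
	Collation  string
	Charset    string
	ColumnKey  string
}

// SchemaTable is one version of one table.
type SchemaTable struct {
	TableName string
	Columns   []Column
	PKs       []int // assumption: zero-based indexes into Columns
	CreateDDL string
	GTID      string
	Version   int64 // assumption: same type as SchemaVersion.Version
}

// SchemaVersion records the DDL that produced a version.
type SchemaVersion struct {
	Version int64
	Ddl     string // empty for the initial load
}

// Schema groups the latest table versions for a keyspace.
type Schema struct {
	KeyspaceName string
	TableSchema  string
	Tables       []SchemaTable
	UpdatedAt    time.Time // assumption: "datetime" maps to time.Time
}

// PKColumnNames resolves the denormalized PK indexes to column names.
func (t SchemaTable) PKColumnNames() ([]string, error) {
	names := make([]string, 0, len(t.PKs))
	for _, idx := range t.PKs {
		if idx < 0 || idx >= len(t.Columns) {
			return nil, fmt.Errorf("table %s: pk index %d out of range", t.TableName, idx)
		}
		names = append(names, t.Columns[idx].ColumnName)
	}
	return names, nil
}

func main() {
	orders := SchemaTable{
		TableName: "orders",
		Columns: []Column{
			{ColumnName: "id", Ordinal: 1, DataType: "bigint"},
			{ColumnName: "sku", Ordinal: 2, DataType: "varchar"},
		},
		PKs: []int{0},
	}
	s := Schema{KeyspaceName: "commerce", Tables: []SchemaTable{orders}, UpdatedAt: time.Now()}
	names, err := s.Tables[0].PKColumnNames()
	fmt.Println(names, err)
}
```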

Design notes:

  • Need to version tables. Is SchemaVersion a good design choice?
  • Tables might be present in some versions only (deleted tables)
  • Efficiently get a snapshot of the schema at a specific position
  • Quick access to the latest schema
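
One possible shape for these requirements is a per-table, append-only version list with a binary search for snapshot reads. The sketch below is an assumption-laden illustration, not the proposed design: it uses a monotonically increasing int64 version as a stand-in for a GTID position (real GTID sets are not simple integers), and the history and tableVersion types are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// tableVersion is one recorded state of a table (hypothetical type).
type tableVersion struct {
	Version   int64 // stand-in for a replication position
	CreateDDL string
	Dropped   bool // tombstone: the table does not exist from this version on
}

// history keeps an append-only, version-ordered list per table.
type history struct {
	byTable map[string][]tableVersion
}

func newHistory() *history {
	return &history{byTable: map[string][]tableVersion{}}
}

// record appends a version; callers must record in increasing version order.
func (h *history) record(table string, v tableVersion) {
	h.byTable[table] = append(h.byTable[table], v)
}

// at returns the table as it existed at the given version, if it existed.
func (h *history) at(table string, version int64) (tableVersion, bool) {
	vs := h.byTable[table]
	// i is the index of the first entry newer than the requested version.
	i := sort.Search(len(vs), func(i int) bool { return vs[i].Version > version })
	if i == 0 || vs[i-1].Dropped {
		return tableVersion{}, false
	}
	return vs[i-1], true
}

// latest is a constant-time read of the newest version.
func (h *history) latest(table string) (tableVersion, bool) {
	vs := h.byTable[table]
	if len(vs) == 0 || vs[len(vs)-1].Dropped {
		return tableVersion{}, false
	}
	return vs[len(vs)-1], true
}

func main() {
	h := newHistory()
	h.record("orders", tableVersion{Version: 1, CreateDDL: "create table orders (id bigint)"})
	h.record("orders", tableVersion{Version: 3, CreateDDL: "create table orders (id bigint, sku varchar(64))"})
	v, ok := h.at("orders", 2)
	fmt.Println(v.CreateDDL, ok)
}
```

Tombstone entries cover tables that exist only in some versions, and a reader streaming old events can ask for the table as of the event's position without loading every other table.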

Assumptions

  • The vreplication tracker will need to be enabled for this mechanism to work. Should this be default?

Implementation

Summary of code changes required:

VReplication Tracker

This needs significant refactoring: currently all tables are stored per schema change along with the ddl that caused the change and the gtid at which it happened. We will need to map this to the new data model where only the affected table is stored. The code to load these tables into the in-memory structure required for the historian will have to be written afresh.
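
To make the mapping concrete, here is a minimal sketch of turning a sequence of full-schema snapshots into per-table rows, emitting a row only when a table's definition changed. The snapshot and tableRow types are hypothetical, and a plain string stands in for the stored table definition; the real storage format is not modelled.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot models the current layout: the whole schema at one position.
type snapshot struct {
	GTID   string
	DDL    string
	Tables map[string]string // table name -> definition (stand-in for the blob)
}

// tableRow models one row in a per-table layout (hypothetical).
type tableRow struct {
	Table      string
	GTID       string
	DDL        string
	Definition string
	Dropped    bool
}

// sortedKeys returns map keys in a stable order so output is deterministic.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// toPerTable walks snapshots in position order and emits a row only for
// tables that were added, altered, or dropped relative to the previous one.
func toPerTable(snaps []snapshot) []tableRow {
	var rows []tableRow
	prev := map[string]string{}
	for _, s := range snaps {
		for _, name := range sortedKeys(s.Tables) {
			def := s.Tables[name]
			if old, ok := prev[name]; !ok || old != def {
				rows = append(rows, tableRow{Table: name, GTID: s.GTID, DDL: s.DDL, Definition: def})
			}
		}
		for _, name := range sortedKeys(prev) {
			if _, ok := s.Tables[name]; !ok {
				rows = append(rows, tableRow{Table: name, GTID: s.GTID, DDL: s.DDL, Dropped: true})
			}
		}
		prev = s.Tables
	}
	return rows
}

func main() {
	rows := toPerTable([]snapshot{
		{GTID: "g1", Tables: map[string]string{"a": "v1", "b": "v1"}},
		{GTID: "g2", DDL: "alter table a ...", Tables: map[string]string{"a": "v2", "b": "v1"}},
	})
	for _, r := range rows {
		fmt.Printf("%s@%s dropped=%v\n", r.Table, r.GTID, r.Dropped)
	}
}
```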

Tablet Schema Engine

The engine currently reloads all tables. We can identify changed tables in the same way the vtgate tracker does today.
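
A minimal sketch of detecting changed tables follows, assuming each table can be reduced to a comparable fingerprint (for example a creation timestamp or a hash of its DDL). The fingerprint choice and the diffTables helper are assumptions for illustration; the actual signal used by the vtgate tracker is not modelled here. The three result lists match the shape returned by Reload.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTables compares two table -> fingerprint maps and reports which
// tables were altered, added, or dropped. Results are sorted by name.
func diffTables(before, after map[string]string) (altered, added, dropped []string) {
	for name, fp := range after {
		old, ok := before[name]
		switch {
		case !ok:
			added = append(added, name)
		case old != fp:
			altered = append(altered, name)
		}
	}
	for name := range before {
		if _, ok := after[name]; !ok {
			dropped = append(dropped, name)
		}
	}
	sort.Strings(altered)
	sort.Strings(added)
	sort.Strings(dropped)
	return altered, added, dropped
}

func main() {
	before := map[string]string{"customer": "fp1", "orders": "fp1", "legacy": "fp1"}
	after := map[string]string{"customer": "fp1", "orders": "fp2", "payments": "fp1"}
	altered, added, dropped := diffTables(before, after)
	fmt.Println("altered:", altered, "added:", added, "dropped:", dropped)
}
```

Only the tables in the altered and added lists would then need their full definitions reloaded.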

VTGate

The subscription to the healthcheck does not change. However, instead of updating the schema with direct queries to vttablet from within the vtgate tracker, vtgate will call the new API.

Online DDL

No change. As a side-effect we will only load changed tables, effectively improving performance.

Deprecation and Deployment

VSchema Authoritative Schema

The VSchema authoritative schema doesn't include collations today, even though VTGate already uses them (obtained via its tracker). In addition, upcoming applications may need further column information (such as indexes). We need to decide whether to extend the authoritative schema for these or deprecate it entirely. Earlier we did not have a mechanism in vtgate to efficiently maintain table schemas and hence added support for the authoritative schema. Maintaining this schema is fragile since it needs to reflect new DDLs and will, in any case, race with Online DDL's cutover. So can we deprecate it, or is there some reason to continue to support it?

_vt.schemacopy and related Code

This table will no longer be used once we deploy the Schema API and can potentially be dropped in a following release.

_vt.schema_version and related Code

This table will be replaced with the new set of tables required to support the Schema API and can potentially be dropped in a following release.

Upgrade process

We will need to migrate the data in _vt.schema_version to the new model.

Current status (as of v15)

VReplication

Stores the entire schema for each DDL and provides an internal API to get the schema for a table at a particular GTID. This is useful while streaming older events where the binlog images are from older table structures.

For each DDL, a list of binlogdatapb.MinimalSchema objects is stored in _vt.schema_version, with the associated GTID. The information stored per table is the table name, the query.Field object for each column, and the primary key columns for that table.

VTGate

VTGate uses its schema tracker (optionally) as a means to directly define the authoritative list of columns it needs to perform certain queries (see https://vitess.io/docs/14.0/reference/features/schema-tracking/). The alternative is for the user to define this list manually in the vschema.

The VTGate tracker subscribes to the vttablet HealthChecks (primary-only?) and on a VSchema change updates the VTGate’s local copy of the schema. It uses the column name, data type and collation only.

Online DDL

Online DDL reloads the vttablet schema on a cutover. This currently reloads all tables.

rohit-nayak-ps · Nov 03 '22 19:11