RFC: VTTablet Schema API
Feature Description
VTTablet Schema API
Working Draft
TL;DR
VTTablet Schema API is a new vttablet service to watch, get and reload MySQL schemas. It will replace the current VTGate and VReplication schema tracker functionality. The table schemas will contain the subset of table attributes required by vtgate and vreplication. The current VReplication schema tracker mechanism will be modified to provide the implementation. vtgate will continue to subscribe to vttablet's healthcheck to get the list of changed tables, but will use the new API, rather than the mechanism in vttablet's health stream, to get the schema for those tables.
Motivation
There are two schema tracking mechanisms today: the VTGate Schema Tracker and the VReplication Schema Versioning (aka VReplication SchemaTracker/Historian), each of which stores the schema in different _vt tables. In addition, tabletserver has its own periodic and on-demand schema loading, into in-memory structures.
There are limitations faced by consumers of these schema functionalities in vttablet:
- Online DDL is forced to reload the entire schema via the tablet SchemaManager even though it affects a single table
- The VTGate tracker reloads the schema periodically (not when it changes) and hence it can be out-of-sync
- VTGate doesn’t use an RPC but directly queries the underlying _vt tables, making it brittle wrt _vt schema changes
- The vreplication tracker stores the entire schema as a blob for any schema change, which is wasteful in clusters where there are a huge number of tables with frequent schema changes
- The tablet schema engine reloads all tables, which can take a significant amount of time when there are a large number of tables
Goal
- Single schema tracking mechanism with VTTablet providing an external and internal API
- VTGate schemas should be updated via a push rather than a pull
- Ability to reload only changed tables rather than the entire schema
Caveat:
- Some applications might need to query mysql for information not stored by this API (like extended attributes).
API
- GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error)
- Reload(keyspace string, tables []string, onlyChanged bool) (altered, added, dropped []string, err error)
Note:
- Empty gtid => latest
- Empty table list => all tables
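A minimal Go sketch of how the two calls and their defaults might look. The `SchemaAPI` interface name, the trimmed `Schema` struct and the in-memory `memSchemaAPI` are assumptions made for illustration and are not part of the proposal. Only `GetSchema` is implemented, and only for the latest position.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Schema is a trimmed placeholder for the Schema type in this RFC.
type Schema struct {
	KeyspaceName string
	Tables       []string // table names only, to keep the sketch short
}

// SchemaAPI is a hypothetical Go rendering of the two proposed calls.
type SchemaAPI interface {
	GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error)
	Reload(keyspace string, tables []string, onlyChanged bool) (altered, added, dropped []string, err error)
}

// memSchemaAPI is an in-memory stand-in (an assumption for illustration).
// It maps keyspace -> table name -> latest CREATE statement.
type memSchemaAPI struct {
	tables map[string]map[string]string
}

// GetSchema applies the defaults from the note: an empty gtid means latest,
// an empty table list means all tables.
func (m *memSchemaAPI) GetSchema(keyspace string, tables []string, gtid string) ([]*Schema, error) {
	if gtid != "" {
		return nil, errors.New("historical lookups are outside this sketch")
	}
	ks, ok := m.tables[keyspace]
	if !ok {
		return nil, fmt.Errorf("unknown keyspace %q", keyspace)
	}
	if len(tables) == 0 {
		for name := range ks {
			tables = append(tables, name)
		}
		sort.Strings(tables) // deterministic order for callers
	}
	s := &Schema{KeyspaceName: keyspace}
	for _, name := range tables {
		if _, ok := ks[name]; !ok {
			return nil, fmt.Errorf("unknown table %q", name)
		}
		s.Tables = append(s.Tables, name)
	}
	return []*Schema{s}, nil
}

func main() {
	api := &memSchemaAPI{tables: map[string]map[string]string{
		"commerce": {
			"orders":    "create table orders(id int)",
			"customers": "create table customers(id int)",
		},
	}}
	all, err := api.GetSchema("commerce", nil, "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(all[0].Tables)
}
```

Because an empty list selects every table, callers that build the list dynamically should guard against passing an empty list by accident.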
Schema
- KeyspaceName string
- TableSchema string
- Tables []SchemaTable // points to latest version
- UpdatedAt datetime
SchemaTable
- TableName string
- Columns []Column
- PKs []int // denormalizing pks since vreplication needs it always
- CreateDDL string // allows consumers to parse the ddl for collation/charset info, indexes etc
- GTID string // or Position?
- Version
Column //subset of attributes
- ColumnName string
- Ordinal int
- DataType string
- Collation string // defer to SchemaTable.CreateDDL?
- Charset string // defer to SchemaTable.CreateDDL?
- ColumnKey string // defer to SchemaTable.CreateDDL?
SchemaVersion
- Version bigint
- Ddl string //null for initial load, otherwise the ddl that led to the new version
Design notes:
- Need to version tables. Is SchemaVersion a good design choice?
- Tables might be present in some versions only (deleted tables)
- Efficiently get a snapshot of the schema at a specific position
- Quick access to the latest schema
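To make the versioning questions concrete, here is a rough Go sketch of one possible layout: each table keeps its own history, sorted by version, and a snapshot is assembled by taking the newest entry at or below the requested version. The `tableVersion` and `history` types, the `Dropped` flag and the use of a monotonically increasing integer version in place of a GTID are all assumptions, not decisions made by this RFC.

```go
package main

import (
	"fmt"
	"sort"
)

// tableVersion is one entry in a table's history (hypothetical type).
// Dropped marks the version at which the table was removed.
type tableVersion struct {
	Version   int64
	CreateDDL string
	Dropped   bool
}

// history maps table name -> entries sorted ascending by Version.
type history map[string][]tableVersion

// snapshotAt returns table name -> CreateDDL as of the given version.
// Tables created later, or already dropped, are left out.
func (h history) snapshotAt(version int64) map[string]string {
	snap := make(map[string]string)
	for name, versions := range h {
		// Index of the first entry that is newer than the requested version.
		i := sort.Search(len(versions), func(i int) bool {
			return versions[i].Version > version
		})
		if i == 0 {
			continue // the table did not exist yet
		}
		if v := versions[i-1]; !v.Dropped {
			snap[name] = v.CreateDDL
		}
	}
	return snap
}

func main() {
	h := history{
		"orders": {
			{Version: 1, CreateDDL: "create table orders(id int)"},
			{Version: 3, CreateDDL: "create table orders(id int, total int)"},
		},
		"tmp": {
			{Version: 2, CreateDDL: "create table tmp(id int)"},
			{Version: 4, Dropped: true},
		},
	}
	for _, v := range []int64{1, 2, 4} {
		fmt.Println(v, len(h.snapshotAt(v)), "table(s)")
	}
}
```

With this layout the latest schema is the last entry of each slice, and a snapshot costs one binary search per table. Mapping a GTID to a version number is left open here.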
Assumptions
- The vreplication tracker will need to be enabled for this mechanism to work. Should this be default?
Implementation
Summary of code changes required:
VReplication Tracker
This needs significant refactoring: currently all tables are stored per schema change along with the ddl that caused the change and the gtid at which it happened. We will need to map this to the new data model where only the affected table is stored. The code to load these tables into the in-memory structure required for the historian will have to be written afresh.
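The mapping from the current model to the new one could look roughly like the Go sketch below: walk the stored full-schema snapshots in order and emit a row only for tables that were added, altered or dropped. The `fullSnapshot` and `tableVersion` types, and the use of an integer version and an opaque definition string, are hypothetical simplifications of what is actually stored.

```go
package main

import (
	"fmt"
	"sort"
)

// fullSnapshot models the current storage: every table, once per DDL.
type fullSnapshot struct {
	Version int64
	Tables  map[string]string // table name -> definition
}

// tableVersion models the proposed storage: one row per table per change.
type tableVersion struct {
	Version    int64
	Table      string
	Definition string // empty when Dropped is true
	Dropped    bool
}

// toPerTable converts snapshots (ordered by Version) into per-table rows.
func toPerTable(snapshots []fullSnapshot) []tableVersion {
	var rows []tableVersion
	prev := map[string]string{}
	for _, snap := range snapshots {
		var changed []tableVersion
		for name, def := range snap.Tables {
			if old, ok := prev[name]; !ok || old != def {
				changed = append(changed, tableVersion{Version: snap.Version, Table: name, Definition: def})
			}
		}
		for name := range prev {
			if _, ok := snap.Tables[name]; !ok {
				changed = append(changed, tableVersion{Version: snap.Version, Table: name, Dropped: true})
			}
		}
		// Map iteration order is random, so sort rows within a version.
		sort.Slice(changed, func(i, j int) bool { return changed[i].Table < changed[j].Table })
		rows = append(rows, changed...)
		prev = snap.Tables
	}
	return rows
}

func main() {
	rows := toPerTable([]fullSnapshot{
		{Version: 1, Tables: map[string]string{"a": "a-v1", "b": "b-v1"}},
		{Version: 2, Tables: map[string]string{"a": "a-v2", "b": "b-v1"}},
		{Version: 3, Tables: map[string]string{"a": "a-v2"}},
	})
	for _, r := range rows {
		fmt.Printf("%+v\n", r)
	}
}
```

The same conversion is one way the existing data could be migrated during the upgrade.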
Tablet Schema Engine
The schema engine currently reloads all tables. We can identify changed tables in a similar way to how the vtgate tracker does today.
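One way to identify changed tables, sketched in Go under the assumption that each table can be reduced to a comparable fingerprint (here simply its CREATE statement). The function name `diffTables` is hypothetical, and the sketch does not describe how the existing vtgate tracker detects changes. The three return values line up with those of the proposed Reload call.

```go
package main

import (
	"fmt"
	"sort"
)

// diffTables compares two table-name -> fingerprint maps and reports which
// tables were altered, added or dropped between them.
func diffTables(before, after map[string]string) (altered, added, dropped []string) {
	for name, fp := range after {
		old, ok := before[name]
		switch {
		case !ok:
			added = append(added, name)
		case old != fp:
			altered = append(altered, name)
		}
	}
	for name := range before {
		if _, ok := after[name]; !ok {
			dropped = append(dropped, name)
		}
	}
	// Sort so that results do not depend on map iteration order.
	sort.Strings(altered)
	sort.Strings(added)
	sort.Strings(dropped)
	return altered, added, dropped
}

func main() {
	before := map[string]string{
		"orders":    "create table orders(id int)",
		"customers": "create table customers(id int)",
		"tmp":       "create table tmp(id int)",
	}
	after := map[string]string{
		"orders":    "create table orders(id int, total int)",
		"customers": "create table customers(id int)",
		"invoices":  "create table invoices(id int)",
	}
	altered, added, dropped := diffTables(before, after)
	fmt.Println("altered:", altered, "added:", added, "dropped:", dropped)
}
```

Only the tables in `altered` and `added` would then need their definitions reloaded.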
VTGate
Subscription to the healthcheck does not change. However, instead of updating the schema using direct queries to vttablet from within the vtgate tracker, vtgate will call the new API.
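The flow on the vtgate side might look like the following Go sketch. `schemaCache`, `onHealthCheck` and `fetchFunc` are hypothetical names. `fetchFunc` stands in for a GetSchema call and returns only column names, and the sketch assumes that a table absent from the response has been dropped.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a GetSchema call: it returns table -> column names.
type fetchFunc func(keyspace string, tables []string) (map[string][]string, error)

// schemaCache is a hypothetical vtgate-side cache: keyspace -> table -> columns.
type schemaCache struct {
	mu   sync.Mutex
	cols map[string]map[string][]string
}

// onHealthCheck refreshes only the tables reported as changed.
func (c *schemaCache) onHealthCheck(keyspace string, changed []string, fetch fetchFunc) error {
	if len(changed) == 0 {
		// An empty list means "all tables" in the proposed API, so return
		// early instead of triggering a full fetch.
		return nil
	}
	fetched, err := fetch(keyspace, changed)
	if err != nil {
		return err // keep serving the previous schema
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cols == nil {
		c.cols = map[string]map[string][]string{}
	}
	if c.cols[keyspace] == nil {
		c.cols[keyspace] = map[string][]string{}
	}
	for _, table := range changed {
		if cols, ok := fetched[table]; ok {
			c.cols[keyspace][table] = cols
		} else {
			delete(c.cols[keyspace], table) // assumed dropped
		}
	}
	return nil
}

func main() {
	fetch := func(keyspace string, tables []string) (map[string][]string, error) {
		return map[string][]string{"orders": {"id", "total"}}, nil
	}
	c := &schemaCache{}
	if err := c.onHealthCheck("commerce", []string{"orders"}, fetch); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.cols["commerce"]["orders"])
}
```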
Online DDL
No change. As a side-effect we will only load changed tables, effectively improving performance.
Deprecation and Deployment
VSchema Authoritative Schema
The VSchema authoritative schema doesn't include collations today, although VTGate already uses them (obtained via its tracker). In addition, we may need further column information for upcoming applications (such as indexes). We need to decide whether to extend the authoritative schema for these or to deprecate it entirely. We added support for the authoritative schema earlier because vtgate had no mechanism to efficiently maintain table schemas. Maintaining this schema is fragile since it needs to reflect new DDLs and will, in any case, race with Online DDL's cutover. So can we deprecate it, or is there some reason to continue to support it?
_vt.schemacopy and related Code
This table will no longer be used once we deploy the Schema API and can potentially be dropped in a following release.
_vt.schema_versions and related Code
This table will be replaced with the new set of tables required to support the Schema API and can potentially be dropped in a following release.
Upgrade process
We will need to migrate the data in _vt.schema_versions to the new model.
Current status (as of v15)
VReplication
Stores the entire schema for each DDL and provides an internal API to get the schema for a table at a particular GTID. This is useful while streaming older events where the binlog images are from older table structures.
For each DDL, a list of binlogdatapb.MinimalSchema objects is stored in _vt.version_schema, with the associated gtid. The information stored per table is the table name, the query.Field object for each column and the primary key columns for that table.
VTGate
VTGate uses its schema tracker (optionally) as a means to directly define the authoritative list of columns it needs to perform certain queries (see https://vitess.io/docs/14.0/reference/features/schema-tracking/). The alternative is for the user to define this list manually in the vschema.
The VTGate tracker subscribes to the vttablet HealthChecks (primary-only?) and on a VSchema change updates the VTGate’s local copy of the schema. It uses the column name, data type and collation only.
Online DDL
Online DDL reloads the vttablet schema on a cutover. This currently reloads all tables.