RFC: Orchestrator as a native Vitess component
Introduction
With #6582, we are starting on a Vitess-specific fork of the Orchestrator that was originally created by @shlomi-noach.
Previously, Vitess provided an integration to loosely cooperate with an install of Orchestrator managing automated failover for the MySQL instances underlying Vitess. However, users often encountered corner cases where the two sides could make conflicting changes, potentially leaving the system in an inconsistent state.
We intend to resolve these issues and provide a smooth and unified experience to the end-user.
Approach
Many approaches were considered:
1. Build a Vitess-native failover mechanism.
2. Copy portions of the Orchestrator code into Vitess.
3. Extend Orchestrator to allow tighter integration and cooperation with Vitess.
4. Modify Orchestrator to become a first-class Vitess component.
Of the above approaches, we have decided on option 4. We reached this conclusion after studying the Orchestrator code, and realizing that it is written with the intent to allow customization.
Architecturally, the Orchestrator is closest in function to the work performed by vtctld and has some overlapping functionality. For example, `stop-replica` vs `StopReplica`, and `graceful-master-takeover` vs `PlannedReparentShard`.
On the other hand, for massive clusters with >10K nodes, it may be better to divide the Orchestrator's responsibility by keyspace or shard. In that situation, the design starts to deviate from the vtctld model. This requires further brainstorming.
To fork or not to fork
We debated extending the Orchestrator to accommodate Vitess into the existing code base, vs. forking the code. The conclusion was to fork due to the following reasons:
- Model: Orchestrator’s failover code depends on the discovered replication graph that connects replicas to their master. This model is substantially different from Vitess, where cluster membership is strongly dictated by the keyspace-shard an instance belongs to.
- Responsibility: Orchestrator is primarily responsible for performing failovers. However, this split of responsibility gave rise to many corner cases, especially when Vitess would wire up replication using a view of the world that differed from what Orchestrator saw. For this integration to work smoothly, the Orchestrator needs to take responsibility for all functions of the cluster.
- Opinions: Orchestrator takes a hands-off approach, with the intent to accommodate however administrators would like to configure their MySQL instances. In Vitess, there is an expectation that a cluster should be mostly self-managed, with certain critical state transitions initiated manually. So, many of the flexibilities that Orchestrator accommodates are not allowed in Vitess.
- Velocity: By forking, we can move faster because there is no fear of breaking use cases that Vitess does not support.
End Product
The end product could be a unified component that merges Orchestrator and vtctld. We could either preserve the old vtctld name, or give the new binary a brand-new name.
This will extend vtctld’s responsibility to not only execute user commands, but also initiate actions against the cluster to remedy any problems that are automatically solvable.
Also, vtctld is designed to cooperate with other vtctlds present. This is in line with the Orchestrator's philosophy. Although the two use different methods today, these will eventually be unified.
As for the web UI, the Orchestrator has a pleasant and intuitive graphical interface. That interface will be used as the starting point, and the vtctld functionality will be adapted around it.
There should be links to vttablets from Orchestrator, similar to how there are links from the vtctld web UI to vttablets today.
Orchestrator also has a more structured CLI interface, which will likely be preferred over the one that vtctld organically grew.
If we decide to not merge with vtctld, we'll have to consider either keeping Orchestrator as a separate component, or see what it will look like if it ran inside vttablets.
Pluggable Durability
Users come with varying tolerances on durability. Additionally, cloud environments define complex topologies. For example, AWS has zones and regions. In order to meet these requirements, and future-proof ourselves, we will implement a pluggable durability policy. This will be based on an extension of FlexPaxos.
Essentially, this pluggability will allow us to address all existing known use cases as well as future ones we have not heard of. More details will follow.
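To make this concrete, here is a minimal Go sketch of what a pluggable durability policy could look like. The `DurabilityPolicy` interface, the `Ack` type, and the `crossCellPolicy` example are illustrative assumptions, not the actual design:

```go
package main

import "fmt"

// Ack describes a replica that acknowledged a semi-sync write.
// Cell is the failure domain (zone / AZ) the replica lives in.
type Ack struct {
	Cell string
}

// DurabilityPolicy decides whether a set of acks satisfies the
// cluster's durability requirement (a FlexPaxos-style quorum).
type DurabilityPolicy interface {
	SemiSyncAckers() int         // how many acks the master must wait for
	IsSatisfied(acks []Ack) bool // did the acks meet the policy?
}

// crossCellPolicy requires at least one ack from a cell other than
// the master's, so a whole-cell failure cannot lose committed writes.
type crossCellPolicy struct {
	masterCell string
}

var _ DurabilityPolicy = crossCellPolicy{}

func (p crossCellPolicy) SemiSyncAckers() int { return 1 }

func (p crossCellPolicy) IsSatisfied(acks []Ack) bool {
	for _, a := range acks {
		if a.Cell != p.masterCell {
			return true
		}
	}
	return false
}

func main() {
	p := crossCellPolicy{masterCell: "us-east-1a"}
	fmt.Println(p.IsSatisfied([]Ack{{Cell: "us-east-1a"}}))                       // false
	fmt.Println(p.IsSatisfied([]Ack{{Cell: "us-east-1a"}, {Cell: "us-east-1b"}})) // true
}
```

A stricter policy (e.g. "acks from two distinct regions") would be another implementation of the same interface, selected per keyspace.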
Details
This is a preliminary drill-down of what we think is needed. So, it’s presented as somewhat of a laundry list. We will refine this as we move forward.
MVP
A Minimum Viable Product is being developed. The latest code is currently in https://github.com/planetscale/vitess/tree/ss-oc2-vt-mode. It’s likely to evolve. Those who wish to preview upcoming work can look for branches in the PlanetScale repo that are prefixed with ss-oc. At this point, the MVP has the following functionality:
- Automatically discover all tablets and MySQL instances by walking the Vitess topology.
- Assign default rules for discovered items:
  - Promotion rules
  - DataCenters
- Automatically initialize a brand new cluster (replacing `InitShardMaster`).
- When a new vttablet comes up with its MySQL, wire it up to replicate from the current master.
- `GracefulTakeOver` performs the same function as `PlannedReparentShard`.
- Dead-master recovery, with Orchestrator directly performing the `TabletExternallyReparented` work. This still does not include everything that `EmergencyReparentShard` does today, but should be extended.
- Heal the cluster if tablets and MySQL instances do not agree on who is the master.
- Ensure masters and replicas have the correct read/write permissions and are connected correctly.
- Discovery scans per cell and handles the case of a whole cell being down.
The MVP is only a starting point. There is more work that needs to be done:
Code culture
- Port existing integration tests, make them pass and make them part of CI/CD
- Golint, govet, etc: These are currently failing, and need to be fixed.
- Write new tests to meet Vitess coding standards (coverage, etc).
- Context: Orchestrator does not use Golang contexts. But Vitess calls need them. So, we need to make sure we wire up all the TODO contexts.
- Change to use Vitess code infrastructure: servenv, logging, etc.
Metadata
- Most of the metadata state internally maintained by Orchestrator can be automatically reconstructed through Vitess’ discovery functionality. However, there are a few settings that need to be persisted, for example, states like Maintenance and Downtime Windows. These should be moved into the Vitess topo.
Improvements
- Use Vitess GTID code locally to compare GTIDs instead of calling into MySQL.
- Some retries may need an exponential back-off.
- Use persistent connections to tabletmanager to avoid frequent reconnects.
Changes in behavior
- DeadMaster: Currently uses file:pos to find the most progressed replica. This will not work if replicas can be detached. This should be changed to use GTID comparison.
- DeadMaster: Should use all instances of a cluster instead of just the ones in the replication graph.
- Topology Recovery feeds into Vitess’s event log.
- Enhanced security: Orchestrator should receive replication credentials from vttablets.
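As a sketch of the GTID-based comparison proposed for the DeadMaster change, the snippet below picks the most progressed replica. It handles only a simplified single-UUID `uuid:1-N` set; real MySQL GTID sets can hold many UUIDs and interval lists, which Vitess's GTID code handles in general:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGTID parses a simplified single-source GTID set of the form
// "uuid:1-N", returning the source UUID and the highest executed
// transaction number.
func parseGTID(s string) (uuid string, upto int64) {
	parts := strings.Split(s, ":")
	rng := strings.Split(parts[1], "-")
	upto, _ = strconv.ParseInt(rng[len(rng)-1], 10, 64)
	return parts[0], upto
}

// mostAdvanced returns the replica whose executed GTID set reaches
// furthest. Unlike file:pos comparison, this still works when a
// replica has been detached or repointed, because GTIDs identify
// transactions independently of binlog file coordinates.
func mostAdvanced(replicas map[string]string) string {
	best, bestUpto := "", int64(-1)
	for alias, gtid := range replicas {
		if _, upto := parseGTID(gtid); upto > bestUpto {
			best, bestUpto = alias, upto
		}
	}
	return best
}

func main() {
	replicas := map[string]string{
		"zone1-101": "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5",
		"zone1-102": "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-9",
	}
	fmt.Println(mostAdvanced(replicas)) // zone1-102
}
```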
New functionality
- Ability to mark a cell as down.
- Subscribe to Tablet streaming health to get real-time updates on changes in tablet states.
- Let vttablet’s streaming health provide all the MySQL info instead of Orchestrator fetching from two sources?
- Watch vttablets and reparent on their failure (even if MySQL is up).
- Support Multi-schema: multiple vttablets pointed at the same MySQL instance.
- Rewind/backtrack: Detect and fix extraneous GTIDs from recovered masters.
- Tags: Publish discoverable tags to the vttablets that will allow vtgates to change their routings accordingly; instead of fixed tablet types.
- Pluggable durability
- Go through vttablet for all actions so everything is uniform.
- Look at moving the recovery logic inside vttablet to better conform to the consensus doc.
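The streaming-health subscription above can be sketched as an event-driven watch loop; the `HealthUpdate` type is a simplified stand-in for vttablet's real streaming health record (which also carries replication lag, serving state, and the tablet's view of the master):

```go
package main

import "fmt"

// HealthUpdate is a simplified stand-in for a vttablet streaming
// health record.
type HealthUpdate struct {
	TabletAlias string
	Serving     bool
}

// watchHealth consumes a stream of health updates and fires
// onFailure when a tablet flips from serving to not serving. This
// triggers a recovery check immediately instead of waiting for the
// next polling cycle.
func watchHealth(updates <-chan HealthUpdate, onFailure func(alias string)) {
	last := map[string]bool{}
	for u := range updates {
		if wasServing, seen := last[u.TabletAlias]; seen && wasServing && !u.Serving {
			onFailure(u.TabletAlias)
		}
		last[u.TabletAlias] = u.Serving
	}
}

func main() {
	updates := make(chan HealthUpdate, 2)
	updates <- HealthUpdate{"zone1-100", true}
	updates <- HealthUpdate{"zone1-100", false}
	close(updates)
	watchHealth(updates, func(alias string) {
		fmt.Println("tablet down:", alias) // tablet down: zone1-100
	})
}
```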
Adding:
- Leadership: today, `orchestrator` itself is HA with either a shared MySQL backend or the `raft` consensus protocol to support leadership. In `vitess`, the back end will be the Global Topo. The global topo will implicitly dictate `orchestrator` leadership in the face of network isolation.
- Controlled leadership: assuming a stable and healthy topo, we want to be able to ask for a specific `orchestrator` node to become the leader. For example, it's common in many setups to have one cell (cloud vendor? AZ?) be "more active" than the others, e.g. a cell where more `master` servers are located; the reason is typically to get low latency with local apps/clients. In such a scenario it is also advisable, though not strictly required, to have the `orchestrator` leader in the same cell.
- Scaling: `orchestrator` will use a SQLite backend. SQLite does not allow any concurrency, and the number of interactions from `orchestrator` to its backend DB is linear in the number of probed servers. Thus, at some point, the database becomes a bottleneck. Somewhere in the thousands this should become a problem (with a MySQL backend, `orchestrator` is known to be able to sustain more than that). However, the good news is that we may not actually need persistence. The vitess-orc merger allows `orchestrator` to collect almost all information while probing the servers (e.g. `promotion-rule`s). Information like `downtime` can either be persisted to the Global Topo or collected from other vtctld/orchestrator nodes upon startup. This means we can use SQLite with the `file::memory:?cache=shared` setting, which makes it fully in-memory and should provide more capacity. I haven't tried this at scale as yet.
Adding the TODO list for tracking:
- [x] Logging to use vt/log
- [x] servenv package use and populate debug/vars, etc
- [x] Disable/Enable Global recoveries
- [ ] Go context usage
- [ ] Port integration tests
- [ ] Code coverage tests
- [x] Unused code cleanup
- [ ] Support vttablet failures
- [ ] Durability policy work tracked in #8975
- [ ] Subscribe to tablet health-checks
- [x] Parameterize timeouts & intervals
- [x] Route queries through vttablets
- [x] Federation
- [ ] Find alternate for AttemptRecoveryRegistration and throttling
- [ ] Remove replication manager
- [x] API and Config cleanup
- [ ] Audit Topology Recovery
- [x] VTOrc with backups
If we still want to keep the persistence, SQLite scales quite well in WAL mode (https://sqlite.org/wal.html), which just needs to be enabled. It still only allows one writer at a time, but it allows reads to occur concurrently with any other operations, and it eliminates the "database is busy" errors which plague anyone trying to use SQLite from multiple threads or processes. The in-memory mode will produce fewer locked/busy errors, but it still has a coarse-grained locking model, so it can still produce those errors; WAL mode uses snapshot isolation to avoid them. That also means that if we're explicitly or implicitly dependent on SQLite's very strict default isolation, we may need more explicit concurrency management (i.e. `SELECT FOR UPDATE`).
Edit: looks like Orchestrator already does this! https://github.com/openark/orchestrator/blob/de1b1ecd3f65cac447b24067d99dc56a8109fd82/go/db/db.go#L354
WAL mode scales very well, it can scale to hundreds of thousands of processes and millions of reads/s, but maybe not at the same time? :D
> Edit: looks like Orchestrator already does this!

It does!