atlasdb
CassandraVerifier#waitForSchemaVersions blocks on Cassandra 3 upgrade
https://github.com/palantir/atlasdb/blob/develop/atlasdb-cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraVerifier.java#L213
The CassandraVerifier waits for schema agreement, which is generally a smart thing to do when creating a keyspace. However, during a Cassandra 3 upgrade, the schema versions will be misaligned until all of the nodes are fully upgraded. This can potentially take as long as the time to rewrite every sstable.
While that function is waiting, services cannot be started and backups cannot be taken, so the status quo is not maintainable for the duration of the upgrade.
Some paths forward:
- Remove this check entirely. Need to understand dangers here.
- Disable this check via config. Since config is set at the client level, this could potentially be enabled/disabled dynamically. Only disable for the Cassandra 3 upgrade. What happens if services try to create keyspaces/upgrade schemas at this time?
- Find another form of schema compatibility check that doesn't flag on Cassandra 3 upgrade. If it doesn't exist, write it into Cassandra?
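To make the config option concrete, here is a minimal sketch of a schema agreement check gated on a flag that is re-read on every call, so it could be flipped at runtime. The class name, method name, and the `BooleanSupplier` flag are all hypothetical and not part of AtlasDB; the input map (schema version to nodes) is assumed to have the same shape as the grouping the verifier already logs. Unlike the real `waitForSchemaVersions`, this sketch checks once and does not retry.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not AtlasDB's actual implementation. */
final class ConfigurableSchemaCheck {
    private ConfigurableSchemaCheck() {}

    /**
     * Throws if the cluster reports more than one schema version,
     * unless the check has been disabled.
     *
     * @param nodesBySchemaVersion schema version -> nodes reporting that version
     * @param checkEnabled hypothetical flag, evaluated on every call so that
     *                     it can be toggled without a restart
     */
    static void verifySchemaAgreement(
            Map<String, List<String>> nodesBySchemaVersion,
            BooleanSupplier checkEnabled) {
        if (!checkEnabled.getAsBoolean()) {
            // Check disabled, e.g. for the duration of the Cassandra 3 upgrade.
            return;
        }
        if (nodesBySchemaVersion.size() > 1) {
            throw new IllegalStateException(
                    "Cassandra cluster cannot come to agreement on schema versions: "
                            + nodesBySchemaVersion);
        }
    }
}
```

This only skips the check; it does not answer the open question of what happens if a service creates a keyspace or changes a schema while the flag is off.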
Are they misaligned in a decidable way?
In theory, it should be possible to determine which nodes are on which version of Cassandra and correlate that with the schema version mismatch to decide whether the upgrade is the cause. Example (top three nodes are on C*3):
WARN [2020-03-31T15:50:08.874099Z] com.palantir.atlasdb.keyvalue.cassandra.CassandraVerifier: Couldn't use host {} to create keyspace. It returned exception "{}" during the attempt. We will retry on other nodes, so this shouldn't be a problem unless all nodes failed. See the debug-level log for the stack trace. (host: xx) (exceptionMessage: java.lang.IllegalStateException: Cassandra cluster cannot come to agreement on schema versions, while checking if schemas diverged on startup.
At schema version c95060a4-47f9-3a58-b230-808818ba043c:
Node: 1.x.x.36
Node: 1.x.x.224
Node: 1.x.x.186
At schema version e1243782-0562-3675-869c-de8ff87e799d:
Node: 1.x.x.69
Node: 1.x.x.118
Node: 1.x.x.160
Node: 1.x.x.66
Node: 1.x.x.155
Node: 1.x.x.234
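A rough sketch of that decision, assuming we can get each node's Cassandra release version from somewhere (the `release_version` column in `system.local` / `system.peers` is one possible source; how it is fetched is left open here). The class and method names are hypothetical. The split is treated as "explained by the upgrade" only if every schema version group sits on exactly one release version and no two groups share a release version.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of AtlasDB. */
final class SchemaSplitClassifier {
    private SchemaSplitClassifier() {}

    /**
     * @param nodesBySchemaVersion schema version -> nodes reporting that version
     * @param releaseVersionByNode node -> Cassandra release version
     *                             (assumed to be supplied by the caller)
     * @return true if the schema split lines up exactly with the release versions
     */
    static boolean explainedByBinaryVersions(
            Map<String, List<String>> nodesBySchemaVersion,
            Map<String, String> releaseVersionByNode) {
        Set<String> releasesAlreadySeen = new HashSet<>();
        for (List<String> nodes : nodesBySchemaVersion.values()) {
            Set<String> releasesInGroup = new HashSet<>();
            for (String node : nodes) {
                String release = releaseVersionByNode.get(node);
                if (release == null) {
                    // Unknown node version: the split cannot be attributed.
                    return false;
                }
                releasesInGroup.add(release);
            }
            // Each schema group must map to exactly one release version.
            if (releasesInGroup.size() != 1) {
                return false;
            }
            // Two schema groups on the same release is a real disagreement.
            if (!releasesAlreadySeen.addAll(releasesInGroup)) {
                return false;
            }
        }
        return true;
    }
}
```

For the log above, this would return true if the three nodes in the first group all report a 3.x release and the six nodes in the second group all report the same pre-upgrade release. A node on the new binary that still shows up in the old schema group makes it return false.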
This is actually slightly less of a concern than I originally thought.
Schema version is tied to the binary upgrade, not the sstable upgrade. So the time of impact is the time it takes to upgrade the entire cluster, rather than the time to rewrite the sstables.