Make DESCRIBE SCHEMA safe for restoring schema
Scylla docs suggest that restoring schema should be done via CQL and DESCRIBE SCHEMA, but it has been known to be dangerous for a long time.
One of the problems with the output of DESCRIBE SCHEMA is that it does not contain information about dropped columns. When restoring both schema and user data, the lack of information about dropped columns might cause incorrect interpretation of user data that still contains rows with those columns.
That's the main reason why Scylla Manager restores schema via sstables and not CQL, but I'm not sure if it is the only one.
Restoring schema via sstables is less tested, more problematic, and dangerous (see this issue). For this reason, it would be great to change DESCRIBE SCHEMA so that its output can be trusted for restoring backups.
AFAIR DESCRIBE SCHEMA isn't a part of CQL protocol, but some additional feature provided by cqlsh, so the changes to DESCRIBE SCHEMA would also need to be applied to corresponding driver methods. E.g. in gocql we have KeyspaceMetadata which works like DESCRIBE SCHEMA for a given keyspace:
func (s *Session) KeyspaceMetadata(keyspace string) (*KeyspaceMetadata, error) { ... }
cc: @mykaul @karol-kokoszka @kbr-scylla @tgrabiec
(Incomplete) list of things missing from DESCRIBE SCHEMA:
- dropped columns
- gc_mode (#14390)
@kostja - is this something your team should work on?
We have DESC SCHEMA WITH INTERNALS which is supposed to add dropped columns. We have a test for DESC TABLE WITH INTERNALS in test/cql-pytest/test_describe.py but I don't think we have a test for DESC SCHEMA WITH INTERNALS.
Perhaps if there is resistance in https://github.com/scylladb/scylladb/issues/14390 to add gc_mode to the general DESC SCHEMA, it can be added to just DESC SCHEMA WITH INTERNALS?
AFAIR DESCRIBE SCHEMA isn't a part of CQL protocol, but some additional feature provided by cqlsh, so the changes to DESCRIBE SCHEMA would also need to be applied to corresponding driver methods.
This has changed recently (in fact, I'm not sure it was changed already, @mykaul ?) - in the old days it was provided by cqlsh, in modern cqlsh it isn't - it uses the "DESCRIBE" CQL command. It's a good question which of the two we support today - I'm confused already. The test/cql-pytest/test_describe.py I mentioned earlier tests the server-side.
In any case, as far as I know, there are no "driver methods" to do a "describe schema" - the driver only lets you read the schema tables, from which you can build a CQL command if you want, but you have no functions to do it. Only cqlsh contained code which used the schema tables to build a CQL command - and this was the "DESCRIBE SCHEMA" people got used to.
E.g. in gocql we have KeyspaceMetadata which works like DESCRIBE SCHEMA for a given keyspace:
func (s *Session) KeyspaceMetadata(keyspace string) (*KeyspaceMetadata, error) { ... }
Does it create CQL commands (like cqlsh) or just some internal representation of the schema? If it's the latter, it's not exactly the same thing.
When combined with ToCQL it returns CQL commands just like DESCRIBE SCHEMA (in general this approach indeed works by querying system_schema tables and creating CQL commands on the driver side).
// ToCQL returns a CQL query that can be used to recreate keyspace with all
// user defined types, tables, indexes, functions, aggregates and views associated
// with this keyspace.
func (km *KeyspaceMetadata) ToCQL() (string, error) {
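To illustrate that client-side approach (reading schema metadata and assembling CQL text, which is roughly what KeyspaceMetadata plus ToCQL do), here is a minimal sketch. The TableInfo/ColumnInfo types and the toCQL helper are hypothetical stand-ins for rows read from system_schema, not actual gocql types:

```go
package main

import (
	"fmt"
	"strings"
)

// ColumnInfo and TableInfo are hypothetical stand-ins for rows a driver
// would read from system_schema.columns / system_schema.tables.
type ColumnInfo struct {
	Name string
	Type string
	Key  bool // part of the primary key
}

type TableInfo struct {
	Keyspace string
	Name     string
	Columns  []ColumnInfo
}

// toCQL assembles a CREATE TABLE statement from the metadata, mimicking what
// cqlsh and gocql's KeyspaceMetadata.ToCQL do on the client side. Note this is
// exactly where dropped-column information gets lost: the regular schema
// tables describe only the current columns, so nothing here would emit the
// ALTER TABLE ... DROP / ADD statements needed for a safe restore.
func toCQL(t TableInfo) string {
	var cols, pk []string
	for _, c := range t.Columns {
		cols = append(cols, fmt.Sprintf("    %s %s", c.Name, c.Type))
		if c.Key {
			pk = append(pk, c.Name)
		}
	}
	return fmt.Sprintf("CREATE TABLE %s.%s (\n%s,\n    PRIMARY KEY (%s)\n);",
		t.Keyspace, t.Name, strings.Join(cols, ",\n"), strings.Join(pk, ", "))
}

func main() {
	t := TableInfo{
		Keyspace: "mks",
		Name:     "mt2",
		Columns:  []ColumnInfo{{"id", "int", true}, {"data", "text", false}},
	}
	fmt.Println(toCQL(t)) // prints a CREATE TABLE statement for mks.mt2
}
```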
Not sure it was done - https://github.com/scylladb/scylla-cqlsh/issues/17
We have DESC SCHEMA WITH INTERNALS which is supposed to add dropped columns.
Does it work? I tried to run those commands via cqlsh with Scylla 5.4.0 with raft enabled and the output didn't contain dropped columns info.
cassandra@cqlsh> CREATE TABLE IF NOT EXISTS mks.mt2 (id int PRIMARY KEY, data text);
cassandra@cqlsh> insert into mks.mt2 (id, data) values (1, 'a');
cassandra@cqlsh> alter table mks.mt2 drop data;
cassandra@cqlsh> DESCRIBE TABLE mks.mt2 WITH INTERNALS
CREATE TABLE mks.mt2 (
id int PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
cassandra@cqlsh> select * from system_schema.dropped_columns;
keyspace_name | table_name | column_name | dropped_time | type
---------------+------------+-------------------------------+------------------+------
mks | mt | data | 1703168023998775 | text
mks | mt2 | data | 1703168454815962 | text
system | local | scylla_cpu_sharding_algorithm | 1703167821320186 | text
system | local | scylla_msb_ignore | 1703167821320188 | int
system | local | scylla_nr_shards | 1703167821320187 | int
(5 rows)
We have DESC SCHEMA WITH INTERNALS which is supposed to add dropped columns.
Does it work? I tried to run those commands via cqlsh with Scylla 5.4.0 with raft enabled and the output didn't contain dropped columns info.
The test test/cql-pytest/test_describe.py::test_desc_table_internals does ALTER TABLE {tbl} DROP b and then verifies that the output of DESC TABLE WITH INTERNALS contains the string ALTER TABLE {tbl} DROP b. It also re-adds another column called b, and verifies now we see both the add and the drop in the desc output. So apparently it should work. I don't know what is different in your use case. It's very strange and worrying.
Are you using a modern cqlsh which uses the server-side DESC command, or Scylla's older cqlsh? Note that the test I mentioned above doesn't use cqlsh - it runs "DESC TABLE" as a CQL command, not client-side code.
I'm using the cqlsh that is shipped alongside Scylla 5.4.0 docker image.
But ALTER TABLE ks.t DROP cf doesn't fully restore the state of system_schema.dropped_columns, because it has a different drop timestamp. Assume the following scenario:
- create table (a, b)
- insert some data (a, b)
- drop column b
- create column b
- insert some data (a, b)
- make a backup (save DESCRIBE SCHEMA and user sstables)
IIUC restoring schema from DESCRIBE SCHEMA would either:
- with drop stmt - leave column b as dropped - not good
- without drop stmt - leave column b as it has never been dropped - when restoring user sstables we can see data resurrection (values inserted to b column before it was dropped are back to live)
Is this correct?
In your case, DESC TABLE WITH INTERNALS should (and the test verifies that it does) generate the commands
ALTER TABLE {tbl} DROP b
ALTER TABLE {tbl} ADD b int
in the output, in this order. There are no timestamps, but whoever repeats these commands in this order will reach a situation like you wanted to reach - we have the knowledge that b existed, but was dropped. Isn't this exactly what we need?
But I'm not a DESC expert (maybe @mykaul can say who is, I lost track) so please take what I write with a grain of salt.
in the output, in this order. There are no timestamps, but whoever repeats these commands in this order will reach a situation like you wanted to reach - we have the knowledge that b existed, but was dropped. Isn't this exactly what we need?
I think that the problem here is that the contents of restored user sstables have lower timestamps than the DROP and ADD statements. So this would result in dropping all the data from column b - which is incorrect from the POV of the cluster state at the time of the backup.
As far as I know there is no relationship between the timestamps in the data and the time of drop or add. In any case, the user can't re-add a dropped column with a different type than it used to have. That's the main reason why WITH INTERNALS exist - to add this memory of the type of column that was dropped. I don't think we need to save the timestamp of when a drop happened - the user can even add and drop and add and drop the same column 10 times, and all that matters is what this type was (even if at this moment, the last statement was a drop).
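That bookkeeping can be pictured with a small model (my reading of the comment above, with illustrative names, not Scylla's actual internals): repeated drops of the same column overwrite a single record, and re-adding is only allowed with the remembered type - which is why WITH INTERNALS has to preserve the dropped type but not the timestamps:

```go
package main

import "fmt"

// droppedColumns is an illustrative model of the per-table record of dropped
// columns: one entry per column name, remembering the type it had when dropped.
type droppedColumns map[string]string

// drop records a dropped column; a later drop simply overwrites the entry,
// so only the most recent dropped type is kept.
func (d droppedColumns) drop(name, typ string) { d[name] = typ }

// canReAdd models the constraint that a dropped column may only be re-added
// with the same type it used to have.
func (d droppedColumns) canReAdd(name, typ string) bool {
	old, dropped := d[name]
	return !dropped || old == typ
}

func main() {
	d := droppedColumns{}
	d.drop("b", "int")
	fmt.Println(d.canReAdd("b", "int"))  // true: same type as before the drop
	fmt.Println(d.canReAdd("b", "text")) // false: changing the type is rejected
}
```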
@avelanarius may know, but he's OOO - @roydahan - anyone else can answer the question from https://github.com/scylladb/scylladb/issues/16482#issuecomment-1866448261 (perhaps @fruch , as the question is on cqlsh support for server side describe via CQLSH)
@avelanarius may know, but he's OOO - @roydahan - anyone else can answer the question from #16482 (comment) (perhaps @fruch , as the question is on cqlsh support for server side describe via CQLSH)
None of us is an expert on the client-side DESC; one can open the code and take a look.
Anyhow, at the point when someone gets to https://github.com/scylladb/scylla-cqlsh/issues/17, it would be coming out of scylla core, where I know even less how it works.
@Michal-Leszczynski I disagree with bug title, though. It should state something like "Provide a way to restore a single keyspace from an incremental backup as per our documentation procedure".
So the problem basically is that when you follow the docs and your keyspace has tables with dropped columns, restore from the incremental backup doesn't work. Perhaps do a major compaction after dropping the column, would that work as a viable workaround?
(I don't disagree it's an issue, but I am trying to grasp the urgency of it).
DESCRIBE is part of the server-side grammar since 2022:
commit c563b2133ccd63cffa99b207414be605e159fe64
Author: Michał Jadwiszczak <>
Date: Wed Nov 9 10:35:25 2022 +0100
cql3: Extend CQL grammar with server-side describe statement
So perhaps indeed we only need to extend its output with dead columns...
@Michal-Leszczynski I disagree with bug title, though. It should state something like "Provide a way to restore a single keyspace from an incremental backup as per our documentation procedure".
So the problem basically is that when you follow the docs and your keyspace has tables with dropped columns, restore from the incremental backup doesn't work. Perhaps do a major compaction after dropping the column, would that work as a viable workaround?
(I don't disagree it's an issue, but I am trying to grasp the urgency of it).
The context is the scylla-manager restore feature. It's currently restoring the schema by sstables, and not by CQL; it's causing multiple issues, and people keep pointing the manager team to switch to using CQL for restoring the schema.
But this issue is for stating that also using CQL is problematic
And scylla-manager can't force users to do major compactions, so I'm not sure it's a viable workaround (or one might suggest Scylla would automatically do a major compaction on those changes).
But ALTER TABLE ks.t DROP cf doesn't fully restore the state of system_schema.dropped_columns, because it has a different drop timestamp. Assume the following scenario:
- create table (a, b)
- insert some data (a, b)
- drop column b
- create column b
- insert some data (a, b)
- make a backup (save DESCRIBE SCHEMA and user sstables)
IIUC restoring schema from DESCRIBE SCHEMA would either:
- with drop stmt - leave column b as dropped - not good
- without drop stmt - leave column b as it has never been dropped - when restoring user sstables we can see data resurrection (values inserted to b column before it was dropped are back to live)
Is this correct?
A DESCRIBE SCHEMA which would make it safe for restore would have to include column ids not just the names.
It should state something like "Provide a way to restore a single keyspace from an incremental backup as per our documentation procedure".
Why would this issue target only incremental backup? I think that restoring a snapshot (even without incremental backup files) has the same problem.
So this example (tested with Scylla 5.4.0 and raft schema) should illustrate the reason why, even when DESC SCHEMA includes the ADD/DROP of a column, it can't be used by SM when restoring a backup.
Backup procedure:
- create table tab (id int, a text)
- insert into tab (1, 'a')
- alter table tab drop a
- alter table tab add a text
- insert into tab (2, 'b')
- make a backup consisting of DESC SCHEMA and tab sstables
At this point the contents of tab looks like:
- (1, null) - because a was dropped when this row was already in the table
- (2, 'b')
Restore procedure (on a different cluster):
- restore schema via saved DESC SCHEMA (CREATE TABLE -> DROP a -> ADD a)
- upload tab sstables to cluster via L&S
- repair cluster
At this point the contents of tab looks like:
- (1, null)
- (2, null) - because DROP from restored schema has a newer timestamp than the insert of this row in the original cluster (at least that's my interpretation)
So this example shows that the restored cluster has different rows than the backed-up cluster, and that's the reason why SM can't currently use DESC SCHEMA, even with DROP/ADD statements, for restoring schema that could correctly interpret the contents of the backup.
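The timeline above can be simulated with plain timestamp reconciliation. This is a sketch of my reading of the rule - a cell of a dropped column survives only if it was written strictly after the drop - with illustrative names, not Scylla internals:

```go
package main

import "fmt"

// cell is one written value together with its write timestamp (microseconds).
type cell struct {
	ts    int64
	value string
}

// visibleAfterDrop applies the reconciliation rule: a cell of a column that
// was dropped at dropTS survives only if it was written after the drop.
// Load-and-stream keeps original cell timestamps, so restored cells end up
// compared against the *replayed* DROP's much newer timestamp.
func visibleAfterDrop(c cell, dropTS int64) bool {
	return c.ts > dropTS
}

func main() {
	// Original cluster: (2, 'b') inserted at ts=100, after the DROP at ts=50,
	// so the value is visible.
	fmt.Println(visibleAfterDrop(cell{ts: 100, value: "b"}, 50)) // true

	// Restored cluster: the replayed DROP gets a fresh timestamp (say 1000),
	// newer than the restored cell's original ts=100, so the value turns null.
	fmt.Println(visibleAfterDrop(cell{ts: 100, value: "b"}, 1000)) // false
}
```

This matches the observed result: the restored table shows (2, null) even though the backup contained (2, 'b').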
* (2, null) - because `DROP` from restored schema has a newer timestamp than the insert of this row in the original cluster (at least that's my interpretation)
According to this experiment, I may have misunderstood what exactly happens when you drop a column. I thought (incorrectly?) that the timestamp of the DROP or ADD operation has no relation at all to the timestamp of the data: Scylla's "reconciliation" code, which reads data from multiple sstables and multiple nodes and decides what the final output will be, does not look at schema timestamp - it only looks at the current schema at the time of the read or compaction. However, my understanding does not fit your experiment. Do you (or anyone else reading this) know if load-and-stream has some special reconciliation logic which is different from what normal Scylla does?
Do you (or anyone else reading this) know if load-and-stream has some special reconciliation logic which is different from what normal Scylla does?
I don't think so. Here is a fragment of a conversation on a related issue:
I agree with Tomek. The reasoning behind gc_grace_seconds is to give the admin time to run repair in that period. It doesn't guarantee anything else. With repair garbage collection mode, things are different. Tombstones won't be purged until a later repair successfully repaired the respective token. We compare the tombstone deleted_at time point against the repair time to determine if repair ran after the tombstone was written or before it. @asias, in load_and_stream mode, do we "refresh" tombstones deletion time to the current time, or do we restore the original time? What about ttl expiration times? When data expires it effectively becomes a tombstone and the expiration time becomes the deletion time. How do we treat that in load_and_stream?
No, load and stream does not rewrite the deletion time. It writes what is read without any modification.
AFAIR DESCRIBE SCHEMA isn't a part of CQL protocol, but some additional feature provided by cqlsh, so the changes to DESCRIBE SCHEMA would also need to be applied to corresponding driver methods. E.g. in gocql we have KeyspaceMetadata which works like DESCRIBE SCHEMA for a given keyspace:
In more recent ScyllaDB DESCRIBE SCHEMA is implemented server side, so we should focus on fixing that. cqlsh should be adjusted to use server-side DESCRIBE when available.
Note that snapshots also contain a schema dump (generated by server-side DESCRIBE).
Sorry, I see that was already discussed.
Related Apache Cassandra issues:
Add schema to snapshot manifest https://issues.apache.org/jira/browse/CASSANDRA-7190 (part of Apache Cassandra 3.0)
Serialize table schema as a sstable component (open) https://issues.apache.org/jira/browse/CASSANDRA-9587
AFAIU current state of DESCRIBE SCHEMA WITH INTERNALS CANNOT be safely used for restore purposes.
I already presented an example in which dropping and adding a ~table~ column before backup might result in broken restored data.
To be fair, I tested it with SM repo setup where there might be some problems with used cql version (to be investigated), but I manually added DROP/ADD statements from expected DESC SCHEMA WITH INTERNALS output that didn't help. I asked @Jadw1 for help with getting correct output of DESC SCHEMA WITH INTERNALS and he got the following format (from Scylla master):
cqlsh:ks1> desc schema with internals;
CREATE KEYSPACE ks1 WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '1'} AND durable_writes = true;
CREATE TABLE ks1.t (
a int,
b int,
PRIMARY KEY (a)
) WITH ID = f9184d80-a963-11ee-866d-29c5c2bc6f78
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1
AND dclocal_read_repair_chance = 0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0
AND speculative_retry = '99.0PERCENTILE'
AND paxos_grace_seconds = 864000
AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
ALTER TABLE ks1.t DROP b USING TIMESTAMP 1704195849017301;
ALTER TABLE ks1.t ADD b int;
It looked promising (inclusion of drop timestamp could perhaps solve the whole problem), but it turns out that ALTER TABLE ... USING TIMESTAMP ... is a syntax error! So the output of DESC SCHEMA WITH INTERNALS seems to be broken as it is and it does not play well with restore after removing the USING TIMESTAMP option.
Also, I don't think that it's just a syntax problem, but rather that currently Scylla internals can't safely handle ALTER TABLE ... USING TIMESTAMP ..., as suggested in this open issue and here by @kbr-scylla:
although I think that it would require to alter schema with provided timestamp
It shouldn't be needed, besides, that is also unsafe to do with Raft-based schema, where we try to ensure that timestamps are monotonically growing consistently with the order of schema change commands committed through Raft. Allowing the user to specify a timestamp would break this invariant, making some later schema changes invisible or worse.
cc: @kostja @mykaul @tzach @karol-kokoszka
related to https://github.com/scylladb/scylladb/issues/12389
@kostja - is this on track to 6.0? We agreed we'll move to this post 5.4. What's the latest?
I think @Jadw1 can take it next after the service levels. The path here is to make sure the syntax printed by DESCRIBE SCHEMA WITH INTERNALS is actually legal. I can't think of a reason why it wouldn't be. If we find any obstacles, we can then decide to reschedule.