
sstableloader fails to restore backup files created by scylla manager

Open jtnagashima opened this issue 4 years ago • 37 comments

Version:

sstableloader: 8080009794bb628380dbfed27e02cd2e653078b4

Scylla Cluster - Source Destination:

scylla-4.2.3-0.20210104.24346215c2.x86_64
scylla-conf-4.2.3-0.20210104.24346215c2.x86_64
scylla-debuginfo-4.2.3-0.20210104.24346215c2.x86_64
scylla-env-1.1-1.el7.noarch
scylla-jmx-4.2.3-20210105.e351b009da.noarch
scylla-kernel-conf-4.2.3-0.20210104.24346215c2.x86_64
scylla-libatomic73-7.3.1-1.2.el7.centos.x86_64
scylla-libgcc73-7.3.1-1.2.el7.centos.x86_64
scylla-libstdc++73-7.3.1-1.2.el7.centos.x86_64
scylla-manager-agent-2.2.2-0.20201223.32fed80f.x86_64
scylla-python3-3.8.5-0.20210104.24346215c2.x86_64
scylla-server-4.2.3-0.20210104.24346215c2.x86_64
scylla-tools-4.2.3-20210105.9d11db58267.noarch
scylla-tools-core-4.2.3-20210105.9d11db58267.noarch

Scylla Cluster - Backup Destination:

scylla-4.2.3-0.20210104.24346215c2.x86_64
scylla-conf-4.2.3-0.20210104.24346215c2.x86_64
scylla-debuginfo-4.2.3-0.20210104.24346215c2.x86_64
scylla-jmx-4.2.3-20210105.e351b009da.noarch
scylla-kernel-conf-4.2.3-0.20210104.24346215c2.x86_64
scylla-machine-image-4.2.3-20210105.ab5b881b69f.noarch
scylla-manager-agent-2.2.2-0.20201223.32fed80f.x86_64
scylla-python3-3.8.5-0.20210104.24346215c2.x86_64
scylla-server-4.2.3-0.20210104.24346215c2.x86_64
scylla-tools-4.2.3-20210105.9d11db58267.noarch
scylla-tools-core-4.2.3-20210105.9d11db58267.noarch

sstableloader

Command: ./bin/scylla-sstableloader --verbose 4 --no-prepared -d '<list of ips>' /var/lib/scylla/data/temp/ks/cf/<location of sstables files>

Error:

CQL: 'DELETE FROM "ks"."cf" USING TIMESTAMP :timestamp WHERE "table_name" = :v2 AND "start_time" > :v3 AND "start_time" <= :v5 AND "forecast_id" <= :v4;' {v2=472_1604188800000_1606780800000, v3=Mon Nov 02 07:00:00 UTC 2020, v4=8e7ba400-1b85-11eb-8080-808080808080, timestamp=1604344688326124, v5=Sun Nov 01 07:00:00 UTC 2020}
java.lang.AssertionError
	at com.scylladb.tools.SSTableToCQL$RowBuilder.setWhere(SSTableToCQL.java:400)
	at com.scylladb.tools.SSTableToCQL$RowBuilder.deleteCqlRow(SSTableToCQL.java:449)
	at com.scylladb.tools.SSTableToCQL$RowBuilder.processRangeTombstoneEndBound(SSTableToCQL.java:892)
	at com.scylladb.tools.SSTableToCQL$RowBuilder.process(SSTableToCQL.java:842)
	at com.scylladb.tools.SSTableToCQL$RowBuilder.process(SSTableToCQL.java:820)
	at com.scylladb.tools.SSTableToCQL$RowBuilder.access$500(SSTableToCQL.java:137)
	at com.scylladb.tools.SSTableToCQL.run(SSTableToCQL.java:981)
	at com.scylladb.tools.BulkLoader.process(BulkLoader.java:1540)
	at com.scylladb.tools.BulkLoader.lambda$main$1(BulkLoader.java:1364)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
  8% done.        0 statements sent (in      218 batches,        0 failed).
       0 statements generated.
       0 cql rows processed in        0 partitions.
       0 cql rows and        0 partitions deleted.
       0 local and        0 remote counter shards where skipped.

Table Schema:

CREATE TABLE ks.cf (
    table_name text,
    start_time timestamp,
    forecast_id timeuuid,
    multioutput_params frozen<list<text>>,
    values frozen<list<frozen<list<double>>>>,
    PRIMARY KEY (table_name, start_time, forecast_id)
) WITH CLUSTERING ORDER BY (start_time DESC, forecast_id DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': '2'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 315360000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

General description of issue:

We're attempting to restore a keyspace/table from backups performed by scylla-manager on one cluster. Both clusters are running the same version of Scylla and the topology is identical: 5 nodes to 5 nodes. sstableloader is being executed from a management node separate from the cluster.

We have noticed that manually running compactions on the sstables on the source cluster does seem to assist in removing the error, but we fear this is not a tenable solution.

jtnagashima avatar Jan 25 '21 21:01 jtnagashima

Same for me, but in my case I tried to import a snapshot made from Cassandra.

vodovozovge avatar Jan 27 '21 06:01 vodovozovge

The assertion itself, with the given stack trace, means that you have a range tombstone where the low bound == the high bound with an exclusive range, i.e. it is akin to a 'where x > 10 && x < 10' (or maybe 'where x > 10 && x <= 10' - the assert does not check as such). The point being that at the time of writing, these bounds were only supposed to be generated for an 'x == 10' equivalence. Without the failing sstables it is hard to say exactly what is at fault here. I guess sstableloader can be fixed to handle these rather non-sequitur files (maybe just ignore the tombstone?) - but I would like to verify the sanity of the file(s) first. It is not strange that compaction fixes the issue, since the tombstone seems nonsensical - it should be removed by pruning.

elcallio avatar Jan 27 '21 09:01 elcallio
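The contradictory bound combination behind the assert can be sketched like this (a minimal illustration over simple comparable values; `is_empty_range` is a hypothetical helper, not the loader's actual code):

```python
# Sketch of the condition the assert guards against: a clustering range whose
# bounds collapse to an empty set, e.g. "x > 10 AND x < 10" or
# "x > 10 AND x <= 10". Equal bounds are only meaningful when both ends are
# inclusive, which is the "x == 10" equivalence sstableloader expects.

def is_empty_range(low, high, low_inclusive, high_inclusive):
    """Return True if no value can satisfy low <op> x <op> high."""
    if low < high:
        return False
    if low > high:
        return True
    # Equal bounds admit exactly one value only if both ends are inclusive.
    return not (low_inclusive and high_inclusive)
```

Under this reading, the tombstone in the report is an empty range that the loader was never written to expect.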

Hi @elcallio! Thanks for the input. I would agree that having the sstableloader handle such cases would be the ideal outcome as not all users will be able to avoid edge cases in their clusters. I'm imagining a scenario where a keyspace becomes corrupted for whatever reason and the operator has to restore from backups only to be faced with this exception because of something outside their control.

jtnagashima avatar Jan 27 '21 20:01 jtnagashima

Can you provide sstables (+ schemas) with these tombstones? As I said, I would very much like to look at the data before committing a fix.

elcallio avatar Feb 01 '21 08:02 elcallio

Sure, I'll find you on Scylla slack to figure out the best way to get you the data.

jtnagashima avatar Feb 01 '21 19:02 jtnagashima

Please follow the guidelines in https://docs.scylladb.com/troubleshooting/report_scylla_problem/ - for how to upload information

Thanks for the help


slivne avatar Feb 01 '21 20:02 slivne

Files uploaded; uuid: 9DA9BFDF-5F63-4FE1-AA1A-524DB17C8C14

Included in archive:

  • README.txt
  • md-7002-big-* (sstable and related files)
  • schema.cql (schema of the keyspace/table)

jtnagashima avatar Feb 01 '21 21:02 jtnagashima

So this is what I was afraid of. The tombstone translates to basically this:

'DELETE FROM "markets_dev"."sondes_forecast" USING TIMESTAMP :timestamp 
    WHERE table_name = :v2 AND start_time > :v3 AND start_time <= :v5 
         AND sondes_forecast_id > :v4 AND sondes_forecast_id <= :v6;'

{v6=b35c9a00-d6b0-11ea-8080-808080808080, v2=1318_1596240000000_1598918400000, v3=Wed Aug 05 00:00:00 UTC 2020, v4=b35c9a00-d6b0-11ea-8080-808080808080, timestamp=1599165982948007, v5=Wed Aug 05 00:00:00 UTC 2020}

Note that this is a pointless tombstone: v3 and v5 are equal, so the condition cannot be true. But it is also invalid CQL. You cannot have multiple range restrictions; only the last one can be <> or similar. So I don't know how to deal with this. Should we just skip this sstable entry? Say "cannot match anything anyway so booh"?

elcallio avatar Feb 02 '21 14:02 elcallio

Btw. also note that the sondes_forecast_id restriction is similarly pointless. Must double-check that this data is really what is in the sstable, so it is not the java parser being flaky...

elcallio avatar Feb 02 '21 14:02 elcallio

Nope.

{
        "type" : "range_tombstone_boundary",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2020-08-17 21:00:00.000Z", "b56e2200-e0ce-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-03T20:46:22.948007Z", "local_delete_time" : "2020-09-03T20:46:22Z" }
        },
        "end" : {
          "type" : "inclusive",
          "clustering" : [ "2020-08-17 21:00:00.000Z", "b56e2200-e0ce-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-03T20:46:22.948007Z", "local_delete_time" : "2020-09-03T20:46:22Z" }
        }
      },

What the monkey...

elcallio avatar Feb 02 '21 14:02 elcallio

@tgrabiec - any idea where these weird tombstones came from? Or how to deal with them?

elcallio avatar Feb 02 '21 14:02 elcallio

@elcallio range_tombstone_boundary is a new range tombstone atom introduced in the MC format. It represents a boundary between adjacent range tombstones, as if the end of the previous one and the start of the next one were combined. The clustering position is shared.

It can be either inclusive_exclusive, or exclusive_inclusive, which determines if the next range tombstone starts before or after the boundary. If the start boundary is inclusive, the next range tombstone includes the clustering position, the previous one excludes it.

tgrabiec avatar Feb 02 '21 17:02 tgrabiec
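The boundary semantics described above can be paraphrased in code (illustrative names, not the real sstable parser's API): a `range_tombstone_boundary` stands in for the end of the previous range tombstone and the start of the next one at a single shared clustering position.

```python
# Sketch: split an MC-format boundary kind into the pair of bound types it
# represents. "inclusive_exclusive" means the previous tombstone includes the
# shared clustering position and the next one excludes it; vice versa for
# "exclusive_inclusive".

def split_boundary(kind):
    """Map a boundary kind to (end-of-previous, start-of-next) bound types."""
    if kind == "inclusive_exclusive":
        # previous tombstone includes the shared position, next one excludes it
        return ("inclusive_end", "exclusive_start")
    if kind == "exclusive_inclusive":
        # previous tombstone excludes the shared position, next one includes it
        return ("exclusive_end", "inclusive_start")
    raise ValueError(f"unknown boundary kind: {kind}")
```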

You are of course correct, and the code does already handle this as it should (or almost maybe). I am citing the wrong entry. The problem remains that we either miss one marker somewhere and get out of sync, or we still have the case of start/boundary or two boundaries with crossed points.

elcallio avatar Feb 02 '21 17:02 elcallio

So, to formulate the issue better. This is the offending tombstone (start):

"rows" : [
      {
        "type" : "range_tombstone_bound",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2020-08-30 16:00:00.000Z", "*" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        }
      },
      {
        "type" : "range_tombstone_boundary",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2019-11-30 23:00:00.000Z", "3b263200-13c7-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        },
        "end" : {
          "type" : "inclusive",
          "clustering" : [ "2019-11-30 23:00:00.000Z", "3b263200-13c7-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        }
      },
      {
        "type" : "range_tombstone_boundary",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2019-11-30 22:00:00.000Z", "d961ca00-13be-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        },
        "end" : {
          "type" : "inclusive",
          "clustering" : [ "2019-11-30 22:00:00.000Z", "d961ca00-13be-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        }
      },

We have two issues here:

1.) You cannot express a CQL WHERE clause with two clustering <> relations. We effectively have (for the first part)

DELETE FROM "markets_dev"."sondes_forecast" USING TIMESTAMP :timestamp WHERE table_name = :v2 AND start_time > :v3 AND start_time <= :v5 AND sondes_forecast_id <= :v4

This is not kosher. All clustering prefixes need to be Eq until the last one, which can be <>.

And 2.) The second range_tombstone_boundary has a value for the range end (from the first one) that is less than the start value, i.e. start value 2019-11-30 23:00:00.000Z, end value 2019-11-30 22:00:00.000Z. This confuses the sstableloader, which identifies this as the kind of shenanigans that sstable 2.x used to play sometimes with overlapping tombstones, and does things that make us end up trying to effectively process the range_tombstone_boundary as an actual tombstone range -> assert.

I am at a loss as to what to do here. I can obviously simply bypass the overlap detection if we use boundaries and assume those never have those unintentional borks. But there seems to be some other unintentional bork here as well... And I most certainly don't know how to translate these multiple clustering ranges into functioning CQL.

elcallio avatar Feb 03 '21 09:02 elcallio
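The restriction rule in issue 1.) can be sketched as a validity check (a hypothetical helper, not Scylla's actual restriction validation; a slice like `start_time > :v3 AND start_time <= :v5` counts as one "range" restriction on its column):

```python
# Sketch of the CQL rule elcallio describes: clustering columns must be
# restricted by equality up to the last restricted column, which alone may
# carry a range restriction.

def clustering_restrictions_valid(ops):
    """ops: one operator kind ("eq" or "range") per restricted clustering
    column, in clustering order."""
    return all(op == "eq" for op in ops[:-1])
```

The offending DELETE restricts both `start_time` and `sondes_forecast_id` with ranges, i.e. `["range", "range"]`, which this check rejects.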

@elcallio The schema defines descending order for the first clustering key part, so there is no violation of clustering key order.

tgrabiec avatar Feb 03 '21 10:02 tgrabiec

@dekimir You're the last one who touched CQL restrictions. How hard would it be to lift the limitation that Calle mentions: "All cluster prefixes need to be Eq until last one which can be <>.". We want to express "ck > (a, *) and ck < (b, c)" where * means null.

tgrabiec avatar Feb 03 '21 10:02 tgrabiec

Fair enough, that means there is some info not reaching the comparison in the origin java code.

elcallio avatar Feb 03 '21 10:02 elcallio

So this gets funny. It seems the driver (at least the version we use in scylla-tools-java) does not pick up on the "DESC" part of the clustering order in the schema's columns table, because the schema stores "DESC" (upper case) and the driver compares it (case-sensitively) with "desc" (lower case).

elcallio avatar Feb 03 '21 13:02 elcallio
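A minimal sketch of the described comparison bug and the obvious fix (function names are illustrative, not the driver's real API):

```python
# Sketch of the case-sensitivity bug: the clustering order read from the
# schema tables is "DESC" (as stored by Scylla), but the driver compares it
# against lowercase "desc" without normalizing case.

def is_reversed_buggy(clustering_order):
    # case-sensitive comparison: misses "DESC"
    return clustering_order == "desc"

def is_reversed_fixed(clustering_order):
    # normalize case before comparing
    return clustering_order.lower() == "desc"
```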

Not sure if we should fix this in scylla, driver or both. Now, if I manually alter schema tables so the driver gives me correct info, I get a bit further, though still of course generating invalid CQL. But then we get to this happy camper:

      {
        "type" : "range_tombstone_boundary",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2019-11-28 23:00:00.000Z", "e652b200-1234-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        },
        "end" : {
          "type" : "inclusive",
          "clustering" : [ "2019-11-28 23:00:00.000Z", "e652b200-1234-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        }
      },
      {
        "type" : "range_tombstone_boundary",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ "2019-11-28 23:00:00.000Z", "e652b200-1234-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-01T18:03:45.259189Z", "local_delete_time" : "2020-09-01T18:03:45Z" }
        },
        "end" : {
          "type" : "inclusive",
          "clustering" : [ "2019-11-28 23:00:00.000Z", "e652b200-1234-11ea-8080-808080808080" ],
          "deletion_info" : { "marked_deleted" : "2020-09-02T17:32:15.317660Z", "local_delete_time" : "2020-09-02T17:32:15Z" }
        }
      },

A boundary that literally says > x and <= x...

elcallio avatar Feb 03 '21 14:02 elcallio

@elcallio It could be interpreted as an empty range.

tgrabiec avatar Feb 03 '21 14:02 tgrabiec

We want to express "ck > (a, *) and ck < (b, c)" where * means null.

@tgrabiec Isn't that already possible?

cqlsh:ks> CREATE TABLE t2 ( p int, c1 int, c2 int, c3 int, primary key (p, c1, c2, c3));
cqlsh:ks> DELETE FROM t2 WHERE p=1 AND (c1)>(0) AND (c1,c2)<(9,9);
cqlsh:ks> 

dekimir avatar Feb 03 '21 14:02 dekimir

@dekimir 1.) Is that still valid even if the conditions are separated? 2.) The case is actually more c1,c2 <> (x,y), i.e. DELETE FROM t2 WHERE p=1 AND (c1)>(0) AND (c1) <(12) AND c2>9 and c2 < 45; - which we apparently puke on.

elcallio avatar Feb 03 '21 14:02 elcallio

@elcallio Does that mean you DON'T want to delete row (p=1,c1=1,c2=8)? That's not how I interpreted those range tombstones.

dekimir avatar Feb 03 '21 15:02 dekimir

@dekimir We have multiple CK ranges describing tombstones, each of the type "greater than x and less than y". This makes the statement parser vomit. Again, the typical statement in this case is:

'DELETE FROM "markets_dev"."sondes_forecast" USING TIMESTAMP :timestamp 
    WHERE table_name = :v2 AND start_time > :v3 AND start_time <= :v5 
         AND sondes_forecast_id > :v4 AND sondes_forecast_id <= :v6;'

(CK is start_time, sondes_forecast_id)

elcallio avatar Feb 08 '21 08:02 elcallio

@dekimir We have multiple CK ranges describing tombstones, each of the type "greater than x and less than y".

I can see that you have column ranges, but that's not the same as CK ranges. Any continuous CK range can be represented by a single valid CQL WHERE expression (either a multi-column slice or a conjunction of single-column restrictions on a CK prefix all of which are EQ/IN except possibly the last one).

If you have multiple CK ranges, you will need multiple CQL statements. This is dictated by the storage_proxy API and will take a long time to change.

dekimir avatar Feb 08 '21 15:02 dekimir

Just to make sure I also answered this one:

@dekimir 1.) Is that still valid even if the conditions are separated?

No, but there is no condition separation that's equivalent to that multi-column slice.

dekimir avatar Feb 08 '21 15:02 dekimir

Sorry, I meant a CK range over multiple columns, each with a range expression. And the column ranges cannot be expressed as IN/EQ, because they are ranges in the sstable. Maybe the expression creating them was IN/EQ, but that info is long gone, as is any info on potential value substitutions. Thus the problem.

elcallio avatar Feb 08 '21 15:02 elcallio

Sorry, I meant CK range of multiple columns, each with a range expression.

That sounds like multiple CK ranges to me. If we can devise a reliable algorithm to translate your single-column restrictions into a set of equivalent CK ranges, then it's probably better to implement that algorithm inside Scylla's delete_statement than in sstable_loader.

dekimir avatar Feb 08 '21 15:02 dekimir

You can literally see the clustering in question in the snippets from sstabledump above. It is CK ranges of multiple boundary values in GT/LT relations.

elcallio avatar Feb 08 '21 15:02 elcallio

What I see in the snippets above can be trivially represented by multi-column restrictions. Each range_tombstone_boundary translates to (start_time, sondes_forecast_id) > (lb1,lb2) AND (start_time, sondes_forecast_id) <= (ub1,ub2) (assuming ascending clustering order for both columns). I don't understand why you need single-column restrictions.

dekimir avatar Feb 08 '21 15:02 dekimir
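dekimir's suggested translation can be sketched as a small generator (the dict layout mirrors the sstabledump JSON above, but the helper itself is illustrative, not sstableloader's real code):

```python
# Sketch: turn one range tombstone's start/end bounds into a multi-column
# slice restriction, e.g.
# (start_time, sondes_forecast_id) > (...) AND (start_time, sondes_forecast_id) <= (...)
# Assumes ascending clustering order for all columns, as dekimir notes.

def slice_where(columns, start, end):
    """Build the clustering part of a DELETE WHERE clause from one CK range.

    start/end are dicts like
    {"type": "exclusive" | "inclusive", "clustering": [...]}.
    """
    cols = ", ".join(columns)
    lo_op = ">" if start["type"] == "exclusive" else ">="
    hi_op = "<" if end["type"] == "exclusive" else "<="
    lo = ", ".join(repr(v) for v in start["clustering"])
    hi = ", ".join(repr(v) for v in end["clustering"])
    return f"({cols}) {lo_op} ({lo}) AND ({cols}) {hi_op} ({hi})"
```

Each disjoint CK range would then get its own DELETE statement, per the storage_proxy constraint mentioned earlier in the thread.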