cassandra-reaper: Reaper doesn't start with "A health check named cassandra.contrail_database already exists"

We have a cluster of 3 Cassandra nodes. We are trying to run 3 Reaper instances, each in the same container as one of the Cassandra nodes. We configure datacenterAvailability in EACH mode and start one of the Reapers first; the others wait until the first one is ready.

We regularly encounter the following problem: Reaper fails with this error:

ERROR [2023-02-08 13:39:46,407] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...
java.lang.IllegalArgumentException: A health check named cassandra.contrail_database already exists
        at com.codahale.metrics.health.HealthCheckRegistry.register(HealthCheckRegistry.java:101)
        at systems.composable.dropwizard.cassandra.CassandraFactory.build(CassandraFactory.java:505)
        at systems.composable.dropwizard.cassandra.CassandraFactory.build(CassandraFactory.java:447)
        at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:231)
        at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:65)
        at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:472)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:174)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:93)
        at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:59)
        at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
        at io.dropwizard.cli.Cli.run(Cli.java:78)
        at io.dropwizard.Application.run(Application.java:94)
        at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:105)

The previous errors in Reaper:

ERROR [2023-02-08 13:37:44,630] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly...
org.cognitor.cassandra.migration.MigrationException: Error during migration of script 016_init_reaper_db.cql while executing 'CREATE TABLE IF NOT EXISTS running_reapers ( reaper_instance_id uuid PRIMARY KEY, last_heartbeat timestamp, reaper_instance_host text ) WITH bloom_filter_fp_chance = 0.1 AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'} AND default_time_to_live = 180 AND gc_grace_seconds = 180;'
        at org.cognitor.cassandra.migration.Database.execute(Database.java:269)
        at java.util.Collections$SingletonList.forEach(Collections.java:4824)
        at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)
        at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:376)
        at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:307)
        at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:265)
        at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:252)
        at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:65)
        at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:472)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:174)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:93)
        at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:59)
        at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
        at io.dropwizard.cli.Cli.run(Cli.java:78)
        at io.dropwizard.Application.run(Application.java:94)
        at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:105)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.122.11:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.11:9041] Timed out waiting for server response), /192.168.122.12:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.12:9041] Timed out waiting for server response), /192.168.122.13:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.13:9041] Timed out waiting for server response))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:83)
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
        at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:277)
        at org.cognitor.cassandra.migration.Database.execute(Database.java:261)
        ... 15 common frames omitted
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.122.11:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.11:9041] Timed out waiting for server response), /192.168.122.12:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.12:9041] Timed out waiting for server response), /192.168.122.13:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.13:9041] Timed out waiting for server response))
        at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:283)
        at com.datastax.driver.core.RequestHandler.access$1200(RequestHandler.java:61)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:375)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.retry(RequestHandler.java:563)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.processRetryDecision(RequestHandler.java:545)
        at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:987)
        at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1636)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
        at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)

In Cassandra:

INFO [Native-Transport-Requests-1] 2023-02-08 13:37:08,381 MigrationManager.java:376 - Create new table: org.apache.cassandra.config.CFMetaData@68cf6a94[cfId=aee97cd0-a7b5-11ed-9e8c-bd6e0747cb3d,ksName=reaper_db,cfName=running_reapers,flags=[COMPOUND],params=TableParams{comment=, read_repair_chance=0.0, dclocal_read_repair_chance=0.1, bloom_filter_fp_chance=0.1, crc_check_chance=1.0, gc_grace_seconds=180, default_time_to_live=180, memtable_flush_period_in_ms=0, min_index_interval=128, max_index_interval=2048, speculative_retry=99PERCENTILE, caching={'keys' : 'ALL', 'rows_per_partition' : 'NONE'}, compaction=CompactionParams{class=org.apache.cassandra.db.compaction.LeveledCompactionStrategy, options={}}, compression=org.apache.cassandra.schema.CompressionParams@bdf6c555, extensions={}, cdc=false},comparator=comparator(),partitionColumns=[[] | [last_heartbeat reaper_instance_host]],partitionKeyColumns=[reaper_instance_id],clusteringColumns=[],keyValidator=org.apache.cassandra.db.marshal.UUIDType,columnMetadata=[reaper_instance_id, last_heartbeat, reaper_instance_host],droppedColumns={},triggers=[],indexes=[]]
INFO [MigrationStage:1] 2023-02-08 13:37:47,012 ColumnFamilyStore.java:411 - Initializing reaper_db.running_reapers
ERROR [MigrationStage:1] 2023-02-08 13:37:55,565 CassandraDaemon.java:228 - Exception in thread Thread[MigrationStage:1,5,main]
org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found b611ea60-a7b5-11ed-8d38-017e52d8b7b5; expected aee97cd0-a7b5-11ed-9e8c-bd6e0747cb3d)
        at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:941)
        at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:895)
        at org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
        at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1474)
        at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1430)
        at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1399)
        at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1376)
        at org.apache.cassandra.db.DefinitionsUpdateVerbHandler$1.runMayThrow(DefinitionsUpdateVerbHandler.java:51)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)

After these errors, nothing helps except dropping the reaper_db keyspace and restarting. Most of the time, Reaper then starts successfully.

Before starting, Reaper waits for Cassandra to be up on the JMX and CQL ports, and then waits another 15 seconds.

It is reproducible on our setup every time.

What can be done to prevent it?

Attachments: cassandra-reaper.yaml.txt, debug.log.txt, reaper.log.txt, system.log.txt

Feb 09 '23 10:02 tikitavi

[root@control1 /]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

cqlsh> show VERSION
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]

REAPER_VERSION=3.2.1

Feb 10 '23 10:02 tikitavi

Is anybody here?

Feb 14 '23 10:02 tikitavi

Is anybody here?

Yes, many people 🙂

Your stack traces show that you have timeouts in the schema migration:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.122.11:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.11:9041] Timed out waiting for server response), /192.168.122.12:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.12:9041] Timed out waiting for server response), /192.168.122.13:9041 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.122.13:9041] Timed out waiting for server response))

The error you're seeing in Cassandra shows that your instances have a schema mismatch:

org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found b611ea60-a7b5-11ed-8d38-017e52d8b7b5; expected aee97cd0-a7b5-11ed-9e8c-bd6e0747cb3d)

This means there have been conflicting schema updates for the same table. The health check message is just a side effect of the previous failures.
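To make the "conflicting schema updates" point concrete, here is a minimal toy model in Java. It is not Cassandra's code: NodeSchema, createIfNotExists and merge are hypothetical names, and the sketch assumes that a node assigns a fresh table id when it handles a CREATE for a table it has not seen yet. Under that assumption, two nodes that each process the same CREATE before hearing from the other end up with different ids, and the later schema merge is rejected.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class TableIdMismatchDemo {

    /** Hypothetical model of one node's local schema: table name -> table id. */
    static class NodeSchema {
        private final Map<String, UUID> tables = new HashMap<>();

        /** Local CREATE TABLE IF NOT EXISTS: a fresh id is assigned only if the table is unknown here. */
        UUID createIfNotExists(String table) {
            return tables.computeIfAbsent(table, t -> UUID.randomUUID());
        }

        /** Applies a schema change received from a peer; a different id for a known table is rejected. */
        void merge(String table, UUID remoteId) {
            UUID localId = tables.putIfAbsent(table, remoteId);
            if (localId != null && !localId.equals(remoteId)) {
                throw new IllegalStateException(
                        "Column family ID mismatch (found " + remoteId + "; expected " + localId + ")");
            }
        }
    }

    /** Both nodes handle the CREATE before exchanging schema: the merge fails. */
    static boolean concurrentCreateConflicts() {
        NodeSchema a = new NodeSchema();
        NodeSchema b = new NodeSchema();
        UUID idA = a.createIfNotExists("running_reapers");
        b.createIfNotExists("running_reapers"); // b has not heard from a yet
        try {
            b.merge("running_reapers", idA);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Node b learns about the table from a before any local CREATE: no conflict. */
    static boolean sequentialCreateConflicts() {
        NodeSchema a = new NodeSchema();
        NodeSchema b = new NodeSchema();
        UUID idA = a.createIfNotExists("running_reapers");
        try {
            b.merge("running_reapers", idA);
            b.createIfNotExists("running_reapers"); // no-op, table already known
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("concurrent create conflicts: " + concurrentCreateConflicts());
        System.out.println("sequential create conflicts: " + sequentialCreateConflicts());
    }
}
```

In this model, only the concurrent case conflicts, which is why serializing the migrations (one instance at a time) avoids the mismatch.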

If you're collocating Reaper with Cassandra (each Cassandra node has its own Reaper instance), then I'd recommend using SIDECAR instead of EACH. To recover from this problem, you'll need to stop the Reaper instances, drop the reaper_db keyspace and try again. I'm not entirely sure how you got to this point, as there is leader election happening in the migration code to prevent concurrent table creations...

You can also use the schema migration mode if you want to have better control over the migrations: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/server/src/main/docker/entrypoint.sh#L54-L67

Feb 14 '23 10:02 adejanovski

Thanks! Each Cassandra node has its own Reaper, but the Cassandra nodes form one cluster and share the same database. Will Reaper in SIDECAR mode support that? Should I set cassandra.contactPoints to all three Cassandra nodes or only to the local one?

Feb 14 '23 10:02 tikitavi

And one more question: if there are timeout settings for the migration, maybe I can increase them? @adejanovski

Feb 14 '23 12:02 tikitavi

And one more question: if there are timeout settings for the migration, maybe I can increase them?

It's not a real timeout, your migration is failing due to a table id mismatch.

will Reaper in SIDECAR support it?

Yes. EACH is made for multi DC setups. The simplest thing you can do is set it to ALL, if all Reaper instances can access all the Cassandra nodes through JMX directly. If not, then SIDECAR is what you need.

Should I set up cassandra.contactPoints to all three Cassandras or to the local one?

The local one will do, but it'll work either way. The SIDECAR part really affects JMX communications, not CQL communications.

Feb 14 '23 12:02 adejanovski

My assumption was that Reaper may have sent a request to Cassandra to create or alter the table, received no response and timed out, which is why the 'no hosts' error appeared, and that the mismatch then happened on the retry. In that case, longer timeouts might help, I suppose.

Feb 14 '23 14:02 tikitavi

@tikitavi were you able to fix it? I'm able to consistently reproduce this problem.

Apr 19 '23 04:04 npnwdev

@tikitavi @adejanovski we see this frequently on many of our clusters as well (running in sidecar mode). dropping/recreating the keyspace usually works but repair schedules and runs are lost.

would it be beneficial to introduce a startup flag that would forcefully skip the normal database migration logic?

thanks!

Apr 21 '23 17:04 denniskline

I've encountered a very similar error. My first exception is different, but the second one is the same and repeats indefinitely.

Look, there is a loop in ReaperApplication where we attempt to initialize storage until it succeeds without an exception.

A valid approach, but there is a problem:

During each iteration it creates the metric inside cassandraFactory.build(environment), and then, a few lines later, it calls initializeAndUpgradeSchema, which can also fail. In my case it fails with

com.datastax.driver.core.exceptions.NoHostAvailableException

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.56.20.29:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency SERIAL (2 required but only 1 alive)))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:83)
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
        at org.cognitor.cassandra.migration.Database.removeLeadOnMigrations(Database.java:370)
        at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:75)
        at io.cassandrareaper.storage.CassandraStorage.migrate(CassandraStorage.java:380)
        at io.cassandrareaper.storage.CassandraStorage.initializeCassandraSchema(CassandraStorage.java:311)
        at io.cassandrareaper.storage.CassandraStorage.initializeAndUpgradeSchema(CassandraStorage.java:269)
        at io.cassandrareaper.storage.CassandraStorage.<init>(CassandraStorage.java:256)
        at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:65)
        at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:474)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:175)
        at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:94)
        at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:59)
        at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
        at io.dropwizard.cli.Cli.run(Cli.java:78)
        at io.dropwizard.Application.run(Application.java:94)
        at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:106)

In the topic starter's case it failed with org.cognitor.cassandra.migration.MigrationException, but that doesn't really matter: any other exception thrown at this point leads to the same behavior.

The second attempt tries to create the metric again, fails because it already exists, and this repeats forever.
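The sequence described above can be reproduced with a small self-contained Java sketch. It does not use Reaper or Dropwizard classes: Registry and initializeStorage are hypothetical stand-ins, and the only behavior assumed from the stack traces in this thread is that registering an already-registered name throws IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HealthCheckRetryDemo {

    /** Hypothetical stand-in for a registry that rejects duplicate names. */
    static class Registry {
        private final Map<String, Runnable> checks = new HashMap<>();

        void register(String name, Runnable check) {
            if (checks.containsKey(name)) {
                throw new IllegalArgumentException("A health check named " + name + " already exists");
            }
            checks.put(name, check);
        }
    }

    /** One initialization attempt: register the check first, then run a migration that may fail. */
    static void initializeStorage(Registry registry, boolean migrationFails) {
        registry.register("cassandra.reaper_db", () -> { });
        if (migrationFails) {
            throw new IllegalStateException("migration failed");
        }
    }

    /** Retries initialization against the same registry and records the outcome of each attempt. */
    static List<String> runAttempts(int attempts) {
        Registry registry = new Registry();
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            try {
                // Only the first attempt hits the transient migration failure.
                initializeStorage(registry, i == 0);
                outcomes.add("ok");
                break;
            } catch (RuntimeException e) {
                outcomes.add(e.getClass().getSimpleName());
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // prints [IllegalStateException, IllegalArgumentException, IllegalArgumentException]
        System.out.println(runAttempts(3));
    }
}
```

Even though the migration would succeed from the second attempt on, the loop never gets that far, because the registration left behind by the first attempt makes every later attempt fail earlier.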

Jun 21 '23 15:06 kkolyan

We're observing the same issue in our environments. The first exception is caused by the schema agreement timeout firing during the migration:

org.cognitor.cassandra.migration.MigrationException

ERROR  [2024-02-10 10:28:22,612] [main] i.c.ReaperApplication - Storage is not ready yet, trying again to connect shortly... 
org.cognitor.cassandra.migration.MigrationException: Error during migration of script 024_node_metrics_v3_partitioning.cql while executing 'DROP TABLE IF EXISTS node_metrics_v2;'
	at org.cognitor.cassandra.migration.Database.execute(Database.java:269)
	at java.util.Collections$SingletonList.forEach(Collections.java:4824)
	at org.cognitor.cassandra.migration.MigrationTask.migrate(MigrationTask.java:68)
	at io.cassandrareaper.storage.cassandra.MigrationManager.migrate(MigrationManager.java:171)
	at io.cassandrareaper.storage.cassandra.MigrationManager.initializeCassandraSchema(MigrationManager.java:101)
	at io.cassandrareaper.storage.cassandra.MigrationManager.initializeAndUpgradeSchema(MigrationManager.java:59)
	at io.cassandrareaper.storage.cassandra.CassandraStorageFacade.<init>(CassandraStorageFacade.java:153)
	at io.cassandrareaper.storage.InitializeStorage.initializeStorageBackend(InitializeStorage.java:66)
	at io.cassandrareaper.ReaperApplication.tryInitializeStorage(ReaperApplication.java:474)
	at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:175)
	at io.cassandrareaper.ReaperApplication.run(ReaperApplication.java:94)
	at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:59)
	at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
	at io.dropwizard.cli.Cli.run(Cli.java:78)
	at io.dropwizard.Application.run(Application.java:94)
	at io.cassandrareaper.ReaperApplication.main(ReaperApplication.java:106)
Caused by: org.cognitor.cassandra.migration.MigrationException: Schema agreement could not be reached. You might consider increasing 'maxSchemaAgreementWaitSeconds'.
	at org.cognitor.cassandra.migration.Database.executeStatement(Database.java:281)
	at org.cognitor.cassandra.migration.Database.execute(Database.java:261)
	... 15 common frames omitted

This seems to be a recoverable case, and a retry is expected to help here.

But the retry loop (described by @kkolyan) then never succeeds, because it attempts to register the CassandraHealthCheck that has already been registered. This results in the "A health check ... already exists" exception.
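One conceivable way to make such a retry loop recoverable is to roll back the registration when a later initialization step fails. The Java sketch below only illustrates that idea with a stand-in Registry class; initializeStorage and attemptsUntilSuccess are hypothetical names, and this is not a patch for Reaper. Dropwizard Metrics' HealthCheckRegistry does offer an unregister(String) method, but whether Reaper's loop can call it at the right point is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

class RetrySafeInitDemo {

    /** Hypothetical stand-in for a registry that rejects duplicate names. */
    static class Registry {
        private final Set<String> names = new HashSet<>();

        void register(String name) {
            if (!names.add(name)) {
                throw new IllegalArgumentException("A health check named " + name + " already exists");
            }
        }

        void unregister(String name) {
            names.remove(name);
        }
    }

    /** Registers the check, runs the migration, and undoes the registration if the migration fails. */
    static void initializeStorage(Registry registry, boolean migrationFails) {
        String name = "cassandra.reaper_db";
        registry.register(name);
        try {
            if (migrationFails) {
                throw new IllegalStateException("Schema agreement could not be reached");
            }
        } catch (RuntimeException e) {
            // Roll back so that the next attempt can register the same name again.
            registry.unregister(name);
            throw e;
        }
    }

    /** Returns the 1-based attempt that succeeded, or -1 if none did within maxAttempts. */
    static int attemptsUntilSuccess(int failingAttempts, int maxAttempts) {
        Registry registry = new Registry();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                initializeStorage(registry, attempt <= failingAttempts);
                return attempt;
            } catch (RuntimeException e) {
                // Transient failure: fall through and try again.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two transient migration failures, then success on the third attempt.
        System.out.println(attemptsUntilSuccess(2, 5)); // prints 3
    }
}
```

With the rollback in place, a transient migration failure no longer poisons the following attempts; the loop only gives up if the migration itself keeps failing.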

Used versions:

cassandra: 4.1.3
cassandra-reaper: 3.3.4

Feb 14 '24 13:02 sealexer
