cassandra-medusa icon indicating copy to clipboard operation
cassandra-medusa copied to clipboard

[WIP] Sanity checking around backing up files

Open kiddom-kq opened this issue 3 years ago • 8 comments

While trying to deploy Medusa, i ran into a few issues. The root cause of #390 was an issue file systems permissions and the silent-failure property of python's glob()

This PR implements some of the debug logging that I wish I had had while troubleshooting and a basic sanity check to abort execution as soon as an error is observed rather than waiting for a (misleading) failure at a later point in execution.

┆Issue is synchronized with this Jira Task by Unito ┆friendlyId: K8SSAND-1398 ┆priority: Medium

kiddom-kq avatar Aug 24 '21 17:08 kiddom-kq

Would ValueError be more appropriate than the base Exception that's being raised? I didn't see too many custom exception types in the Medusa code base and of the few, none really fit this case. There are several places in the code bases where raise Exception(error_msg) is used so i'm not sure if there's some lint-disable comment i can leave there to prevent SonarCloud from objecting.

kiddom-kq avatar Aug 24 '21 18:08 kiddom-kq

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

sonarqubecloud[bot] avatar Aug 30 '21 16:08 sonarqubecloud[bot]

@kiddom-kq, it looks like this broke the integration tests. You can run them locally using ./run_integration_tests.sh, with a prereq to have ccm installed (pip install ccm). Let me know if you need some assistance to debug this.

adejanovski avatar Sep 02 '21 09:09 adejanovski

Would ValueError be more appropriate than the base Exception that's being raised? I didn't see too many custom exception types in the Medusa code base and of the few, none really fit this case. There are several places in the code bases where raise Exception(error_msg) is used so i'm not sure if there's some lint-disable comment i can leave there to prevent SonarCloud from objecting.

Any improvement getting a more specific use of exceptions is welcome. Feel free to make any change necessary.

adejanovski avatar Sep 06 '21 09:09 adejanovski

@adejanovski

I am having some trouble understanding the test.

 ./run_integration_tests.sh --test=16  --cassandra-version=2.2.19 -vv

That returns a failure because the assert statement is thrown.

What is supposed to happen when find_dirs() has no dirs?

Looking at the test:

    Scenario Outline: Perform a differential backup over gRPC , verify its index, then delete it over gRPC with Jolokia
        Given I have a fresh ccm cluster with jolokia "<client encryption>" running named "scenario16"
        Given I am using "<storage>" as storage provider in ccm cluster "<client encryption>" with gRPC server
        Then the gRPC server is up
        When I create the "test" table in keyspace "medusa"
        When I load 100 rows in the "medusa.test" table
        When I run a "ccm node1 nodetool flush" command
        When I perform a backup over gRPC in "differential" mode of the node named "grpc_backup"
        Then the backup index exists
        Then I verify over gRPC that the backup "grpc_backup" exists
        Then I can see the backup index entry for "grpc_backup"
        Then I can see the latest backup for "127.0.0.1" being called "grpc_backup"
        Then I delete the backup "grpc_backup" over gRPC
        Then I verify over gRPC the backup "grpc_backup" does not exist
        Then I shutdown the gRPC server

It looks like When I perform a backup over gRPC in "differential" mode of the node named "grpc_backup" is the first time that a backup is taken of the scenario16 cluster. If that is correct, how can a table with 100 rows in it have no directories that need to be backed up?

When I disable the assert, the execution continues to call add_backup_finish_to_index(storage, node_backup) and then set_latest_backup_in_index(storage, node_backup)

Those functions put a record of the backup happening in the storage/index... even though nothing was backed up. The next line of the test: Then the backup index exists would see that something was added to the index and the test continues on happy to delete the (empty) backup that it just made.

Can you confirm that the differential backup should return 0 directories?

kiddom-kq avatar Sep 07 '21 17:09 kiddom-kq

@kiddom-kq, differential backups put the sstables into a data directory that's at the same level than the backup directories. So you should have something like this: Capture d’écran 2021-10-07 à 15 57 54

While full backups use this layout: Capture d’écran 2021-10-07 à 15 58 28

Those 100 rows should definitely be backed up in the data directory.

adejanovski avatar Oct 07 '21 13:10 adejanovski

Hi @kiddom-kq, is this still something you're working on?

adejanovski avatar Mar 25 '22 16:03 adejanovski

ping @kiddom-kq

rzvoncek avatar Jun 17 '24 12:06 rzvoncek