cassandra-medusa

Failed to find uploaded object

Open mohammad-aburadeh opened this issue 2 years ago • 6 comments


We are running Medusa on a single-rack cluster of 9 nodes. After running full cluster backups, Medusa fails after a long time, and it always fails on the same object (300 GB). I can confirm the object does exist in the AWS S3 bucket, but I'm not sure why Medusa reports that it failed to find the uploaded object. Could you please help? This is a production cluster. We are using Cassandra 3.11.2.

```
[2022-11-18 22:18:30,998] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-29511-big-Data.db (30.297GiB)
[2022-11-18 22:18:31,091] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-29511-big-Index.db (50.202MiB)
[2022-11-18 22:19:27,206] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-29511-big-Summary.db (4.068KiB)
[2022-11-18 22:19:27,253] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-29511-big-Digest.crc32 (9.000B)
[2022-11-18 22:19:27,293] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-29511-big-Filter.db (30.117KiB)
[2022-11-18 22:19:27,386] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Statistics.db (17.652KiB)
[2022-11-18 22:19:27,457] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-TOC.txt (92.000B)
[2022-11-18 22:19:27,532] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-CompressionInfo.db (4.201MiB)
[2022-11-18 22:19:32,222] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Data.db (6.882GiB)
[2022-11-18 22:34:40,281] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Index.db (10.072MiB)
[2022-11-18 22:34:51,848] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Summary.db (866.000B)
[2022-11-18 22:34:51,897] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Digest.crc32 (10.000B)
[2022-11-18 22:34:51,952] INFO: Uploading /data/Cassandra/3.11.2/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/snapshots/medusa-2022-11-18_1722-full/mc-37070-big-Filter.db (6.039KiB)
[2022-11-19 10:40:04,228] ERROR: Error occurred during backup: Failed to find uploaded object cluster-01/cass-01/2022-11-18_1722-full/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/mc-23596-big-Data.db in S3
[2022-11-19 10:40:05,815] ERROR: Issue occurred inside handle_backup Name: 2022-11-18_1722-full Error: Failed to find uploaded object cluster-01/cass-01/2022-11-18_1722-full/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/mc-23596-big-Data.db in S3
[2022-11-19 10:40:05,815] INFO: Updated from existing status: 0 to new status: 2 for backup id: 2022-11-18_1722-full
[2022-11-19 10:40:05,815] ERROR: Error occurred during backup: Failed to find uploaded object cluster-01/cass-01/2022-11-18_1722-full/data/production_exndb/res1-ac8b1b00218411ec9c63e7061b9094bf/mc-23596-big-Data.db in S3
```

mohammad-aburadeh avatar Nov 20 '22 08:11 mohammad-aburadeh

We tried changing transfer_max_bandwidth to different values (50, 100, 250, and 500 MB/s), but that did not help. Increasing/decreasing the value of concurrent_transfers did not help either. For reference, the settings we tuned are shown below.
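Both knobs live in the [storage] section of medusa.ini. A minimal sketch with an illustrative bucket name and one of the values we cycled through (not a recommendation):

```ini
[storage]
storage_provider = s3
bucket_name = my-backup-bucket     ; hypothetical bucket, for illustration only
; per-transfer throughput cap; we tried 50MB/s, 100MB/s, 250MB/s and 500MB/s
transfer_max_bandwidth = 100MB/s
; number of parallel file transfers; we tried raising and lowering this too
concurrent_transfers = 1
```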

mohammad-aburadeh avatar Nov 20 '22 08:11 mohammad-aburadeh

Same symptom here, and we took the same actions, but the backup still failed on all nodes. We are using awscli 1.19.x and Medusa 0.13.4.

-Sheng

m934030039 avatar Nov 20 '22 17:11 m934030039

Hi @adejanovski
Could you please help here?

mohammad-aburadeh avatar Nov 21 '22 12:11 mohammad-aburadeh

My best guess here is that, depending on factors I'm really unsure of, when we exceed the maximum number of parts for a file (and awscli doesn't always seem to self-tune the chunk size), the parts over the limit get ignored... silently. That's why we're not able to read the file through the API afterwards, leading to the observed behavior.
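To put rough numbers on that theory (back-of-the-envelope, not measured from Medusa itself): S3 hard-caps a multipart upload at 10,000 parts, and awscli's default multipart_chunksize is 8 MiB, which tops out far below a ~300 GB file:

```python
# Sanity check of the part-limit theory. The 300 GB figure comes from
# this issue; the limits are standard S3 / awscli defaults.
MAX_PARTS = 10_000                          # S3 limit per multipart upload
DEFAULT_CHUNK = 8 * 1024 ** 2               # awscli default multipart_chunksize (8 MiB)

largest_ok = MAX_PARTS * DEFAULT_CHUNK
print(f"largest upload with 8 MiB parts: {largest_ok / 1024 ** 3:.0f} GiB")  # 78 GiB

file_size = 300 * 1000 ** 3                 # the ~300 GB SSTable that keeps failing
needed_chunk = -(-file_size // MAX_PARTS)   # ceil division: smallest part size that fits
print(f"part size needed for 300 GB: {needed_chunk / 1024 ** 2:.0f} MiB")    # ~29 MiB
```

Anything sent past part 10,000 at the default chunk size would be exactly the kind of silent truncation described above.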

I think we can work around this by passing awscli a flag that specifies the size of the file (as silly as it sounds, since I'd assume it could detect that itself 🤷), to force a chunk size that keeps us under the max part count. An illustrative invocation follows.
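Concretely, the flag in question would be aws s3 cp's --expected-size, which exists for exactly this case: when the source is a stream (as with our piped uploads), the CLI can't stat it and won't size the parts correctly above ~50 GB. A sketch, with a made-up bucket and key and the size given in bytes:

```sh
# Stream an SSTable to S3; --expected-size lets awscli pick a part size
# that keeps the upload under the 10,000-part limit (size in bytes).
cat mc-29511-big-Data.db | aws s3 cp - s3://my-backup-bucket/path/to/mc-29511-big-Data.db \
    --expected-size 300000000000
```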

Our longer-term goal is to stop relying on awscli (or gsutil, or the Azure CLI) for multipart uploads altogether, by contributing to libcloud to make its multipart uploads safer and more efficient.

adejanovski avatar Jun 19 '23 13:06 adejanovski

Hi, we no longer rely on the CLI utils, so this might no longer be an issue.

@mohammad-aburadeh, in case you're still facing this, could you please give it a go with a recent Medusa (e.g. 0.20.1)?

rzvoncek avatar Apr 04 '24 13:04 rzvoncek

Thanks @rzvoncek. I will upgrade Medusa.

mohammad-aburadeh avatar Apr 06 '24 09:04 mohammad-aburadeh

I hope the new Medusa helped. I don't see any newly opened issues, so I suppose things are stable for now. Please don't hesitate to reach out in case you need more help.

rzvoncek avatar Jun 11 '24 10:06 rzvoncek