
[BUG] Backups to Backblaze B2 sporadically fail with SignatureDoesNotMatch

Open cdhowie opened this issue 1 year ago • 8 comments

Describe the bug

We have a pretty bog-standard backup configuration targeting Backblaze B2's S3-compatible endpoint. Fairly consistently, backups will fail with a SignatureDoesNotMatch error coming back from B2. The error doesn't always occur at the same point. For example, I've seen it occur at 2%, 48%, and other random percentages.

More troubling, when the error occurs we'll often see it in the instance manager logs, but the backup resource in Kubernetes will not transition to the Error state; it stays stuck in InProgress. This is a rather serious problem, because if it happens during a recurring backup then Longhorn effectively suspends all recurring backup jobs until a cluster operator notices and manually deletes the stuck backup resource. (This is likely a separate issue from the SignatureDoesNotMatch error, but I thought it was worth noting.)
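
For reference, the manual cleanup looks roughly like this (assuming the default longhorn-system namespace; the resource name is a placeholder):

    # list Backup resources and their state to find the one stuck in InProgress
    kubectl -n longhorn-system get backups.longhorn.io \
        -o custom-columns=NAME:.metadata.name,STATE:.status.state
    # delete the stuck Backup so recurring jobs can resume
    kubectl -n longhorn-system delete backups.longhorn.io <stuck-backup-name>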

[rocket-r-b738e4e1] time="2024-09-01T06:49:32Z" level=error msg="Failed to back up volume rocket snapshot 3145537f-d983-4cfb-9a99-8826ca00c1ac block at offset 4729077760 size 2097152" func=backupstore.backupMapping file="deltablock.go:438" error="failed to put object: longhorn-backups/backupstore/volumes/31/b1/rocket/blocks/71/75/71756799a7b82366685937cdec6bb5c795858d9edd62393e3459d72cbf86642f.blk response: {\n\n} error: AWS Error:  SignatureDoesNotMatch Signature validation failed <nil>\n403 87ca83f9ba55dd88\n"
[rocket-r-b738e4e1] time="2024-09-01T06:49:32Z" level=error msg="Failed to backup volume rocket snapshot 3145537f-d983-4cfb-9a99-8826ca00c1ac" func=backupstore.performBackup file="deltablock.go:548" error="failed to put object: longhorn-backups/backupstore/volumes/31/b1/rocket/blocks/71/75/71756799a7b82366685937cdec6bb5c795858d9edd62393e3459d72cbf86642f.blk response: {\n\n} error: AWS Error:  SignatureDoesNotMatch Signature validation failed <nil>\n403 87ca83f9ba55dd88\n"
time="2024-09-01T06:49:32Z" level=error msg="Failed to perform backup for volume rocket snapshot 3145537f-d983-4cfb-9a99-8826ca00c1ac" func=backupstore.CreateDeltaBlockBackup.func3 file="deltablock.go:295" error="failed to put object: longhorn-backups/backupstore/volumes/31/b1/rocket/blocks/71/75/71756799a7b82366685937cdec6bb5c795858d9edd62393e3459d72cbf86642f.blk response: {\n\n} error: AWS Error:  SignatureDoesNotMatch Signature validation failed <nil>\n403 87ca83f9ba55dd88\n"

To Reproduce

I have not found a consistent way to reproduce other than to configure backups targeting Backblaze B2 and try backing up a volume. It does seem to happen more frequently the larger a volume is, but the problem by its nature seems to occur somewhat unpredictably.

Expected behavior

The backup should not cause a SignatureDoesNotMatch error, and if an error does occur, the backup should properly transition to the Error state.

Support bundle for troubleshooting

I can provide a bundle if absolutely necessary, but I'm not sure it will be terribly useful (except possibly for Longhorn's own pod logs), given that the issue seems specific to the S3 backup client when used with B2; I would think any cluster configured to back up to B2 should be able to observe it. (I also find the scope of the bundle somewhat invasive, capturing seemingly everything but secrets, and I don't exactly want to publish the entire state of a private cluster.)

Environment

  • Longhorn version: 1.6.2
  • Impacted volume (PV): Any volume, though we see it most often with the "rocket" PV due to its size.
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Vanilla Kubernetes (kubeadm) v1.30.4
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: Debian Bookworm
    • Kernel version: 6.1.0
    • CPU per node: 4
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 1Gbps theoretical, ~730Mbps measured; typical observed inter-node traffic is ~1Mbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Linode VMs, "bare metal" install (not using LKE)
  • Number of Longhorn volumes in the cluster: 20

cdhowie avatar Sep 01 '24 07:09 cdhowie

Hi @cdhowie,

Could you try these tests first?

  1. Check whether the time on the client/server is synchronized. Clock drift affects S3 request signatures.
  2. Could you use another tool like aws-cli to check whether S3 operations against the bucket succeed? (See the sketch after this list.)
  3. Create new credentials to check whether the backup works.
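
For point 2, something along these lines exercises the same PutObject path the backup uses (the endpoint, region, and bucket below are placeholders; substitute your B2 values):

    dd if=/dev/urandom of=probe.blk bs=1M count=2
    aws configure set aws_access_key_id <keyID>
    aws configure set aws_secret_access_key <applicationKey>
    aws --endpoint-url https://s3.us-west-004.backblazeb2.com --region us-west-004 \
        s3api put-object --bucket <bucket> --key longhorn-test/probe.blk --body probe.blk
    aws --endpoint-url https://s3.us-west-004.backblazeb2.com --region us-west-004 \
        s3api head-object --bucket <bucket> --key longhorn-test/probe.blk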

mantissahz avatar Sep 02 '24 01:09 mantissahz

Thanks for the reply and the suggestions.

Check whether the time on the client/server is synchronized. Clock drift affects S3 request signatures.

The servers are all kept in sync via NTP, and I just verified that the specific server the most recent failure originated from is in sync with sub-second drift.
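
For anyone else verifying this, the offset can be checked with something like the following, depending on whether the node runs chrony or systemd-timesyncd:

    chronyc tracking                 # "System time" line shows the current offset from NTP
    timedatectl timesync-status      # offset reported by systemd-timesyncd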

Could you use another tool like aws-cli to check whether S3 operations against the bucket succeed?

I tested the credentials with rclone before initially configuring Longhorn, though I did so using the B2 backend and not the S3 backend. I can verify that it works with an S3 client this week.
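
A quick S3-backend check could look roughly like this, using rclone's per-remote environment variables (the remote name B2S3, endpoint, and bucket are placeholders):

    export RCLONE_CONFIG_B2S3_TYPE=s3
    export RCLONE_CONFIG_B2S3_PROVIDER=Other
    export RCLONE_CONFIG_B2S3_ENDPOINT=https://s3.us-west-004.backblazeb2.com
    export RCLONE_CONFIG_B2S3_ACCESS_KEY_ID=<keyID>
    export RCLONE_CONFIG_B2S3_SECRET_ACCESS_KEY=<applicationKey>
    rclone copy ./testfile B2S3:<bucket>/s3-client-test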

However, do note that most block uploads succeed; the failures are intermittent and occur at different points when the backup is retried. Far more individual block uploads succeed than fail (but of course a single failure causes the whole backup to fail).

Create new credentials to check whether the backup works.

I'll give this a shot. I also want to test using S3 proper and see if the issues go away.

I haven't ruled out an issue on Backblaze's side, though we are storing quite a lot of data in other B2 buckets and have never seen this problem with any other S3 client before -- it only seems to happen with Longhorn, so I'm very suspicious that there's a subtle bug in whatever S3 client library Longhorn uses that may only manifest when it is used with B2.
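
One way to narrow that down would be to hammer B2 with an independent client and see whether the same error ever shows up. A rough sketch (endpoint, region, and bucket are placeholders; the 2 MiB object size matches the block size in the logs above):

    # upload many 2 MiB objects in a row; intermittent SignatureDoesNotMatch should surface here
    # if the problem is on the B2 side rather than in Longhorn's S3 client
    dd if=/dev/urandom of=block.blk bs=2097152 count=1
    for i in $(seq 1 500); do
        aws --endpoint-url https://s3.us-west-004.backblazeb2.com --region us-west-004 \
            s3api put-object --bucket <bucket> --key sigtest/block-$i.blk --body block.blk \
            >/dev/null || echo "upload $i failed"
    done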

cdhowie avatar Sep 02 '24 03:09 cdhowie

Could you provide the support bundle? And could you first share the backup target URL you set in the Longhorn settings?
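
For comparison, a working B2 target normally looks something like the following (bucket, region, and endpoint are placeholders), with the endpoint supplied via AWS_ENDPOINTS in the credential secret:

    Backup Target:                    s3://<bucket>@us-west-004/
    Backup Target Credential Secret:  b2-backup-secret

    kubectl -n longhorn-system create secret generic b2-backup-secret \
        --from-literal=AWS_ACCESS_KEY_ID=<keyID> \
        --from-literal=AWS_SECRET_ACCESS_KEY=<applicationKey> \
        --from-literal=AWS_ENDPOINTS=https://s3.us-west-004.backblazeb2.com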

mantissahz avatar Sep 02 '24 07:09 mantissahz

@mantissahz

Just wanted to follow up because I am seeing this issue as well. Stuff I've tested:

  • Time is synced on the systems, so this doesn't appear to be the issue.
  • Tested that the S3 bucket is accessible with the token: I am able to log in using the AWS CLI and see data in the bucket. Not an issue AFAIK.

What is strange is that this doesn't appear to be consistent: around 1/3 of the backups fail. The interesting part is that they don't fail right away; they get ~20-80% of the way through the backup and then fail. So I'm thinking it's one of two possible scenarios:

  1. The authentication works for a while, then suddenly runs into an issue, fails, and gives up on the backup.
  2. The authentication never works, and it runs into an issue as soon as it tries the first upload.

There is no size issue as such, since both small and large volumes fail; I think we notice the large volumes failing more often simply because they have more chunks being uploaded.

This was working a couple of months ago, but when I checked, the backups had silently started failing, which causes a ripple effect: nothing is getting backed up.

Nathan-Nesbitt avatar Sep 04 '24 00:09 Nathan-Nesbitt

I’ve seen this in the past too and sadly had to swap to AWS S3 (at the expense of my wallet!)

mike12806 avatar Sep 05 '24 02:09 mike12806

Thanks everyone for chiming in.

I am splitting this into two issues. I observed both of them together, but they should be separate, I think:

  • B2 backups sometimes fail with SignatureDoesNotMatch even though credentials are valid and time is in sync (this issue).
  • A failure during backup sometimes does not cause the k8s Backup record to transition to the Error state (#9417).

cdhowie avatar Sep 07 '24 21:09 cdhowie

Just chiming in to say that I also encountered this SignatureDoesNotMatch error on Backblaze B2. Today, I switched over to iDrive e2 and ran into the same thing! I managed to successfully back up a ~70GB volume, which is more than I can say for Backblaze, but another backup attempt on a 3 TB volume failed 58 GB in.

Edit: Running v1.6.3

stove-panini avatar Oct 20 '24 23:10 stove-panini

cc @mantissahz

derekbit avatar Oct 21 '24 02:10 derekbit

Jumping on this bandwagon to say that I am seeing the exact same issue with B2 on 1.7.2.

terratrax avatar Nov 12 '24 15:11 terratrax

@mantissahz Let's address the B2 compatibility issue in v1.9.0. cc @innobead

derekbit avatar Nov 22 '24 13:11 derekbit

Similar issue here on my homelab running 1.9.0. Small volumes always work, but a somewhat bigger one at 130G fails every time. Last night it got to 21% and failed.

Name: daily-ba-c2b4f183-8050-421a-b715-909ce2ad1547
Created: 2025-07-04T03:00:02Z
Size: 0 Bi
Created By User: True
Removed: False
Progress: 21%
Replicas: pvc-f7455187-5b88-49ad-85ed-4a90a4daa2fb-r-e0462dca
Backups: backup-cc311f5984a3498e

pvc-f7455187-5b88-49ad-85ed-4a90a4daa2fb-r-e0462dca: failed to write data during saving blocks: failed to put object: longhorn-backups/backupstore/volumes/e6/0c/pvc-f7455187-5b88-49ad-85ed-4a90a4daa2fb/blocks/9c/13/9c13f4fc4debe4ffc978c7d909e3ff5c1205b1054e75e456bc3c9567f9ba51ae.blk response: { } error: AWS Error: AccessDenied Signature validation failed <nil> 403 0f2717fd8fbb8f16

hannut avatar Jul 04 '25 07:07 hannut

@mantissahz Please help investigate and address the issue in v1.10.

derekbit avatar Jul 06 '25 03:07 derekbit

Having the same issue, large backup gets to about 12% before failing. Wondering if Backblaze B2 just sucks at this point because whenever you see issues like this pop up online (not just for Longhorn), it's almost always Backblaze B2 that is being used.

CppBunny avatar Aug 02 '25 12:08 CppBunny

I'd be pretty surprised if it's backblaze. Been using them for many years and no issues; I switched to velero for my volume backups because of this and that works fine, even on the very large initial uploads.

danmur avatar Aug 02 '25 13:08 danmur

I'd be pretty surprised if it's backblaze. Been using them for many years and no issues; I switched to velero for my volume backups because of this and that works fine, even on the very large initial uploads.

Disabled versioning on my B2 bucket and the errors seem to have stopped. Backups completed with no issues.
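
For anyone trying the same thing: on B2, "versioning" behavior is governed by the bucket's lifecycle settings (e.g. "Keep only the last version of the file") in the Backblaze console or B2 API. Whether the S3-compatible endpoint reports it can be checked with something like the following (endpoint and bucket are placeholders, and it assumes B2's S3 layer answers this call):

    aws --endpoint-url https://s3.us-west-004.backblazeb2.com \
        s3api get-bucket-versioning --bucket <bucket>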

CppBunny avatar Aug 02 '25 16:08 CppBunny

Disabled versioning on my B2 bucket and the errors seem to have stopped. Backups completed with no issues.

I wonder if it's just coincidence. We've never had versioning enabled and used to hit this issue fairly regularly.

cdhowie avatar Aug 02 '25 17:08 cdhowie