Restic restore failing when restoring backup to different EKS cluster in different AWS region
What steps did you take and what happened:
Setup
- I am running two Amazon EKS clusters, one in us-east-2 (primary) and one in us-west-1 (secondary)
- Kubernetes version 1.23
- I created an S3 bucket in us-east-2.
Steps
- I installed velero 1.9.2 (using Helm) with Restic on both clusters, using the same bucket, with access mode set to ReadWrite in primary and ReadOnly in secondary
- I installed wordpress on the primary cluster using helm
- I created a backup of the wordpress namespace on the primary cluster (the commands used are sketched just after the error output below)
- I deleted the wordpress namespace from the primary cluster
- I successfully restored the wordpress namespace, resources, and volumes from the backup. So far so good.
- I then tried to restore the same backup on the secondary cluster.
- The Kubernetes resources were restored, but Restic failed to restore the volumes.
- I can see that EBS volumes were in fact created in the secondary region, but Restic failed to restore the data.
- Errors from restore describe as follows:
Errors:
  Velero:  restic repository is not ready: error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header : exit status 1
           restic repository is not ready: error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header : exit status 1
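For context, the backup and restore above were created with commands roughly like the following (a sketch from memory; the backup name and flags are illustrative rather than exact):

# On the primary cluster: back up the wordpress namespace (volumes go through Restic
# because defaultVolumesToRestic is true in the Helm values below)
velero backup create wordpress-backup --include-namespaces wordpress --wait

# Restore from that backup (run on the primary after deleting the namespace,
# then again on the secondary cluster)
velero restore create --from-backup wordpress-backup --wait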
I am confused by the fact that the restore action is executing a restic init. The repository already exists, so shouldn't it just need an integrity check?
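(As an aside, an existing repository can be opened and verified without an init. A rough sketch, assuming the restic CLI, valid AWS credentials, and a local copy of the repository password are available on a workstation; the password file path below is hypothetical:)

# Point restic at the existing repository created by Velero
export RESTIC_PASSWORD_FILE=./velero-restic-repo-password   # hypothetical local copy of the repo password
restic -r s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress snapshots

# Run an integrity check against the same repository
restic -r s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress check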
See attached debug bundle.
My helm values file for velero on the secondary cluster is as follows (the primary is similar, but with ReadWrite access mode and a different role with the same permissions):
image:
  repository: velero/velero
  tag: v1.9.2
  pullPolicy: IfNotPresent
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.5.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  provider: aws
  # features: EnableCSI
  defaultVolumesToRestic: true
  backupStorageLocation:
    name: primary
    bucket: 043124067543-velero-primary
    accessMode: ReadOnly
    default: true
    config:
      region: us-east-2
deployRestic: true
restic:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  # Pod priority class name to use for the Restic daemonset. Optional.
  priorityClassName: ""
  # Resource requests/limits to specify for the Restic daemonset deployment. Optional.
  # https://velero.io/docs/v1.6/customize-installation/#customize-resource-requests-and-limits
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1024Mi
serviceAccount:
  server:
    create: true
    name: veleros3
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::043124067543:role/ServiceAccount-Velero-Backup-Secondary"
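For reference, the chart was installed roughly like this (a sketch; the release name, namespace, and values file name are illustrative):

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero --create-namespace \
  -f values-secondary.yaml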
What did you expect to happen:
I expected the Restic volume restore to work in the secondary region, just as it did in the primary region.
The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options please refer to velero debug --help.

If you are using earlier versions:
Please provide the output of the following commands (pasting long output into a GitHub gist or other pastebin is fine):
- kubectl logs deployment/velero -n velero
- velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
- velero backup logs <backupname>
- velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
- velero restore logs <restorename>
Anything else you would like to add: I am not sure if this is a bug or if I am doing something wrong in my config. It works fine on the same cluster in the same region, so what is different about a different cluster/region that might require a different config parameter somewhere?
Environment:
- Velero version (use velero version): 1.9.2
- Velero features (use velero client config get features): <NOT SET>
- Kubernetes version (use kubectl version): v1.23.10-eks-15b7512
- Kubernetes installer & version: EKS
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Amazon Linux 2
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
- :+1: for "I would like to see this bug fixed as soon as possible"
- :-1: for "There are more important bugs to focus on right now"

bundle-2022-10-05-20-15-10.tar.gz
Just adding the output of velero restic repo get wordpress-primary-8gzr6 -o yaml on the secondary cluster. Again it shows the failed restic init on a repo that already exists. Why is it doing an init?
apiVersion: velero.io/v1
kind: ResticRepository
metadata:
  creationTimestamp: "2022-10-05T18:23:05Z"
  generateName: wordpress-primary-
  generation: 3
  labels:
    velero.io/storage-location: primary
    velero.io/volume-namespace: wordpress
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:velero.io/storage-location: {}
          f:velero.io/volume-namespace: {}
      f:spec:
        .: {}
        f:backupStorageLocation: {}
        f:maintenanceFrequency: {}
        f:resticIdentifier: {}
        f:volumeNamespace: {}
      f:status:
        .: {}
        f:message: {}
        f:phase: {}
    manager: velero-server
    operation: Update
    time: "2022-10-05T18:23:26Z"
  name: wordpress-primary-8gzr6
  namespace: velero
  resourceVersion: "39841"
  uid: ddca26ef-88e9-4055-bb99-b778038b8cb7
spec:
  backupStorageLocation: primary
  maintenanceFrequency: 168h0m0s
  resticIdentifier: s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress
  volumeNamespace: wordpress
status:
  message: |-
    error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header
    : exit status 1
  phase: NotReady
I'm not sure what's going on, but I notice in that error message that something is trying to access the bucket using a us-west-1 URL rather than us-east-2. It could be that some code in the restic/velero codebase is pulling region from the wrong location.
Thanks @sseago. Yes, no matter what I try, Restic ignores the region I have set and tries to connect to the S3 bucket using the region that the cluster is running in.
I have tried all of the following, with no success:
- backupStorageLocation.config.region: us-east-2
- backupStorageLocation.config.s3Url: https://043124067543-velero-primary.s3.dualstack.us-east-2.amazonaws.com/
- volumeSnapshotLocation.config.region: us-east-2
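(For reference, these settings were applied through the Helm values; they can equally be patched straight onto the BSL object. A sketch, assuming the BSL is named primary as in the values file above:)

# Set/override the region and endpoint on the BackupStorageLocation
kubectl patch backupstoragelocation primary -n velero --type merge \
  -p '{"spec":{"config":{"region":"us-east-2","s3Url":"https://043124067543-velero-primary.s3.dualstack.us-east-2.amazonaws.com/"}}}'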
So the question boils down to: what is the correct way to tell Restic to use a bucket in a different region to the one it is running in?
Looking at the restic docs, I think I need to figure out a way to get velero to add the option -o s3.region="us-east-2" when calling restic init. Is there any way to configure velero to add option parameters to restic commands?
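In other words, the repository command would need to end up looking something like this (a sketch only, based on the command from the error above plus the extended option from the restic S3 docs; Velero currently offers no way to append it):

# Desired invocation for a cross-region bucket (not currently possible via Velero)
restic init \
  --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress \
  --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password \
  -o s3.region="us-east-2"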
There is no easy way to add a new parameter to the Restic command. From the Restic documentation you posted, I think adding an environment variable AWS_DEFAULT_REGION to the Velero server deployment may make it work.
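For example, something along these lines (a sketch of the suggestion above; the restic daemonset would likely need the same variable, since it runs its own restic commands):

# Add the region hint to the Velero server deployment
kubectl set env deployment/velero -n velero AWS_DEFAULT_REGION=us-east-2

# Likely also needed on the restic daemonset (name assumes the chart default)
kubectl set env daemonset/restic -n velero AWS_DEFAULT_REGION=us-east-2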
I'm not sure what's going on here. Restic shouldn't be using the region the cluster is running in -- it should be using the BSL region. If restic is using cluster region instead of BSL region, that sounds like a bug. We shouldn't need to pass this in separately to restic. Restic should use the value from the BSL somehow.
@sseago agreed, but it is Velero that is invoking Restic, and the BSL is a Velero object. So Velero somehow needs to communicate that BSL region id through to the Restic CLI - which is currently not happening. Agreed it is a bug.
The Restic docs only seem to offer 2 ways to do this: an environment variable, or a command line option.
There is no easy way to add a new parameter to the Restic command. From the Restic documentation you posted, I think adding an environment variable AWS_DEFAULT_REGION to the Velero server deployment may make it work.
I tried this by changing the Restic DaemonSet container spec to include:
env:
  - name: AWS_DEFAULT_REGION
    value: us-east-2
Then I restarted the Restic pods, but unfortunately it did not work. Got the same error as reported previously.
One other thing to try. Looking at restic github issues, at least one user who had this error resolved it by updating the IAM policy to add "s3:GetBucketLocation". Since the failure happens when the initial request (to the default region) attempts to redirect to a different region, it's possible that this permission is missing. I'm not sure this will help (since it may be that in this case we're dealing with the opposite problem -- restic attempting to redirect to the wrong region), but it's worth trying. If you add this to your user bucket policy, does it help?
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "statement1",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
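(If you want to try it, something like the following would attach that statement as an inline policy on the role used by the secondary cluster's service account; the policy and file names are illustrative:)

# Attach the policy above to the IRSA role (policy/file names are made up)
aws iam put-role-policy \
  --role-name ServiceAccount-Velero-Backup-Secondary \
  --policy-name velero-s3-bucket-location \
  --policy-document file://velero-s3-location-policy.json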
Thanks @sseago , but I already had that permission in the IAM policy.
Velero always sets the region name in the AWS URL, like this: https://bucket-name.s3.region-code.amazonaws.com, where region-code is replaced by the value specified in backupStorageLocation.config.
For Restic, if AWS_DEFAULT_REGION is not set, it (actually the minio client) gets the region name from the URL; otherwise, it respects the value in AWS_DEFAULT_REGION all the time.
Therefore, this behavior generally works in the scenario described in this issue, which means this is not a generic problem.
We need to check where the region name us-west-1 is coming from, because if it is not set anywhere, it should not end up in the connection URL.
It cannot be coming from the BSL of Velero, because the Restic command Velero runs has the correct region: --repo=s3:s3-us-east-2.amazonaws.com....
So, is there any possibility that AWS_DEFAULT_REGION is set somewhere else and overwritten with us-west-1?
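(A couple of quick checks that might help narrow this down; a sketch, not something I have run against this cluster:)

# Confirm which region the bucket actually lives in
aws s3api get-bucket-location --bucket 043124067543-velero-primary

# Look for region-related environment variables injected into the velero server and restic pods
kubectl get pods -n velero -o yaml | grep -i -B 2 -A 2 'REGION'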
@rizblie I tried to reproduce the problem using velero v1.10.0, installed via the CLI and a credentials file, but things seemed to work.
I set up 2 EKS clusters, one in us-east-2 and one in us-west-1, using the same command for installation so that the velero instances on both clusters point to the same bucket:
./velero install \
--provider aws \
--plugins gcr.io/velero-gcp/velero-plugin-for-aws:v1.6.0 \
--bucket jt-restic-ue2 \
--secret-file xxxxxxxx/aws-credentials \
--backup-location-config region=us-east-2 \
--use-node-agent \
--uploader-type restic \
--wait
I ran a backup on the cluster in us-east-2 and restored it on the cluster in us-west-1. The restore was successful, and in the spec of the backuprepository it points to us-east-2:
k get backuprepositories -n velero -oyaml
.....
spec:
  backupStorageLocation: default
  maintenanceFrequency: 168h0m0s
  repositoryType: restic
  resticIdentifier: s3:s3-us-east-2.amazonaws.com/jt-restic-ue2/restic/nginx-example
  volumeNamespace: nginx-example
....
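For completeness, the backup and restore in this test were created with commands roughly like the following (the backup name is illustrative; flags as in the v1.10 CLI):

# On the us-east-2 cluster: back up the namespace, sending volume data through restic (file-system backup)
velero backup create nginx-backup --include-namespaces nginx-example --default-volumes-to-fs-backup --wait

# On the us-west-1 cluster: restore from the shared bucket
velero restore create --from-backup nginx-backup --wait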
Could you try using velero v1.10 and credentials rather than AWS role?
I don't quite understand why restic tries to HEAD a us-west-1 URL when the repo ID in the command points to us-east-2. My guess is that some setting on the EKS side confused restic, which may be a bug in restic.
Closing this issue as not reproducible.