percona-server-mongodb-operator

Physical backup restore stuck on version 1.20.1

Open tomsozolins opened this issue 6 months ago • 4 comments

Report

Restoring from a physical backup with point-in-time recovery gets stuck and never completes. The cluster has sharding enabled, and the test collection is sharded.

➜ k describe PerconaServerMongoDBRestore
Name:         restore1
Namespace:    demo-mongodb
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2025-07-08T10:02:57Z
  Generation:          1
  Resource Version:    9536705
  UID:                 5dd0c8c8-f3b6-4481-8f68-54104afc552c
Spec:
  Backup Name:   backup1
  Cluster Name:  demo-psmdb-db
  Pitr:
    Type:  latest
Status:
  Pbm Name:     2025-07-08T10:10:12.768452917Z
  Pitr Target:  2025-07-08T08:54:01
  State:        requested
Events:         <none>
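For anyone trying to reproduce this, a restore in this condition can be flagged automatically. The following is only a sketch: it assumes the restore object has been fetched as JSON (for example with `kubectl get psmdb-restore restore1 -n demo-mongodb -o json`), and both the set of terminal states and the 30-minute timeout are assumptions rather than values taken from the operator.

```python
from datetime import datetime, timedelta, timezone

# Assumption: states in which the operator no longer works on a restore.
TERMINAL_STATES = {"ready", "error", "rejected"}


def restore_looks_stuck(restore: dict, now: datetime,
                        timeout: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if the restore is older than `timeout` and not in a terminal state."""
    state = restore.get("status", {}).get("state", "")
    if state in TERMINAL_STATES:
        return False
    # Kubernetes timestamps end in "Z"; rewrite it so older Pythons can parse it.
    created = datetime.fromisoformat(
        restore["metadata"]["creationTimestamp"].replace("Z", "+00:00")
    )
    return now - created > timeout


# Values copied from the describe output above.
restore1 = {
    "metadata": {"name": "restore1", "creationTimestamp": "2025-07-08T10:02:57Z"},
    "status": {"state": "requested"},
}
check_time = datetime(2025, 7, 8, 11, 0, 0, tzinfo=timezone.utc)
print(restore_looks_stuck(restore1, check_time))  # True: 57 minutes in "requested"
```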

More about the problem

The operator starts the restore procedure and then stops making progress. This is the last log line it emits:

2025-07-08T10:10:12.789Z	INFO	Restore state changed	{"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"restore1","namespace":"demo-mongodb"}, "namespace": "demo-mongodb", "name": "restore1", "reconcileID": "a38c71a6-7b43-4d90-aa94-6e31f2136a55", "previous": "waiting", "current": "requested"}
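To follow the state transitions across a longer operator log, lines like the one above can be parsed mechanically. This is a minimal sketch that assumes the layout seen here (timestamp, level, message, then a JSON object); the function name is made up for illustration and the sample line is abridged from the log above.

```python
import json
import re

# timestamp, level, free-text message, trailing JSON object
LOG_LINE = re.compile(r"^(\S+)\s+(\w+)\s+(.*?)\s+(\{.*\})$")


def parse_restore_transition(line: str):
    """Return (restore_name, previous, current) for a 'Restore state changed' line, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None or match.group(3) != "Restore state changed":
        return None
    fields = json.loads(match.group(4))
    return fields["name"], fields["previous"], fields["current"]


# Abridged copy of the log line from this report.
sample = (
    '2025-07-08T10:10:12.789Z\tINFO\tRestore state changed\t'
    '{"controller": "psmdbrestore-controller", '
    '"namespace": "demo-mongodb", "name": "restore1", '
    '"previous": "waiting", "current": "requested"}'
)
print(parse_restore_transition(sample))  # ('restore1', 'waiting', 'requested')
```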

The DB is never restored and the cluster stays in the initializing state. Restarting the operator deployment does not help: the operator does not try to continue the restore process.

Steps to reproduce

  1. Create DB
replsets:
  rs0:
    size: 3
    serviceAccountName: psmdb-operator
    resources:
      limits:
        cpu: 300m
        memory: 1024Mi
      requests:
        cpu: 150m
        memory: 512Mi
    volumeSpec:
      pvc:
        storageClassName: gp3
        resources:
          requests:
            storage: 4Gi
    arbiter:
      enabled: false
      size: 1
  rs1:
    size: 3
    serviceAccountName: psmdb-operator
    resources:
      limits:
        cpu: 300m
        memory: 1024Mi
      requests:
        cpu: 150m
        memory: 512Mi
    volumeSpec:
      pvc:
        storageClassName: gp3
        resources:
          requests:
            storage: 4Gi
    arbiter:
      enabled: false
      size: 1
  sharding:
    configrs:
      size: 3
      serviceAccountName: psmdb-operator
      volumeSpec:
        pvc:
          storageClassName: gp3
          resources:
            requests:
              storage: 4Gi
    mongos:
      size: 3
      resources:
        limits:
          cpu: 1000m
          memory: 1024M
        requests:
          cpu: 300m
          memory: 500M
      serviceAccountName: psmdb-operator

  backup:
    enabled: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/psmdb-operator
    storages:
      s3-eu-north-1:
        main: true
        type: s3
        s3:
          bucket: psmdb-operator
          retryer:
            numMaxRetries: 3
            minRetryDelay: 30ms
            maxRetryDelay: 5m
          region: eu-north-1
    pitr:
      enabled: true
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-s3-eu-north-1-physical
        enabled: true
        schedule: "0 0 * * *"
        keep: 30
        type: physical
        storageName: s3-eu-north-1
        compressionType: gzip
        compressionLevel: 6
  1. Log in as the databaseAdmin user with the mongosh CLI and create data
use demo
db.demo.insertOne({ msg: "This is the first document" })
  1. Log in as the clusterAdmin user with the mongosh CLI and shard the collection
use admin
sh.shardCollection("demo.demo", { _id: 1 })
  1. Create backup
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
    - percona.com/delete-backup
  name: backup1
  namespace: demo-mongodb
spec:
  clusterName: demo-psmdb-db
  storageName: s3-eu-north-1
  type: physical
  1. Restore from backup
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore1
spec:
  clusterName: demo-psmdb-db
  backupName: backup1
  pitr:
    type: latest
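One way to narrow this down while reproducing is to retry the restore with an explicit target time instead of `latest`. The sketch below builds such a manifest. It is not a fix confirmed anywhere in this thread. The `type: date` form with a `date` field in `YYYY-MM-DD hh:mm:ss` format is my understanding of the operator's restore options and should be checked against the docs. The name `restore2` is hypothetical, and the timestamp is the Pitr Target reported in the restore status.

```python
import json
from typing import Optional


def build_restore(name: str, cluster: str, backup: str,
                  pitr_date: Optional[str] = None) -> dict:
    """Build a PerconaServerMongoDBRestore manifest; PITR to `pitr_date`, or to latest if omitted."""
    # Assumption: "date" PITR takes a "YYYY-MM-DD hh:mm:ss" string.
    pitr = {"type": "date", "date": pitr_date} if pitr_date else {"type": "latest"}
    return {
        "apiVersion": "psmdb.percona.com/v1",
        "kind": "PerconaServerMongoDBRestore",
        "metadata": {"name": name},
        "spec": {"clusterName": cluster, "backupName": backup, "pitr": pitr},
    }


manifest = build_restore("restore2", "demo-psmdb-db", "backup1", "2025-07-08 08:54:01")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be saved to a file and applied with `kubectl apply -f` in the `demo-mongodb` namespace.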

Versions

  1. Kubernetes EKS 1.31
  2. Operator 1.20.1
  3. Database 1.20.1

Anything else?

No response

tomsozolins Jul 08 '25 10:07

I tried downgrading the operator and DB to 1.20.0, and the restore worked fine on an empty DB. When I tried restoring a DB that has a sharded collection, the restore got stuck, the same as on 1.20.1.

1.19.1 was the only version that properly performed a point-in-time restore on a DB that has a sharded collection.

tomsozolins Jul 10 '25 08:07

After spinning up an empty DB and restoring on 1.19.1, the restore is unable to find the backup in S3, even though the backup CRD object exists and the data is in S3.

There was no such issue on 1.20.1, but there the restore gets stuck replaying the oplog forever. Sometimes it gets stuck in the requested state instead.

tomsozolins Jul 11 '25 18:07

Let us try to reproduce this as well. @eleo007, can you pick it up?

gkech Aug 22 '25 08:08

I'll try to reproduce when I have the chance.

egegunes Sep 12 '25 09:09