
[BUG] Backup stuck at 0% for hours

Open pwurbs opened this issue 7 months ago • 19 comments

Describe the Bug

When we start a manual incremental backup of a volume, the progress stays stuck at 0% for hours. Once it starts to show progress (>0%), it moves pretty fast. The volume contains many small files and there are many deltas between snapshots.

To Reproduce

Start a backup and watch the progress

Expected Behavior

I would at least expect some status/progress information instead of only seeing 0% for hours. I assume there is some ongoing calculation of which blocks have changed and must be uploaded, but the status/progress should always be visible.

Support Bundle for Troubleshooting

Not possible

Environment

  • Longhorn version: 1.7.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 6
  • Node config
    • OS type and version: AlmaLinux 8
    • Kernel version:
    • CPU per node: 16
    • Memory per node: 64GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 8Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud Server
  • Number of Longhorn volumes in the cluster: 50

Additional context

Could you please give us guidance on which factors influence the duration of the init state (0%)? Server performance, disk, and bandwidth obviously play a role, but does the speed also depend on the number of snapshots in the chain or the number of existing backups? We would like to reduce the stuck period and need to know how.

Workaround and Mitigation

No response

pwurbs avatar Apr 25 '25 12:04 pwurbs

Hi @pwurbs, some quick questions:

  • What's the type of the backup store (S3, NFS, ...)?
  • What's the volume's actual size now?
  • Could you provide the support bundle?

Maybe you could provide the backup information first with the command:

kubectl -n longhorn-system get backup [backup-name] -oyaml

mantissahz avatar Apr 25 '25 12:04 mantissahz

  • S3
  • 2TB
  • yaml:
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  creationTimestamp: "2025-04-25T09:42:31Z"
  finalizers:
  - longhorn.io
  generation: 2
  labels:
    backup-volume: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  name: backup-f6494a3f38504fb1
  namespace: longhorn-system
  resourceVersion: "607515223"
  uid: 7b6cd563-b347-46b2-a923-47b7abd06288
spec:
  backupMode: incremental
  labels:
    KubernetesStatus: '{"pvName":"pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4","pvStatus":"Bound","namespace":"foo","pvcName":"pvc","lastPVCRefAt":"","workloadsStatus":[{"podName":"pod","podStatus":"Running","workloadName":"workload","workloadType":"StatefulSet"}],"lastPodRefAt":""}'
    longhorn.io/volume-access-mode: rwo
  snapshotName: 8141fe85-db80-4652-b53f-afd5e1d5fc29
  syncRequestedAt: "2025-04-25T12:01:33Z"
status:
  backupCreatedAt: "2025-04-25T12:01:29Z"
  compressionMethod: lz4
  labels:
    KubernetesStatus: '{"pvName":"pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4","pvStatus":"Bound","namespace":"foo","pvcName":"pvc","lastPVCRefAt":"","workloadsStatus":[{"podName":"pod","podStatus":"Running","workloadName":"workload","workloadType":"StatefulSet"}],"lastPodRefAt":""}'
    longhorn.io/volume-access-mode: rwo
  lastSyncedAt: "2025-04-25T12:01:34Z"
  messages: null
  newlyUploadDataSize: "23489908561"
  ownerID: node01
  progress: 100
  reUploadedDataSize: "0"
  replicaAddress: tcp://10.42.13.127:10642
  size: "2344418803712"
  snapshotCreatedAt: "2025-04-25T09:42:59Z"
  snapshotName: 8141fe85-db80-4652-b53f-afd5e1d5fc29
  state: Completed
  url: s3://[email protected]/folder/?backup=backup-f6494a3f38504fb1&volume=pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  volumeBackingImageName: ""
  volumeCreated: "2025-02-05T04:21:06Z"
  volumeName: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  volumeSize: "2813203578880"

pwurbs avatar Apr 25 '25 12:04 pwurbs

Hi @pwurbs,

This backup is completed. Do you have any backups that are stuck at 0%?

mantissahz avatar Apr 25 '25 13:04 mantissahz

I see. We will provide the backup YAML while it is in the stuck state on Monday.

pwurbs avatar Apr 25 '25 13:04 pwurbs

@pwurbs

It could be that, because the volume is 2.5 TB, it takes a long time to retrieve the file extents at the beginning, before uploading the data to the backup store.

What is the version of your kernel? That can affect the pre-upload time too.
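
A quick way to list the kernel version of every node, if that helps (just a one-liner sketch using the standard Kubernetes node status fields):

kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage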

PhanLe1010 avatar Apr 29 '25 00:04 PhanLe1010

Here is the backup YAML during the stuck state. This state lasted about 3 hours before actual progress (>0%) started. The message "Failed to get the Snapshot" is unclear: the backup snapshot was taken and is there, otherwise the backup could not have completed.

apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  creationTimestamp: "2025-04-28T05:41:38Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    backup-volume: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  name: backup-84db84b592424b7e
  namespace: longhorn-system
  resourceVersion: "608882264"
  uid: ed7ef9a8-458e-44fc-b5c8-247fbf00015e
spec:
  backupMode: incremental
  labels:
    KubernetesStatus: '{"pvName":"pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4","pvStatus":"Bound","namespace":"foo","pvcName":"pvc","lastPVCRefAt":"","workloadsStatus":[{"podName":"pod","podStatus":"Running","workloadName":"workload","workloadType":"StatefulSet"}],"lastPodRefAt":""}'
    longhorn.io/volume-access-mode: rwo
  snapshotName: 1a4d11a5-3062-4f67-bbd5-9cf499249298
  syncRequestedAt: null
status:
  backupCreatedAt: ""
  compressionMethod: ""
  labels: null
  lastSyncedAt: null
  messages:
    info: Failed to get the Snapshot 1a4d11a5-3062-4f67-bbd5-9cf499249298
  newlyUploadDataSize: ""
  ownerID: node01
  progress: 0
  reUploadedDataSize: ""
  replicaAddress: tcp://10.42.13.127:10642
  size: ""
  snapshotCreatedAt: "2025-04-28T05:41:37Z"
  snapshotName: 1a4d11a5-3062-4f67-bbd5-9cf499249298
  state: InProgress
  url: ""
  volumeBackingImageName: ""
  volumeCreated: ""
  volumeName: ""
  volumeSize: "2813203578880"

pwurbs avatar Apr 29 '25 05:04 pwurbs

Kernel version: 4.18.0-553.50.1.el8_10.x86_64. There is also another volume with nearly 1.6 TB; there the 0% state only takes about 10 minutes. The main intention of this ticket is to find out how the stuck state can be influenced, see the original post.

pwurbs avatar Apr 29 '25 05:04 pwurbs

Could you show the information using the command?

kubectl -n longhorn-system get snapshot 1a4d11a5-3062-4f67-bbd5-9cf499249298 -oyaml

Failed to get the Snapshot 1a4d11a5-3062-4f67-bbd5-9cf499249298

As @PhanLe1010 mentioned,

It could be that, because the volume is 2.5 TB, it takes a long time to retrieve the file extents at the beginning, before uploading the data to the backup store.

The snapshot will be processed by the longhorn-engine, and that might take a long time for a 2.5 TB volume. The error message appears because the Snapshot CR might not have been created yet.
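
For the next backup that gets stuck at 0%, you could also watch whether the Snapshot CR already exists while that message is shown (a sketch using the longhornvolume label that Longhorn puts on Snapshot CRs):

kubectl -n longhorn-system get snapshot -l longhornvolume=pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4 --watch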

mantissahz avatar Apr 29 '25 06:04 mantissahz

Snapshot Yaml:

apiVersion: longhorn.io/v1beta2
kind: Snapshot
metadata:
  creationTimestamp: "2025-04-28T05:41:39Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    longhornvolume: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  name: 1a4d11a5-3062-4f67-bbd5-9cf499249298
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
    uid: 251ffa94-f158-4805-b901-a2d754106540
  resourceVersion: "608952990"
  uid: 32d83813-4bab-47ea-8156-9efc2a317bd5
spec:
  createSnapshot: false
  labels: null
  volume: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
status:
  checksum: ""
  children:
    snapshot-f6574122-bf7a-4b23-a287-beba4540c4b6: true
  creationTime: "2025-04-28T05:41:37Z"
  labels: {}
  markRemoved: false
  ownerID: ""
  parent: snapshot-d1b67f2a-68c5-40b6-898e-e169dfcb2858
  readyToUse: true
  restoreSize: 2813203578880
  size: 6639845376
  userCreated: true

pwurbs avatar Apr 29 '25 06:04 pwurbs

Could you provide the volume engine information as well?

kubectl -n longhorn-system get engine pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-e-0[not sure] -oyaml

Could you provide the support bundle so we can investigate the logs of the instance-manager and engine pods?

mantissahz avatar Apr 29 '25 06:04 mantissahz

Here is the engine YAML. I know we have quite a few snapshots in the chain; we already deleted snapshots before the backup, but that didn't change anything. Providing a support bundle is not easy due to compliance, but I could provide logs if you tell me from which component and time frame, and provide an email address.

apiVersion: longhorn.io/v1beta2
kind: Engine
metadata:
  creationTimestamp: "2025-01-24T14:38:40Z"
  finalizers:
  - longhorn.io
  generation: 175
  labels:
    longhornnode: node01
    longhornvolume: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  name: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-e-0
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
    uid: 251ffa94-f158-4805-b901-a2d754106540
  resourceVersion: "609418361"
  uid: 1b49d100-62c6-4564-a9eb-ff75a6457c72
spec:
  active: true
  backendStoreDriver: ""
  backupVolume: ""
  dataEngine: v1
  desireState: running
  disableFrontend: false
  engineImage: ""
  frontend: blockdev
  image: rancher/mirrored-longhornio-longhorn-engine:v1.7.3
  logRequested: false
  nodeID: node01
  replicaAddressMap:
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-5b7e4be3: 10.42.15.31:10675
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-6e39be3c: 10.42.13.127:10642
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-dabf1faa: 10.42.17.8:10476
  requestedBackupRestore: ""
  requestedDataSource: ""
  revisionCounterDisabled: false
  salvageRequested: false
  snapshotMaxCount: 250
  snapshotMaxSize: "0"
  unmapMarkSnapChainRemovedEnabled: true
  upgradedReplicaAddressMap: {}
  volumeName: pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  volumeSize: "2813203578880"
status:
  backupStatus: null
  cloneStatus:
    tcp://10.42.13.127:10642:
      error: ""
      fromReplicaAddress: ""
      isCloning: false
      progress: 0
      snapshotName: ""
      state: ""
    tcp://10.42.15.31:10675:
      error: ""
      fromReplicaAddress: ""
      isCloning: false
      progress: 0
      snapshotName: ""
      state: ""
    tcp://10.42.17.8:10476:
      error: ""
      fromReplicaAddress: ""
      isCloning: false
      progress: 0
      snapshotName: ""
      state: ""
  conditions:
  - lastProbeTime: ""
    lastTransitionTime: "2025-01-24T14:38:40Z"
    message: ""
    reason: ""
    status: "True"
    type: InstanceCreation
  - lastProbeTime: ""
    lastTransitionTime: "2025-04-23T01:36:07Z"
    message: ""
    reason: ""
    status: "False"
    type: FilesystemReadOnly
  currentImage: rancher/mirrored-longhornio-longhorn-engine:v1.7.3
  currentReplicaAddressMap:
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-5b7e4be3: 10.42.15.31:10675
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-6e39be3c: 10.42.13.127:10642
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-dabf1faa: 10.42.17.8:10476
  currentSize: "2813203578880"
  currentState: running
  endpoint: /dev/longhorn/pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4
  instanceManagerName: instance-manager-31393ee67f35e278ca37b31773f331d7
  ip: 10.42.15.31
  isExpanding: false
  lastExpansionError: ""
  lastExpansionFailedAt: ""
  lastRestoredBackup: ""
  logFetched: false
  ownerID: node01
  port: 10664
  purgeStatus:
    tcp://10.42.13.127:10642:
      error: ""
      isPurging: false
      progress: 100
      state: complete
    tcp://10.42.15.31:10675:
      error: ""
      isPurging: false
      progress: 100
      state: complete
    tcp://10.42.17.8:10476:
      error: ""
      isPurging: false
      progress: 100
      state: complete
  rebuildStatus: {}
  replicaModeMap:
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-5b7e4be3: RW
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-6e39be3c: RW
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-dabf1faa: RW
  replicaTransitionTimeMap:
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-5b7e4be3: "2025-04-25T08:47:44Z"
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-6e39be3c: "2025-04-25T05:33:04Z"
    pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4-r-dabf1faa: "2025-04-25T05:33:04Z"
  restoreStatus:
    tcp://10.42.13.127:10642:
      backupURL: ""
      currentRestoringBackup: ""
      isRestoring: false
      lastRestored: ""
      state: ""
    tcp://10.42.15.31:10675:
      backupURL: ""
      currentRestoringBackup: ""
      isRestoring: false
      lastRestored: ""
      state: ""
    tcp://10.42.17.8:10476:
      backupURL: ""
      currentRestoringBackup: ""
      isRestoring: false
      lastRestored: ""
      state: ""
  salvageExecuted: false
  snapshotMaxCount: 250
  snapshotMaxSize: "0"
  snapshots:
    66c3359f-caa4-4b0b-a091-c5b1f0abdcf5:
      children:
        snapshot-8872a549-73de-45c8-a565-8e3b610cb307: true
      created: "2025-04-27T05:22:14Z"
      labels: {}
      name: 66c3359f-caa4-4b0b-a091-c5b1f0abdcf5
      parent: snapshot-339023fd-9ecf-469e-a90a-8369f3fd2d03
      removed: false
      size: "2860630016"
      usercreated: true
    930bb73f-d7ce-47e4-8a32-568638a8d2a8:
      children:
        volume-head: true
      created: "2025-04-29T05:40:39Z"
      labels: {}
      name: 930bb73f-d7ce-47e4-8a32-568638a8d2a8
      parent: snapshot-7408532d-9002-44b2-b06f-5d16be11100c
      removed: false
      size: "5629947904"
      usercreated: true
    1181b674-e401-4abc-ae3f-2a429e15adcb:
      children:
        44293331-be49-4608-949a-7d410e39db47: true
      created: "2025-04-24T19:14:17Z"
      labels: {}
      name: 1181b674-e401-4abc-ae3f-2a429e15adcb
      parent: ""
      removed: false
      size: "2111259172864"
      usercreated: true
    1a4d11a5-3062-4f67-bbd5-9cf499249298:
      children:
        snapshot-f6574122-bf7a-4b23-a287-beba4540c4b6: true
      created: "2025-04-28T05:41:37Z"
      labels: {}
      name: 1a4d11a5-3062-4f67-bbd5-9cf499249298
      parent: snapshot-d1b67f2a-68c5-40b6-898e-e169dfcb2858
      removed: false
      size: "6639845376"
      usercreated: true
    92a84ff9-b138-4928-829e-628027250bfd:
      children:
        snapshot-339023fd-9ecf-469e-a90a-8369f3fd2d03: true
      created: "2025-04-26T04:21:07Z"
      labels: {}
      name: 92a84ff9-b138-4928-829e-628027250bfd
      parent: 44293331-be49-4608-949a-7d410e39db47
      removed: false
      size: "103832834048"
      usercreated: true
    44293331-be49-4608-949a-7d410e39db47:
      children:
        92a84ff9-b138-4928-829e-628027250bfd: true
      created: "2025-04-25T03:51:12Z"
      labels: {}
      name: 44293331-be49-4608-949a-7d410e39db47
      parent: 1181b674-e401-4abc-ae3f-2a429e15adcb
      removed: false
      size: "2945163264"
      usercreated: true
    snapshot-278165cc-378c-4bc9-bf9f-8cd5a11af0da:
      children:
        snapshot-db9f9f79-a475-4d2c-a111-f350f2d1aa0b: true
      created: "2025-04-28T15:02:45Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-278165cc-378c-4bc9-bf9f-8cd5a11af0da
      parent: snapshot-5a9ee446-756b-40f4-8a93-acbf847081ae
      removed: false
      size: "27801444352"
      usercreated: true
    snapshot-2c94cfe7-f7d8-49d7-ae1a-08e36e1f2098:
      children:
        snapshot-b865d9ec-fd33-4610-a1b9-95598cce0760: true
      created: "2025-04-27T15:01:37Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-2c94cfe7-f7d8-49d7-ae1a-08e36e1f2098
      parent: snapshot-8872a549-73de-45c8-a565-8e3b610cb307
      removed: false
      size: "4886077440"
      usercreated: true
    snapshot-5a9ee446-756b-40f4-8a93-acbf847081ae:
      children:
        snapshot-278165cc-378c-4bc9-bf9f-8cd5a11af0da: true
      created: "2025-04-28T12:01:53Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-5a9ee446-756b-40f4-8a93-acbf847081ae
      parent: snapshot-f6574122-bf7a-4b23-a287-beba4540c4b6
      removed: false
      size: "24676896768"
      usercreated: true
    snapshot-8872a549-73de-45c8-a565-8e3b610cb307:
      children:
        snapshot-2c94cfe7-f7d8-49d7-ae1a-08e36e1f2098: true
      created: "2025-04-27T12:00:01Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-8872a549-73de-45c8-a565-8e3b610cb307
      parent: 66c3359f-caa4-4b0b-a091-c5b1f0abdcf5
      removed: false
      size: "4279427072"
      usercreated: true
    snapshot-339023fd-9ecf-469e-a90a-8369f3fd2d03:
      children:
        66c3359f-caa4-4b0b-a091-c5b1f0abdcf5: true
      created: "2025-04-26T18:00:47Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-339023fd-9ecf-469e-a90a-8369f3fd2d03
      parent: 92a84ff9-b138-4928-829e-628027250bfd
      removed: false
      size: "26687328256"
      usercreated: true
    snapshot-7408532d-9002-44b2-b06f-5d16be11100c:
      children:
        930bb73f-d7ce-47e4-8a32-568638a8d2a8: true
      created: "2025-04-28T23:01:02Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-7408532d-9002-44b2-b06f-5d16be11100c
      parent: snapshot-db9f9f79-a475-4d2c-a111-f350f2d1aa0b
      removed: false
      size: "3615174656"
      usercreated: true
    snapshot-b865d9ec-fd33-4610-a1b9-95598cce0760:
      children:
        snapshot-d1b67f2a-68c5-40b6-898e-e169dfcb2858: true
      created: "2025-04-27T18:02:59Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-b865d9ec-fd33-4610-a1b9-95598cce0760
      parent: snapshot-2c94cfe7-f7d8-49d7-ae1a-08e36e1f2098
      removed: false
      size: "3618750464"
      usercreated: true
    snapshot-d1b67f2a-68c5-40b6-898e-e169dfcb2858:
      children:
        1a4d11a5-3062-4f67-bbd5-9cf499249298: true
      created: "2025-04-27T23:00:47Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-d1b67f2a-68c5-40b6-898e-e169dfcb2858
      parent: snapshot-b865d9ec-fd33-4610-a1b9-95598cce0760
      removed: false
      size: "2128678912"
      usercreated: true
    snapshot-db9f9f79-a475-4d2c-a111-f350f2d1aa0b:
      children:
        snapshot-7408532d-9002-44b2-b06f-5d16be11100c: true
      created: "2025-04-28T18:03:30Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-db9f9f79-a475-4d2c-a111-f350f2d1aa0b
      parent: snapshot-278165cc-378c-4bc9-bf9f-8cd5a11af0da
      removed: false
      size: "17012461568"
      usercreated: true
    snapshot-f6574122-bf7a-4b23-a287-beba4540c4b6:
      children:
        snapshot-5a9ee446-756b-40f4-8a93-acbf847081ae: true
      created: "2025-04-28T09:00:17Z"
      labels:
        RecurringJob: snapshots
      name: snapshot-f6574122-bf7a-4b23-a287-beba4540c4b6
      parent: 1a4d11a5-3062-4f67-bbd5-9cf499249298
      removed: false
      size: "19462991872"
      usercreated: true
    volume-head:
      children: {}
      created: "2025-04-29T05:40:39Z"
      labels: {}
      name: volume-head
      parent: 930bb73f-d7ce-47e4-8a32-568638a8d2a8
      removed: false
      size: "7169769472"
      usercreated: false
  snapshotsError: ""
  started: true
  storageIP: 10.42.15.31
  unmapMarkSnapChainRemovedEnabled: true

pwurbs avatar Apr 29 '25 06:04 pwurbs

Kernel version: 4.18.0-553.50.1.el8_10.x86_64. There is also another volume with nearly 1.6 TB; there the 0% state only takes about 10 minutes. The main intention of this ticket is to find out how the stuck state can be influenced, see the original post.

Old kernels have a slow ext4 extent retrieval issue: https://github.com/longhorn/longhorn/issues/2507#issuecomment-857195496 https://longhorn.io/docs/1.8.1/best-practices/
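
To get a feeling for how much of the stalled time is the extent retrieval itself, you could time filefrag (which issues a FIEMAP call similar to what the engine relies on) against one replica's snapshot file. This is only a sketch; the /var/lib/longhorn/replicas path and the volume-snap-*.img naming are assumptions, so adjust them to your disk layout:

# Run on a node that holds a replica of the affected volume (paths are assumptions).
cd /var/lib/longhorn/replicas/<replica-directory>
time filefrag -v volume-snap-<snapshot-name>.img > /dev/null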

derekbit avatar Apr 29 '25 23:04 derekbit

Ok, thx. But again my points:

  • Would it make sense to show the progress during the init phase (the stuck 0% state) in a better way?
  • What can we do to accelerate the init phase? Reduce the number of snapshots, reduce the number of old backups, use full instead of incremental backups, ...?

pwurbs avatar Apr 30 '25 06:04 pwurbs

  1. The first step, getting the file extents map, is a system call. We cannot have this information until the call returns.
  2. I think you can try to upgrade the kernel to see if it improves.

PhanLe1010 avatar Apr 30 '25 06:04 PhanLe1010

Upgrading the kernel means migrating to a newer OS, which is a major step for us. So, in the meantime, we are looking for optimization opportunities. Maybe you know how the impact of the slow system call can be minimized, e.g. by reducing the number of snapshots, reducing the number of old backups, using full instead of incremental backups, ...?

pwurbs avatar Apr 30 '25 06:04 pwurbs

I think it should be kept as an incremental backup. Reducing the size of the snapshots may help (i.e., taking and deleting backups more often, so the size of the volume head stays small).
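
To see how large each snapshot in the chain currently is, something like this should work (a sketch based on the size and creationTime fields visible in the Snapshot YAML above):

kubectl -n longhorn-system get snapshot -l longhornvolume=pvc-0a67e13c-9446-452d-96ec-1f340f82dfa4 -o custom-columns=NAME:.metadata.name,SIZE:.status.size,CREATED:.status.creationTime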

PhanLe1010 avatar May 07 '25 00:05 PhanLe1010

Thx @PhanLe1010, this confirms our observation. Would you also say that 2 TiB is a magic threshold for running into this issue? Meaning, is the very long stuck state only seen when the volume usage is above 2 TiB? That is somewhat our perception.

pwurbs avatar May 07 '25 09:05 pwurbs

Would you please answer my last question? Then, we can close the issue.

pwurbs avatar Jun 10 '25 07:06 pwurbs

Similar story here – backup stuck for days. This is Longhorn 1.8.2 with backup target over NFS. It seems like manually triggered backups work fine.

apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  creationTimestamp: "2025-07-08T13:00:11Z"
  finalizers:
  - longhorn.io
  generation: 1
  labels:
    backup-target: default
    backup-volume: pvc-5fca5355-d722-4560-9d5f-6d28e477edc9
  name: backup-d8238add5a944a6e
  namespace: longhorn-system
  resourceVersion: "1891241784"
  uid: 9d73e807-ae7c-41d9-8d8a-e4ca91d0b25a
spec:
  backupMode: incremental
  labels:
    KubernetesStatus: '{"pvName":"pvc-5fca5355-d722-4560-9d5f-6d28e477edc9","pvStatus":"Bound","namespace":"dockerhub-proxy","pvcName":"dockerhub-proxy-docker-registry","lastPVCRefAt":"","workloadsStatus":[{"podName":"dockerhub-proxy-docker-registry-775c964fd7-tzxwc","podStatus":"Running","workloadName":"dockerhub-proxy-docker-registry-775c964fd7","workloadType":"ReplicaSet"}],"lastPodRefAt":""}'
    RecurringJob: minio
    longhorn.io/volume-access-mode: rwo
  snapshotName: minio-c-120067cd-4ef9-406a-ac7f-862f129f5a74
  syncRequestedAt: null
status:
  backupCreatedAt: ""
  backupTargetName: ""
  compressionMethod: ""
  labels: null
  lastSyncedAt: null
  messages: {}
  newlyUploadDataSize: ""
  ownerID: elnath
  progress: 0
  reUploadedDataSize: ""
  replicaAddress: tcp://10.42.0.209:10010
  size: ""
  snapshotCreatedAt: "2025-07-08T13:00:01Z"
  snapshotName: minio-c-120067cd-4ef9-406a-ac7f-862f129f5a74
  state: InProgress
  url: ""
  volumeBackingImageName: ""
  volumeCreated: ""
  volumeName: ""
  volumeSize: "107374182400"

moubctez avatar Jul 09 '25 07:07 moubctez

@moubctez Can you provide a support bundle? cc @COLDTURNIP @mantissahz

derekbit avatar Jul 09 '25 07:07 derekbit