
MedusaBackup may have `status.finishTime` set to "1970-01-01T00:00:00Z" while backup is in progress

c3-clement opened this issue 9 months ago • 8 comments

What happened?

A MedusaBackup CR may have the field status.finishTime set to "1970-01-01T00:00:00Z" while the backup is still in progress.


According to https://docs.k8ssandra.io/tasks/backup-restore/#checking-backup-completion, a backup is considered complete when status.finishTime is set.

Did you expect to see something different?

I expected status.finishTime not to be set while the backup is in progress, as the documentation states.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy a K8ssandra cluster
  2. Create a MedusaBackupJob
  3. Create a MedusaTask in sync mode while the backup job is in progress

If the backup sync completes before the backup job does, the issue occurs (a minimal MedusaTask manifest for step 3 is sketched below).
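For reference, the sync task in step 3 is a MedusaTask with operation: sync. A minimal manifest looks roughly like this; the name, namespace and datacenter are placeholders taken from the environment described below:

```yaml
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaTask
metadata:
  name: sync-backups            # placeholder name
  namespace: eksdoucy           # placeholder namespace
spec:
  cassandraDatacenter: k8scass-001   # placeholder datacenter name
  operation: sync
```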

Environment

  • K8ssandra Operator version: 1.14

  • Kubernetes version information: 1.27 and 1.29

  • Kubernetes cluster kind: EKS, AKS and GKE

  • K8ssandra Operator Logs:


Anything else we need to know?:

I'm pretty sure the issue is caused by this line: https://github.com/k8ssandra/k8ssandra-operator/blob/main/controllers/medusa/medusatask_controller.go#L328

That line sets finishTime to the zero Unix timestamp (1970-01-01). However, the zero value of metav1.Time is not 1970-01-01 but 0001-01-01 (January 1, year 1, 00:00:00 UTC).
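For illustration, here is a minimal standalone Go snippet (using only the apimachinery types, not the operator's code) showing that the Unix epoch is not the zero value of metav1.Time, so zero-value checks don't catch it:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Converting a zero Unix timestamp produces the epoch, which marshals as
	// "1970-01-01T00:00:00Z" in the CR status.
	epoch := metav1.Unix(0, 0)

	// The zero value of metav1.Time is 0001-01-01T00:00:00Z, not the epoch,
	// so checks like IsZero() do not treat the epoch as "unset".
	var zero metav1.Time

	fmt.Println(epoch.IsZero()) // false
	fmt.Println(zero.IsZero())  // true
}
```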

  • Manifests:
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaBackup
  metadata:
    creationTimestamp: "2024-04-26T17:00:04Z"
    generation: 1
    name: k8scass-cs-001-k8scass-001-k9sbq
    namespace: eksdoucy
    resourceVersion: "5686257"
    uid: b793f159-5cf0-48b9-9dd5-23f63e49e93f
  spec:
    backupType: differential
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "1970-01-01T00:00:00Z"
    finishedNodes: 1
    nodes:
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-1
      rack: default
      tokens:
      - -2589360646730023400
      - -3653701636503086600
      - -4912122227646739000
      - -7033817127648975000
      - -769356345750076800
      - -8091193644577933000
      - 1107707983762909200
      - 2276662906941923600
      - 3325786707271115000
      - 4261699880062630000
      - 4677337539634270000
      - 5889017886306251000
      - 6286340738654964000
      - 7544761454600407000
      - 8579062101275812000
      - 9005348602556940000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-0
      rack: default
      tokens:
      - -1364658714710476000
      - -2182330647322826800
      - -288902833484896500
      - -3200182052065254000
      - -4271682668405287000
      - -5711950898307870000
      - -6556327545799032000
      - -7647624656491244000
      - -8743666556726182000
      - 1526385666673598700
      - 2769905672354191000
      - 3737269915432726000
      - 447012859225997100
      - 5327274467946625000
      - 6946297763643668000
      - 7924443216159782000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-2
      rack: default
      tokens:
      - -1689492637652962000
      - -2827619177042857000
      - -3889971576713209000
      - -4552103083015450000
      - -5216611785192708000
      - -6043458241930822000
      - -7285896556610953000
      - -8343453241301002000
      - -9010383990221943000
      - 142423091302120580
      - 1973957006571479000
      - 5045680731261046000
      - 6678161695265310000
      - 7311145518738443000
      - 8289036375131932000
      - 856718555699781500
    startTime: "2024-04-26T16:59:13Z"
    status: IN_PROGRESS
    totalNodes: 3
kind: List
metadata:
  resourceVersion: ""
---
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaRestoreJob
  metadata:
    creationTimestamp: "2024-04-26T17:10:41Z"
    generateName: k8scass-cs-001-k8scass-001-k9sbq-
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/parent-resource: eksdoucy-c3cassandrarestore-k8scass-cs-001-6pr92
      role: c3aiops-C3CassandraRestore
    name: k8scass-cs-001-k8scass-001-k9sbq-rh4hx
    namespace: eksdoucy
    ownerReferences:
    - apiVersion: cassandra.datastax.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CassandraDatacenter
      name: k8scass-001
      uid: 2ad0341f-cb57-425e-9a3f-533a31a83152
    resourceVersion: "5691659"
    uid: 596f85ce-01da-464a-a666-482f330090ae
  spec:
    backup: k8scass-cs-001-k8scass-001-k9sbq
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "2024-04-26T17:10:56Z"
    message: target backup has not completed successfully
    restoreKey: 6dca1918-1fdf-47dd-86b1-3f29f0b14c9b
    restoreMapping: {}
    startTime: "2024-04-26T17:10:56Z"
kind: List
metadata:
  resourceVersion: ""

c3-clement · Apr 29, 2024

Hello,

I have the same issue on several clusters. On my side, the MedusaBackupJobs are triggered by a MedusaBackupSchedule. Additionally, subsequent MedusaBackupJobs are no longer triggered. Could it be related to the status of the last MedusaBackup?

mlallaouret · Apr 30, 2024

Here are some logs to help debug the issue.

The k8ssandra-operator logs show that the MedusaBackupJob that is still IN_PROGRESS (but has finishTime set to 1970-01-01T00:00:00Z) is blocking the next scheduled backup:

2024-05-02T07:46:58.393Z	DEBUG	Postponing backup schedule due to existing active backups	{"controller": "medusabackupschedule", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupSchedule", "MedusaBackupSchedule": {"name":"medusa-backup-schedule-dc1","namespace":"cassandra"}, "namespace": "cassandra", "name": "medusa-backup-schedule-dc1", "reconcileID": "7e23a08a-864f-40dd-bb88-47195370663e", "medusabackupscheduler": "cassandra/medusa-backup-schedule-dc1", "MedusaBackupSchedule": {"namespace": "cassandra", "name": "medusa-backup-schedule-dc1"}}

In my case only one node did not finish the backup. Here are the logs from its medusa container, which keep repeating:

[2024-05-02 07:58:46,167] INFO: Using service file: /etc/medusa-secrets/credentials
[2024-05-02 07:58:46,167] DEBUG: Connecting to Google Storage
[2024-05-02 07:58:46,168] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql was not found in cache.
[2024-05-02 07:58:46,168] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql
[2024-05-02 07:58:46,325] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql last modification time is 2024-04-06 02:00:16.517000
[2024-05-02 07:58:46,326] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,326] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,348] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,348] DEBUG: Registered backup name medusa-backup-schedule-dc1-1712368800 found existing, replacing with new
[2024-05-02 07:58:46,348] DEBUG: Cancelling backup` id: medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] INFO: Registered backup id medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,349] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,372] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,373] DEBUG: Updated from existing status: -1 to new status: 0 for backup id: medusa-backup-schedule-dc1-1712368800 
[2024-05-02 07:58:46,373] DEBUG: Disconnecting from Google Storage

mlallaouret · May 2, 2024

@rzvoncek, did we change something recently that would set the finishTime where it wasn't set before?

adejanovski · May 2, 2024

For information, deleting the MedusaBackupJob stuck in IN_PROGRESS status triggers a new MedusaBackupJob from the MedusaBackupSchedule.

mlallaouret · May 2, 2024

Hi. I tried to reproduce this, but I could not.

On a fresh deployment, I always started the sync task after creating the MedusaBackupJob, but sometimes before the job actually started (i.e. before it got its start timestamp) and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator. In both cases, running the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.

It might be that I'd need a MedusaBackupJob stuck in IN_PROGRESS, which currently happens a lot because of a bug we fixed here. I think I'll go ahead and try to reproduce this as well.

rzvoncek · May 7, 2024

> Hi. I tried to reproduce this, but I could not.
>
> On a fresh deployment, I always started the sync task after creating the MedusaBackupJob, but sometimes before the job actually started (i.e. before it got its start timestamp) and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator. In both cases, running the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.

What is the number of replicas in your Cassandra datacenter? I observed that with 1 replica the issue is hard to reproduce, but with 3 replicas it happens consistently (since the backup job takes longer).

c3-clement · May 7, 2024

I've got the same issue on a 6-node cluster. I'm trying to fix it by cleaning up all Medusa tasks, backup jobs, schedules, snapshots, etc., but no luck so far.

Pylypp · May 27, 2024

Hi folks, the only place where this could happen in the code is here. If we have a 0 value for the finishTime, it's going to convert to 1970-01-01T00:00:00Z.

Nonetheless, this means the sync is running concurrently with a backup which isn't finished yet, and creating that MedusaBackup object.

Most likely, the sync should only persist completed backups locally, since there's nothing to do with incomplete ones anyway.

I'll prepare a PR for this, so that if we don't have a finish time we don't sync the MedusaBackup object.
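For illustration, the guard could look roughly like the sketch below. The remoteBackup type and its field names are hypothetical stand-ins for whatever the Medusa client returns, not the operator's actual types:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// remoteBackup is a hypothetical stand-in for the per-backup summary returned
// by the Medusa client; the real operator types and field names may differ.
type remoteBackup struct {
	Name       string
	FinishTime int64 // Unix seconds; 0 means the backup has not finished yet
}

func main() {
	backups := []remoteBackup{
		{Name: "backup-1", FinishTime: 1714143604},
		{Name: "backup-2", FinishTime: 0}, // still in progress
	}

	for _, b := range backups {
		if b.FinishTime == 0 {
			// Proposed behaviour: skip in-progress backups instead of creating
			// a MedusaBackup whose finishTime renders as 1970-01-01T00:00:00Z.
			fmt.Printf("skipping in-progress backup %s\n", b.Name)
			continue
		}
		finish := metav1.Unix(b.FinishTime, 0)
		fmt.Printf("syncing backup %s (finished at %s)\n", b.Name, finish.UTC())
	}
}
```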

adejanovski · May 28, 2024