k8ssandra-operator
MedusaBackup may have `status.finishTime` set to "1970-01-01T00:00:00Z" while backup is in progress
What happened?
A `MedusaBackup` CR may have the field `status.finishTime` set to "1970-01-01T00:00:00Z" while the backup is still in progress. According to https://docs.k8ssandra.io/tasks/backup-restore/#checking-backup-completion, a backup is considered completed when `status.finishTime` is set.
Did you expect to see something different?
I expected the field `status.finishTime` not to be set while the backup is in progress, as stated by the documentation.
How to reproduce it (as minimally and precisely as possible):
- Deploy a K8ssandra cluster
- Create a `MedusaBackupJob`
- Create a `MedusaTask` in sync mode while the backup job is in progress

If the backup sync completes before the backup job completes, the issue will happen.
Environment
- K8ssandra Operator version: 1.14
- Kubernetes version information: 1.27 and 1.29
- Kubernetes cluster kind: EKS, AKS, and GKE
- K8ssandra Operator Logs: not provided
Anything else we need to know?:
I'm pretty sure the issue is caused by this line: https://github.com/k8ssandra/k8ssandra-operator/blob/main/controllers/medusa/medusatask_controller.go#L328
It sets `finishTime` to the zero Unix timestamp (1970-01-01); however, the zero `metav1.Time` timestamp is not 1970-01-01 but 0001-01-01 (January 1, year 1, 00:00:00 UTC).
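To make the distinction concrete, here is a small, self-contained Go snippet (not the operator's code, just an illustration) showing that converting a zero Unix timestamp with `metav1.Unix` yields the epoch, which is not the zero value of `metav1.Time`, so "is the finish time set?" checks treat it as a real completion time:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A backup that has not finished yet reports its finish time as 0 Unix seconds.
	var finishSeconds int64 = 0

	// Converting that 0 with metav1.Unix yields the Unix epoch, which serializes
	// to "1970-01-01T00:00:00Z" in the CR status...
	epoch := metav1.Unix(finishSeconds, 0)
	fmt.Println(epoch.UTC(), epoch.IsZero()) // 1970-01-01 00:00:00 +0000 UTC false

	// ...whereas the zero value of metav1.Time is January 1 of year 1, and is the
	// only value that IsZero() (and "field not set" checks) recognizes as unset.
	var unset metav1.Time
	fmt.Println(unset.UTC(), unset.IsZero()) // 0001-01-01 00:00:00 +0000 UTC true
}
```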
- Manifests:
```yaml
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaBackup
  metadata:
    creationTimestamp: "2024-04-26T17:00:04Z"
    generation: 1
    name: k8scass-cs-001-k8scass-001-k9sbq
    namespace: eksdoucy
    resourceVersion: "5686257"
    uid: b793f159-5cf0-48b9-9dd5-23f63e49e93f
  spec:
    backupType: differential
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "1970-01-01T00:00:00Z"
    finishedNodes: 1
    nodes:
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-1
      rack: default
      tokens:
      - -2589360646730023400
      - -3653701636503086600
      - -4912122227646739000
      - -7033817127648975000
      - -769356345750076800
      - -8091193644577933000
      - 1107707983762909200
      - 2276662906941923600
      - 3325786707271115000
      - 4261699880062630000
      - 4677337539634270000
      - 5889017886306251000
      - 6286340738654964000
      - 7544761454600407000
      - 8579062101275812000
      - 9005348602556940000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-0
      rack: default
      tokens:
      - -1364658714710476000
      - -2182330647322826800
      - -288902833484896500
      - -3200182052065254000
      - -4271682668405287000
      - -5711950898307870000
      - -6556327545799032000
      - -7647624656491244000
      - -8743666556726182000
      - 1526385666673598700
      - 2769905672354191000
      - 3737269915432726000
      - 447012859225997100
      - 5327274467946625000
      - 6946297763643668000
      - 7924443216159782000
    - datacenter: k8scass-001
      host: k8scass-cs-001-k8scass-001-default-sts-2
      rack: default
      tokens:
      - -1689492637652962000
      - -2827619177042857000
      - -3889971576713209000
      - -4552103083015450000
      - -5216611785192708000
      - -6043458241930822000
      - -7285896556610953000
      - -8343453241301002000
      - -9010383990221943000
      - 142423091302120580
      - 1973957006571479000
      - 5045680731261046000
      - 6678161695265310000
      - 7311145518738443000
      - 8289036375131932000
      - 856718555699781500
    startTime: "2024-04-26T16:59:13Z"
    status: IN_PROGRESS
    totalNodes: 3
kind: List
metadata:
  resourceVersion: ""
```

```yaml
apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaRestoreJob
  metadata:
    creationTimestamp: "2024-04-26T17:10:41Z"
    generateName: k8scass-cs-001-k8scass-001-k9sbq-
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/parent-resource: eksdoucy-c3cassandrarestore-k8scass-cs-001-6pr92
      role: c3aiops-C3CassandraRestore
    name: k8scass-cs-001-k8scass-001-k9sbq-rh4hx
    namespace: eksdoucy
    ownerReferences:
    - apiVersion: cassandra.datastax.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CassandraDatacenter
      name: k8scass-001
      uid: 2ad0341f-cb57-425e-9a3f-533a31a83152
    resourceVersion: "5691659"
    uid: 596f85ce-01da-464a-a666-482f330090ae
  spec:
    backup: k8scass-cs-001-k8scass-001-k9sbq
    cassandraDatacenter: k8scass-001
  status:
    finishTime: "2024-04-26T17:10:56Z"
    message: target backup has not completed successfully
    restoreKey: 6dca1918-1fdf-47dd-86b1-3f29f0b14c9b
    restoreMapping: {}
    startTime: "2024-04-26T17:10:56Z"
kind: List
metadata:
  resourceVersion: ""
```
Hello,
I have the same issue on several clusters. The `MedusaBackupJob`s are triggered by a `MedusaBackupSchedule` on my side. Additionally, the subsequent `MedusaBackupJob`s are no longer triggered. Could it be related to the status of the last `MedusaBackup`?
Here are some logs to help debug the issue.
In the k8ssandra-operator logs, it says that the `MedusaBackupJob` that is still "IN_PROGRESS" (but with its finish time set to 1970-01-01T00:00:00Z) is blocking the next backup schedule:
2024-05-02T07:46:58.393Z DEBUG Postponing backup schedule due to existing active backups {"controller": "medusabackupschedule", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupSchedule", "MedusaBackupSchedule": {"name":"medusa-backup-schedule-dc1","namespace":"cassandra"}, "namespace": "cassandra", "name": "medusa-backup-schedule-dc1", "reconcileID": "7e23a08a-864f-40dd-bb88-47195370663e", "medusabackupscheduler": "cassandra/medusa-backup-schedule-dc1", "MedusaBackupSchedule": {"namespace": "cassandra", "name": "medusa-backup-schedule-dc1"}}
In my case there is only one node that did not finish the backup, and here are the logs of the medusa container, which keep repeating:
[2024-05-02 07:58:46,167] INFO: Using service file: /etc/medusa-secrets/credentials
[2024-05-02 07:58:46,167] DEBUG: Connecting to Google Storage
[2024-05-02 07:58:46,168] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql was not found in cache.
[2024-05-02 07:58:46,168] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql
[2024-05-02 07:58:46,325] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/schema.cql last modification time is 2024-04-06 02:00:16.517000
[2024-05-02 07:58:46,326] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,326] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,348] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,348] DEBUG: Registered backup name medusa-backup-schedule-dc1-1712368800 found existing, replacing with new
[2024-05-02 07:58:46,348] DEBUG: Cancelling backup id: medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] INFO: Registered backup id medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,348] DEBUG: Blob main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json was not found in cache.
[2024-05-02 07:58:46,349] DEBUG: [Storage] Getting object main-cluster/main-cluster-dc1-rack1-sts-2/medusa-backup-schedule-dc1-1712368800/meta/manifest.json
[2024-05-02 07:58:46,372] DEBUG: No manifest blob for backup medusa-backup-schedule-dc1-1712368800 of fqdn main-cluster-dc1-rack1-sts-2
[2024-05-02 07:58:46,373] DEBUG: Updated from existing status: -1 to new status: 0 for backup id: medusa-backup-schedule-dc1-1712368800
[2024-05-02 07:58:46,373] DEBUG: Disconnecting from Google Storage
@rzvoncek, did we change something recently that would set the finishTime where it wasn't the case before?
For information, deleting the `MedusaBackupJob` stuck with `IN_PROGRESS` status triggers a new `MedusaBackupJob` from the `MedusaBackupSchedule`.
Hi. I tried to reproduce this, but I could not.
On a fresh deployment, I tried starting the sync task always after creating the MedusaBackupJob, but sometimes before it got its start timestamp and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator. In both cases, what I saw was that doing the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.
It might be that I'd need a `MedusaBackupJob` stuck in `IN_PROGRESS`, which currently happens a lot because of a bug we fixed here. I think I'll go ahead and try to reproduce this as well.
> Hi. I tried to reproduce this, but I could not.
> On a fresh deployment, I tried starting the sync task always after creating the MedusaBackupJob, but sometimes before it got its start timestamp and sometimes after. I did this on both the current main and the 1.14 k8ssandra-operator. In both cases, what I saw was that doing the sync before the start timestamp appeared broke the eventual MedusaBackup, which did not correctly show the file count and size.
What is the number of replicas of your Cassandra datacenter? I observed that with 1 replica the issue is hard to reproduce, but with 3 replicas it happened consistently (since the backup job takes longer).
I've got the same issue on a 6-node cluster. I'm trying to fix it by cleaning all Medusa tasks, backup jobs, schedules, snapshots, etc. No luck at the moment.
Hi folks, the only place where this could happen in the code is here. If we have a 0 value for the `finishTime`, it's going to convert to `1970-01-01T00:00:00Z`.
Nonetheless, this means the sync is running concurrently with a backup that isn't finished yet, and is creating that `MedusaBackup` object.
It's most likely that the sync should only locally persist completed backups, since there's nothing to do with incomplete ones anyway.
I'll prepare a PR for this, so that if we don't have a finish time we don't sync the `MedusaBackup` object.
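Roughly, the guard would look like the sketch below. This is not the actual PR: the `backupSummary` type and its fields are hypothetical stand-ins for whatever the sync task gets back from Medusa; only the shape of the check matters.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backupSummary is a stand-in for the backup record the sync task gets back
// from Medusa; the field names are hypothetical.
type backupSummary struct {
	Name       string
	StartTime  int64 // Unix seconds
	FinishTime int64 // Unix seconds, 0 while the backup is still running
}

// syncableFinishTime returns the finish time to store on the MedusaBackup
// status, and false when the backup is incomplete and should be skipped.
func syncableFinishTime(b backupSummary) (metav1.Time, bool) {
	if b.FinishTime == 0 {
		// Still in progress: skip instead of persisting the Unix epoch.
		return metav1.Time{}, false
	}
	return metav1.Unix(b.FinishTime, 0), true
}

func main() {
	backups := []backupSummary{
		{Name: "backup-1", StartTime: 1714150753, FinishTime: 1714150953},
		{Name: "backup-2", StartTime: 1714150753, FinishTime: 0},
	}
	for _, b := range backups {
		if t, ok := syncableFinishTime(b); ok {
			fmt.Printf("%s: create/update MedusaBackup with finishTime %s\n",
				b.Name, t.UTC().Format("2006-01-02T15:04:05Z"))
		} else {
			fmt.Printf("%s: still in progress, skipping sync\n", b.Name)
		}
	}
}
```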