linstor-csi
linstor-csi copied to clipboard
NumberFormatExceptions followed by "snapshot name is in use for volume ..."
I'm trying to setup snapshot shipping via CSI (https://github.com/piraeusdatastore/linstor-csi/pull/142) without much success. Most recently I had a flood of errors around parse failures, but there's not enough detail to figure out what it's failing to parse. I thought maybe it had something to do with my VolumeClassName ("linstor.minio") having a period in the name maybe throwing off parsing, but I was able to create another backup (from a significantly smaller volume) using the same class. This all might be a red herring...
It seems after these NumberFormatExceptions happen the system gets into a mixed up state as you can see below -- namely it tries to re-assign an existing snapshot name to a different PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
creationTimestamp: "2022-08-29T08:31:01Z"
finalizers:
- snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
- snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
generateName: velero-vmsingle-vm-victoria-metrics-k8s-stack-
generation: 1
labels:
velero.io/backup-name: linstor-20220829080004
name: velero-vmsingle-vm-victoria-metrics-k8s-stack-844m6
namespace: prometheus
resourceVersion: "56560018"
uid: 17660871-6cf4-4d07-8c11-8c9f5922625a
spec:
source:
persistentVolumeClaimName: vmsingle-vm-victoria-metrics-k8s-stack
volumeSnapshotClassName: linstor.minio
status:
boundVolumeSnapshotContentName: snapcontent-17660871-6cf4-4d07-8c11-8c9f5922625a
error:
message: 'Failed to check and update snapshot content: failed to take snapshot
of the volume pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf: "rpc error: code = AlreadyExists
desc = can''t use \"snapshot-17660871-6cf4-4d07-8c11-8c9f5922625a\" for snapshot
name for volume \"pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf\", snapshot name
is in use for volume \"pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93\""'
time: "2022-08-29T14:58:59Z"
readyToUse: false
pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93 is indeed another volume that was snapshotted around the same time as this one. Is there perhaps a race condition where creating multiple snapshots around the same time can cause names to not be unique?
# linstor err l
| 630C4D06-00000-000045 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000046 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000047 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000048 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000049 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000050 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000051 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000052 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000053 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000054 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000055 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000056 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000057 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000058 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000059 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000060 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000061 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000062 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000063 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000064 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000065 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000066 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000067 | 2022-08-29 14:26:04 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: For input string: ".2022022E4.2022022E4" |
| 630C4D06-00000-000068 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000069 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000070 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000071 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000072 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000073 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000074 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
| 630C4D06-00000-000075 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points |
# linstor err show (multiple points)
ERROR REPORT 630C4D06-00000-000066
============================================================
Application: LINBIT�� LINSTOR
Module: Controller
Version: 1.19.1
Build ID: a758bf07796c374fd2004465b0d8690209b74356
Build time: 2022-07-27T06:36:54+00:00
Error time: 2022-08-29 14:21:52
Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NumberFormatException
Class canonical name: java.lang.NumberFormatException
Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #1890
Error message: multiple points
Call backtrace:
Method Native Class:Line number
readJavaFormatString N jdk.internal.math.FloatingDecimal:1890
parseDouble N jdk.internal.math.FloatingDecimal:110
parseDouble N java.lang.Double:543
getDouble N java.text.DigitList:169
parse N java.text.DecimalFormat:2126
subParse N java.text.SimpleDateFormat:1931
parse N java.text.SimpleDateFormat:1541
parse N java.text.DateFormat:393
<init> N com.linbit.linstor.backupshipping.S3MetafileNameInfo:46
listBackups N com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679
lambda$listBackups$0 N com.linbit.linstor.api.rest.v1.Backups:304
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:231
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:203
listBackups N com.linbit.linstor.api.rest.v1.Backups:298
invoke N jdk.internal.reflect.GeneratedMethodAccessor194:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:566
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:469
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:391
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:80
run N org.glassfish.jersey.server.ServerRuntime$1:253
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:232
handle N org.glassfish.jersey.server.ApplicationHandler:680
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:200
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
run N java.lang.Thread:829
END OF ERROR REPORT.
# linstor err show (For input string)
ERROR REPORT 630C4D06-00000-000067
============================================================
Application: LINBIT�� LINSTOR
Module: Controller
Version: 1.19.1
Build ID: a758bf07796c374fd2004465b0d8690209b74356
Build time: 2022-07-27T06:36:54+00:00
Error time: 2022-08-29 14:26:04
Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NumberFormatException
Class canonical name: java.lang.NumberFormatException
Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #2054
Error message: For input string: ".2022022E4.2022022E4"
Call backtrace:
Method Native Class:Line number
readJavaFormatString N jdk.internal.math.FloatingDecimal:2054
parseDouble N jdk.internal.math.FloatingDecimal:110
parseDouble N java.lang.Double:543
getDouble N java.text.DigitList:169
parse N java.text.DecimalFormat:2126
subParse N java.text.SimpleDateFormat:1931
parse N java.text.SimpleDateFormat:1541
parse N java.text.DateFormat:393
<init> N com.linbit.linstor.backupshipping.S3MetafileNameInfo:46
listBackups N com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679
lambda$listBackups$0 N com.linbit.linstor.api.rest.v1.Backups:304
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:231
doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:203
listBackups N com.linbit.linstor.api.rest.v1.Backups:298
invoke N jdk.internal.reflect.GeneratedMethodAccessor194:unknown
invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43
invoke N java.lang.reflect.Method:566
lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:469
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:391
apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:80
run N org.glassfish.jersey.server.ServerRuntime$1:253
call N org.glassfish.jersey.internal.Errors$1:248
call N org.glassfish.jersey.internal.Errors$1:244
process N org.glassfish.jersey.internal.Errors:292
process N org.glassfish.jersey.internal.Errors:274
process N org.glassfish.jersey.internal.Errors:244
runInScope N org.glassfish.jersey.process.internal.RequestScope:265
process N org.glassfish.jersey.server.ServerRuntime:232
handle N org.glassfish.jersey.server.ApplicationHandler:680
service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
run N org.glassfish.grizzly.http.server.HttpHandler$1:200
doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
run N java.lang.Thread:829
END OF ERROR REPORT.
Weird. Since the error seems to be in listBackups
(which gets called by the CSI driver on creation) I assume it's not actually about any snapshot it tries to create, but about what files already exist in the bucket. This looks like an issue that needs to be fixed upstream: https://github.com/LINBIT/linstor-server/issues
Running with --log-level=debug
shed a little more light on this...
time="2022-09-02T02:10:03Z" level=debug msg="no snapshot matching id found, trying backups" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1367" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/s3'"
time="2022-09-02T02:10:03Z" level=debug msg="listing backups in remote" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1377" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client remote=minio
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/minio/backups?snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba'"
time="2022-09-02T02:10:03Z" level=debug msg="found existing snapshot" func=github.com/piraeusdatastore/linstor-csi/pkg/driver.Driver.CreateSnapshot file="/src/pkg/driver/driver.go:965" existingSnapshot="snapshot_id:\"snapshot-dc295533-ba78-4e3f-b110-54175f1386f1\" source_volume_id:\"pvc-621a06d8-0503-4cf3-84af-d75e1bf40508\" creation_time:<seconds:1662034561 > " linstorCSIComponent=driver nodeID=pi4c provisioner=linstor.csi.linbit.com requestedSnapshotName=snapshot-1a9289d2-b984-4527-baf4-522b42908e0e requestedSnapshotSourceVolume=pvc-d2ec0c0d-f114-4fe2-9c30-a56279e356b6 version=v0.20.0-d514e41db7cdcb580769cc69f1c1ef2b8a5def5d
That last line is interesting... It queries for a backup with snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba
and it gets a result... but that result is a totally different one (snapshot-dc295533-ba78-4e3f-b110-54175f1386f1
) than the one it asked for?
Furthermore the snapshot returned is unrecoverable. That should at least unblock me (I can delete it), although I'm surprised it wasn't deleted/cleaned up automatically (it was totally hidden in the CLI, which is as good as deleted, anyway).
root@piraeus-op-cs-controller-5b8ff7f4c8-nqns5:/# linstor backup l minio -s snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba -p
+------------------------------------------------------------------------------------------------------------------------------------+
| Resource | Snapshot | Finished at | Based On | Status |
|====================================================================================================================================|
| pvc-621a06d8-0503-4cf3-84af-d75e1bf40508 | snapshot-dc295533-ba78-4e3f-b110-54175f1386f1 | | | Not restorable |
+------------------------------------------------------------------------------------------------------------------------------------+
What's still confusing me (and where there's maybe room for the CSI to improve) is why this UUID keeps getting recycled for so many different volumes. I know very little about CSI, but naively it seems like each snapshot request should have its own UUID, making collisions like this impossible. The only way I can see that happening is if CompatibleSnapshotId
gets the same invalid name for multiple snapshots... some of my PVCs do have identical names, if that's relevant? Is there any downside to adding entropy to the fallback ID that's generated?
@WanzenBug take a look at https://github.com/LINBIT/linstor-server/pull/314 -- the presence of any backups in the cluster completely breaks snapshotting.