Kubernetes and Linode disagree about volume state
Bug Reporting
Running a Kubernetes cluster with persistent storage, I'm seeing an inconsistency between Kubernetes and the Linode console/CLI output for volumes. I have three volumes that Kubernetes shows as Bound but Linode shows as Unattached.
Expected Behavior
For kubectl get pv and linode-cli volumes list to agree.
Actual Behavior
Some volumes which are in a Bound state in Kubernetes show as Unattached in Linode.
Steps to Reproduce the Problem
- Create Kubernetes Cluster
- Deploy pods that use PV/PVC
- Hope it happens
Environment Specifications
- Kubernetes Version: v1.15.3
- crictl Version: v1.15.0
- CNI Version: v0.8.2
Screenshots, Code Blocks, and Logs
$ k -n data-lake get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-data-lake-consul-shared-consul-server-0 Bound pvc-2a75786321a54dc7 10Gi RWO linode-block-storage 5d1h
data-data-lake-consul-shared-consul-server-1 Bound pvc-c527988751224331 10Gi RWO linode-block-storage 5d1h
data-data-lake-consul-shared-consul-server-2 Bound pvc-07349048faf54fc2 10Gi RWO linode-block-storage 5d1h
data-redis-shared-redis-ha-server-0 Bound pvc-e7d2504da0174c4e 10Gi RWO linode-block-storage 4d5h
data-redis-shared-redis-ha-server-1 Bound pvc-2ff4f633fd044b7f 10Gi RWO linode-block-storage 4d5h
data-redis-shared-redis-ha-server-2 Bound pvc-7a2a92483c804a47 10Gi RWO linode-block-storage 4d5h
datadir-cockroachdb-shared-cockroachdb-0 Bound pvc-6835ba194ace47f8 10Gi RWO linode-block-storage 4d5h
datadir-cockroachdb-shared-cockroachdb-1 Bound pvc-71ad3db47d4d4b25 10Gi RWO linode-block-storage 4d5h
datadir-cockroachdb-shared-cockroachdb-2 Bound pvc-306dce8bcd7a4c0b 10Gi RWO linode-block-storage 4d5h
datadir-consul-dev-0 Bound pvc-c7f54907a7ee4bd2 10Gi RWO linode-block-storage 5d1h
datadir-consul-dev-1 Bound pvc-7a9423257d7b4e71 10Gi RWO linode-block-storage 5d1h
datadir-consul-dev-2 Bound pvc-d9043b0bb18a45c5 10Gi RWO linode-block-storage 5d1h
$ k -n data-lake get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-07349048faf54fc2 10Gi RWO Delete Bound data-lake/data-data-lake-consul-shared-consul-server-2 linode-block-storage 5d1h
pvc-2a75786321a54dc7 10Gi RWO Delete Bound data-lake/data-data-lake-consul-shared-consul-server-0 linode-block-storage 5d1h
pvc-2ff4f633fd044b7f 10Gi RWO Delete Bound data-lake/data-redis-shared-redis-ha-server-1 linode-block-storage 4d5h
pvc-306dce8bcd7a4c0b 10Gi RWO Delete Bound data-lake/datadir-cockroachdb-shared-cockroachdb-2 linode-block-storage 4d5h
pvc-6835ba194ace47f8 10Gi RWO Delete Bound data-lake/datadir-cockroachdb-shared-cockroachdb-0 linode-block-storage 4d5h
pvc-71ad3db47d4d4b25 10Gi RWO Delete Bound data-lake/datadir-cockroachdb-shared-cockroachdb-1 linode-block-storage 4d5h
pvc-7a2a92483c804a47 10Gi RWO Delete Bound data-lake/data-redis-shared-redis-ha-server-2 linode-block-storage 4d5h
pvc-7a9423257d7b4e71 10Gi RWO Delete Bound data-lake/datadir-consul-dev-1 linode-block-storage 5d1h
pvc-c527988751224331 10Gi RWO Delete Bound data-lake/data-data-lake-consul-shared-consul-server-1 linode-block-storage 5d1h
pvc-c7f54907a7ee4bd2 10Gi RWO Delete Bound data-lake/datadir-consul-dev-0 linode-block-storage 5d1h
pvc-d9043b0bb18a45c5 10Gi RWO Delete Bound data-lake/datadir-consul-dev-2 linode-block-storage 5d1h
pvc-e7d2504da0174c4e 10Gi RWO Delete Bound data-lake/data-redis-shared-redis-ha-server-0 linode-block-storage 4d5h
$ linode-cli volumes list
┌───────┬─────────────────────┬────────┬──────┬────────────┬───────────┐
│ id │ label │ status │ size │ region │ linode_id │
├───────┼─────────────────────┼────────┼──────┼────────────┼───────────┤
│ 42891 │ pvcc7f54907a7ee4bd2 │ active │ 10 │ us-central │ │
│ 42893 │ pvc7a9423257d7b4e71 │ active │ 10 │ us-central │ │
│ 42894 │ pvcd9043b0bb18a45c5 │ active │ 10 │ us-central │ │
│ 42895 │ pvc2a75786321a54dc7 │ active │ 10 │ us-central │ 15796976 │
│ 42896 │ pvc07349048faf54fc2 │ active │ 10 │ us-central │ 15796979 │
│ 42897 │ pvcc527988751224331 │ active │ 10 │ us-central │ 15796975 │
│ 42983 │ pvc682b97ea2b94467e │ active │ 10 │ us-central │ │
│ 42992 │ pvce7d2504da0174c4e │ active │ 10 │ us-central │ 15796976 │
│ 42993 │ pvc2ff4f633fd044b7f │ active │ 10 │ us-central │ 15796975 │
│ 42994 │ pvc7a2a92483c804a47 │ active │ 10 │ us-central │ 15796979 │
│ 42999 │ pvc6835ba194ace47f8 │ active │ 10 │ us-central │ 15796976 │
│ 43000 │ pvc71ad3db47d4d4b25 │ active │ 10 │ us-central │ 15796979 │
│ 43001 │ pvc306dce8bcd7a4c0b │ active │ 10 │ us-central │ 15796975 │
└───────┴─────────────────────┴────────┴──────┴────────────┴───────────┘
The ones without a linode_id show up as Unattached in the web console. One of them genuinely isn't attached to anything, but as you can see above, the other three are Bound in Kubernetes.
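The mismatch above can be checked mechanically by normalizing PV names (the Linode labels drop the dash in "pvc-...") and flagging Bound PVs whose Linode volume has no linode_id. A minimal sketch with the sample data inlined from the output above; the kubectl/linode-cli invocations in the comments are assumptions about how you'd fetch live data, not part of the bug report:

```shell
#!/bin/sh
# Sketch: cross-check Bound PVs against the Linode volume list.
# In practice the two lists might come from something like:
#   kubectl get pv --no-headers -o custom-columns=:metadata.name
#   linode-cli volumes list --text --format label,linode_id
# (treat those exact invocations as assumptions). Sample data is inlined
# from the output above so this runs standalone.

bound_pvs="pvc-c7f54907a7ee4bd2 pvc-7a9423257d7b4e71 pvc-d9043b0bb18a45c5 pvc-2a75786321a54dc7"

# Each line: label [linode_id]; a missing id means Unattached in the console.
linode_vols="pvcc7f54907a7ee4bd2
pvc7a9423257d7b4e71
pvcd9043b0bb18a45c5
pvc2a75786321a54dc7 15796976"

mismatches=""
for pv in $bound_pvs; do
  # Linode labels are the PV name with the dash removed.
  label=$(printf '%s' "$pv" | tr -d '-')
  id=$(printf '%s\n' "$linode_vols" | awk -v l="$label" '$1 == l { print $2 }')
  if [ -z "$id" ]; then
    mismatches="$mismatches $pv"
  fi
done
echo "Bound in Kubernetes but Unattached in Linode:$mismatches"
```

Run against the real outputs, this would print the three volumes in question and skip the one that Linode correctly reports as attached.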
Additional Notes
Thanks for reporting this. I recently saw a similar issue where a volume fails to be detached from one node and reattached to the target node when a pod moves between nodes, causing downtime while the pod waits to come back up. Usually when this happens I get a flood of events saying a volume has been detached, but only one saying it has been attached. I'll try to look into this.
I have yet to determine whether they are actually mounted on the Kubernetes nodes, but none of my apps seem to be mad at me, so I'm assuming they are attached. I can verify on Friday when I have a minute.
In my experience, the volumes do eventually get mounted where they should be, but a failure of a volume to hop nodes is less worrying than an outright mismatch between observed and expected state. Hopefully it'll be clear once I actually start looking at the code.
Is there a discrepancy in how volumes attached with persist_across_boot are reported in volume lists?
~The problem could be here, in NodeUnstageVolume: it looks like we're attempting to unmount the volume again instead of detaching it. I'm going to work on a PR for this.~ After some deeper investigation this doesn't appear to be the issue; the volume is detached by the controllerserver.
Sorry for this issue having sat here for so long, unattended.
If this issue is still valid, please feel free to re-open the issue.