piraeus-operator
Linstor looking for next version of DRBD on evacuate
Hi,
Just installed a new node, and when I tried evacuating an existing one I got this:
ERROR:
Description:
Node: 'talos-if5-jn6' has DRBD version 9.2.6, but version 9.2.7 (or higher) is required
Details:
Node(s): 'talos-if5-jn6', Resource: 'pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184'
Show reports:
linstor error-reports show 65EA25CA-00000-000000
The node still got marked as evacuating, but the new node didn't get any volumes. All of my nodes are using 9.2.6, since that's the only available version for Talos, so I don't know where it's getting 9.2.7 from. Any idea? Is there a config somewhere I may have missed?
Thanks
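For reference, a quick way to confirm which DRBD module version is actually loaded on a Talos node is to read /proc/drbd (a sketch; the node name is just the example from this thread):

# print the loaded DRBD kernel module version on one node (sketch)
talosctl -n talos-if5-jn6 read /proc/drbd
# expected first line: version: 9.2.6 (api:2/proto:86-122)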
Looks to be related to storage pool mixing: https://github.com/LINBIT/linstor-server/commit/700cb6281e2dec7f24f446780fe7ba2b5179524d
So do you have a different type of storage pool on the new node?
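If it helps to compare, something along these lines should list the provider per node at a glance (a sketch, assuming jq is installed and the JSON schema matches the output further down):

# print node, pool and driver for every storage pool (sketch, assumes jq)
linstor -m storage-pool list | jq -r '.[0].stor_pools[] | [.node_name, .stor_pool_name, .driver] | @tsv'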
I do not. The new node is a fresh install of Talos; the only thing it has done is join the cluster (they're all control plane nodes, it's a small home cluster). There's nothing on it at all, just the DRBD 9.2.6 extension like the other nodes:
talos-if5-jn6: user: warning: [2024-03-12T15:36:20.272953785Z]: [talos] [initramfs] enabling system extension drbd 9.2.6-v1.6.6
talos-if5-jn6: kern: warning: [2024-03-12T15:36:29.153182785Z]: drbd: loading out-of-tree module taints kernel.
talos-if5-jn6: kern: info: [2024-03-12T15:36:29.170650785Z]: drbd: initialized. Version: 9.2.6 (api:2/proto:86-122)
talos-if5-jn6: kern: info: [2024-03-12T15:36:29.171099785Z]: drbd: GIT-hash: 52144c0f90a0fb00df6a7d6714ec9034c7af7a28 build by @buildkitsandbox, 2024-03-06 12:26:31
talos-if5-jn6: kern: info: [2024-03-12T15:36:29.171838785Z]: drbd: registered as block device major 147
talos-if5-jn6: kern: info: [2024-03-12T15:36:29.178817785Z]: drbd: registered transport class 'tcp' (version:9.2.6)
Just to add details: I only have the one storage pool, with 5 volumes on it. I have been having a lot of issues with it (see #579), and I'm hoping the cause is a bad node, which I'm trying to replace.
Sorry to be a pain here, but since the latest releases seem to suggest the pods are using DRBD 9.2.8, is there any chance this is caused by a mismatch between the kernel module and the binary in the pods?
One of my volumes got corrupted (because of that other issue: after a node reboot, it was no longer seen as a valid ext4 partition), and I can't restore a backup because re-creating the volume fails with this same error. Kind of stuck here.
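One direct way to compare the kernel module against the userland tools would be something like this (a sketch; the namespace and pod name are placeholders):

# compare kernel vs userland DRBD versions from inside a satellite pod (sketch)
kubectl exec -n piraeus-datastore <satellite-pod> -- drbdadm --version
# compare the DRBD_KERNEL_VERSION and DRBDADM_VERSION lines in the output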
No, for the userland tools (drbdadm, drbdsetup, etc.) it is all the same. Again, it may be some (unintended) difference between node configurations. Could you share the output of linstor -m storage-pool list?
Sure:
[
{
"stor_pools": [
{
"stor_pool_uuid": "32fab312-0c78-43a4-9e58-a50127faadb5",
"stor_pool_name": "DfltDisklessStorPool",
"node_name": "talos-00r-fu9",
"free_space_mgr_name": "talos-00r-fu9;DfltDisklessStorPool",
"free_space": {
"stor_pool_name": "DfltDisklessStorPool",
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807
},
"driver": "DisklessDriver",
"static_traits": [
{
"key": "SupportsSnapshots",
"value": "false"
}
]
},
{
"stor_pool_uuid": "5df20bc5-7fed-4f83-b8ac-b54b64f012bd",
"stor_pool_name": "DfltDisklessStorPool",
"node_name": "talos-fdm-9ig",
"free_space_mgr_name": "talos-fdm-9ig;DfltDisklessStorPool",
"free_space": {
"stor_pool_name": "DfltDisklessStorPool",
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807
},
"driver": "DisklessDriver",
"static_traits": [
{
"key": "SupportsSnapshots",
"value": "false"
}
]
},
{
"stor_pool_uuid": "ef6c433f-cd01-453d-a667-30f219ba93ac",
"stor_pool_name": "DfltDisklessStorPool",
"node_name": "talos-if5-jn6",
"free_space_mgr_name": "talos-if5-jn6;DfltDisklessStorPool",
"free_space": {
"stor_pool_name": "DfltDisklessStorPool",
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807
},
"driver": "DisklessDriver",
"static_traits": [
{
"key": "SupportsSnapshots",
"value": "false"
}
]
},
{
"stor_pool_uuid": "5091a678-d215-4df8-b694-db4b747a01af",
"stor_pool_name": "DfltDisklessStorPool",
"node_name": "talos-ozt-z3h",
"free_space_mgr_name": "talos-ozt-z3h;DfltDisklessStorPool",
"free_space": {
"stor_pool_name": "DfltDisklessStorPool",
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807
},
"driver": "DisklessDriver",
"static_traits": [
{
"key": "SupportsSnapshots",
"value": "false"
}
]
},
{
"stor_pool_uuid": "405cef60-8481-4256-846c-366917ea019c",
"stor_pool_name": "main-pool",
"node_name": "talos-00r-fu9",
"free_space_mgr_name": "talos-00r-fu9;main-pool",
"free_space": {
"stor_pool_name": "main-pool",
"free_capacity": 212224748,
"total_capacity": 248705384
},
"driver": "",
"static_traits": [
{
"key": "Provisioning",
"value": "Thin"
},
{
"key": "SupportsSnapshots",
"value": "true"
}
],
"props": [
{
"key": "Aux/piraeus.io/managed-by",
"value": "piraeus-operator"
},
{
"key": "Aux/piraeus.io/last-applied",
"value": "[\"Aux/piraeus.io/managed-by\",\"StorDriver/StorPoolName\"]"
},
{
"key": "StorDriver/StorPoolName",
"value": "/var/lib/piraeus-datastore/main-pool"
}
]
},
{
"stor_pool_uuid": "68c2c5f0-9d07-4daa-85a0-a3858f456fa9",
"stor_pool_name": "main-pool",
"node_name": "talos-fdm-9ig",
"free_space_mgr_name": "talos-fdm-9ig;main-pool",
"free_space": {
"stor_pool_name": "main-pool",
"free_capacity": 207370408,
"total_capacity": 248705384
},
"driver": "",
"static_traits": [
{
"key": "Provisioning",
"value": "Thin"
},
{
"key": "SupportsSnapshots",
"value": "true"
}
],
"props": [
{
"key": "Aux/piraeus.io/managed-by",
"value": "piraeus-operator"
},
{
"key": "Aux/piraeus.io/last-applied",
"value": "[\"Aux/piraeus.io/managed-by\",\"StorDriver/StorPoolName\"]"
},
{
"key": "StorDriver/StorPoolName",
"value": "/var/lib/piraeus-datastore/main-pool"
}
]
},
{
"stor_pool_uuid": "cd300917-84ac-4f49-8067-9bc7898ee1f4",
"stor_pool_name": "main-pool",
"node_name": "talos-if5-jn6",
"free_space_mgr_name": "talos-if5-jn6;main-pool",
"free_space": {
"stor_pool_name": "main-pool",
"free_capacity": 112252248,
"total_capacity": 123737088
},
"driver": "",
"static_traits": [
{
"key": "Provisioning",
"value": "Thin"
},
{
"key": "SupportsSnapshots",
"value": "true"
}
],
"props": [
{
"key": "Aux/piraeus.io/managed-by",
"value": "piraeus-operator"
},
{
"key": "Aux/piraeus.io/last-applied",
"value": "[\"Aux/piraeus.io/managed-by\",\"StorDriver/StorPoolName\"]"
},
{
"key": "StorDriver/StorPoolName",
"value": "/var/lib/piraeus-datastore/main-pool"
},
{
"key": "StorDriver/internal/AllocationGranularity",
"value": "1"
}
]
},
{
"stor_pool_uuid": "3273dc0a-72f2-4706-8588-b0986da0bd52",
"stor_pool_name": "main-pool",
"node_name": "talos-ozt-z3h",
"free_space_mgr_name": "talos-ozt-z3h;main-pool",
"free_space": {
"stor_pool_name": "main-pool",
"free_capacity": 82570412,
"total_capacity": 115922944
},
"driver": "",
"static_traits": [
{
"key": "Provisioning",
"value": "Thin"
},
{
"key": "SupportsSnapshots",
"value": "true"
}
],
"props": [
{
"key": "Aux/piraeus.io/managed-by",
"value": "piraeus-operator"
},
{
"key": "Aux/piraeus.io/last-applied",
"value": "[\"Aux/piraeus.io/managed-by\",\"StorDriver/StorPoolName\"]"
},
{
"key": "StorDriver/StorPoolName",
"value": "/var/lib/piraeus-datastore/main-pool"
}
]
}
]
}
]
I don't know if that's it, but the only difference I see is that the new node somehow has StorDriver/internal/AllocationGranularity set to 1.
I can't seem to unset it either: The key 'StorDriver/internal/AllocationGranularity' is not whitelisted.
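For reference, the unset attempt would look roughly like this (the exact CLI syntax here is a sketch, not verified against this LINSTOR version):

# try to clear the property by setting it to an empty value (sketch; rejected as not whitelisted)
linstor storage-pool set-property talos-if5-jn6 main-pool StorDriver/internal/AllocationGranularity ""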
That is probably why LINSTOR thinks storage pool mixing is involved. I'll do a bit of digging into when this property gets added.
Awesome, thank you very much.
For now, I figured out that by marking that new node as evacuating, I could force my volume to be created on the other ones, so I was able to restore my backup. All good for now, no rush. Appreciate the help!
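For anyone searching later, that workaround amounts to roughly the following (a sketch; the node name is the new node from this thread):

# mark the new node as evacuating so new resource placement avoids it (sketch)
linstor node evacuate talos-if5-jn6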
Ok, it seems related to the "old" storage pools having been created by a previous LINSTOR version, while the new storage pool was created with LINSTOR >= 1.26.
I guess there is some missing migration in LINSTOR that then causes it to see different values for the granularity, so it runs into the storage pool mixing case.
As a workaround, here is a script that creates the property in the LINSTOR database:
#!/bin/sh
set -e
# LINSTOR stores node and pool names upper-cased in its property instance paths
NODE="$(echo "$1" | tr a-z A-Z)"
POOL="$(echo "$2" | tr a-z A-Z)"
# the resource name is the sha256 hash of "<props instance>:<property key>"
KEY="$(echo -n "/STORPOOLCONF/$NODE/$POOL:StorDriver/internal/AllocationGranularity" | sha256sum | cut -d " " -f 1)"
cat <<EOF
apiVersion: internal.linstor.linbit.com/v1-25-1
kind: PropsContainers
metadata:
  name: $KEY
spec:
  prop_key: StorDriver/internal/AllocationGranularity
  prop_value: "1"
  props_instance: /STORPOOLCONF/$NODE/$POOL
EOF
You can run it and apply the created resource, then restart the linstor controller.
bash script.sh talos-ozt-z3h main-pool | kubectl create -f -
Afterwards, evacuation will also work between old and new nodes.
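A quick way to verify the property landed before retrying (a sketch; assumes jq and reuses the names from this thread):

# show the props of every main-pool, then retry the evacuation (sketch)
linstor -m storage-pool list | jq '.[0].stor_pools[] | select(.stor_pool_name == "main-pool") | {node: .node_name, props: .props}'
linstor node evacuate talos-ozt-z3h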
That did seem to work, thank you very much!
I think I ran into the same issue on a 3-node Proxmox cluster.
One node (pve1) was evacuated and upgraded, while the other two are still using an older proxmox/linstor/drbd-module version. When I try to evacuate another node to update it, I get:
Node: 'pve3' has DRBD version 9.2.2, but version 9.2.7 (or higher) is required
I see "StorDriver/internal/AllocationGranularity": "16" on one node and "StorDriver/internal/AllocationGranularity": "8" on the other two nodes.
I don't really understand what your script (@WanzenBug) does - it is Kubernetes-specific, I'd guess?
Any hints on how to solve this?
pve1 ~ # linstor -m storage-pool list
[
[
{
"storage_pool_name": "DfltDisklessStorPool",
"node_name": "pve1",
"provider_kind": "DISKLESS",
"static_traits": {
"SupportsSnapshots": "false"
},
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807,
"free_space_mgr_name": "pve1;DfltDisklessStorPool",
"uuid": "0d9b1f16-c3a2-499c-a23e-fc20e6b157e0",
"supports_snapshots": false,
"external_locking": false
},
{
"storage_pool_name": "DfltDisklessStorPool",
"node_name": "pve2",
"provider_kind": "DISKLESS",
"static_traits": {
"SupportsSnapshots": "false"
},
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807,
"free_space_mgr_name": "pve2;DfltDisklessStorPool",
"uuid": "dcd4d766-9b50-4d86-85b7-4714aa967196",
"supports_snapshots": false,
"external_locking": false
},
{
"storage_pool_name": "DfltDisklessStorPool",
"node_name": "pve3",
"provider_kind": "DISKLESS",
"static_traits": {
"SupportsSnapshots": "false"
},
"free_capacity": 9223372036854775807,
"total_capacity": 9223372036854775807,
"free_space_mgr_name": "pve3;DfltDisklessStorPool",
"uuid": "d52a9bb1-f562-4dec-b326-21acf54c194d",
"supports_snapshots": false,
"external_locking": false
},
{
"storage_pool_name": "drbd_disk",
"node_name": "pve1",
"provider_kind": "ZFS",
"props": {
"StorDriver/StorPoolName": "zpool_disk_drbd",
"StorDriver/internal/AllocationGranularity": "16"
},
"static_traits": {
"Provisioning": "Fat",
"SupportsSnapshots": "true"
},
"free_capacity": 7344034195,
"total_capacity": 9361686528,
"free_space_mgr_name": "pve1;drbd_disk",
"uuid": "c448a177-5185-44fb-89ff-e81ade460277",
"supports_snapshots": true,
"external_locking": false
},
{
"storage_pool_name": "drbd_disk",
"node_name": "pve2",
"provider_kind": "ZFS",
"props": {
"StorDriver/StorPoolName": "zpool_disk_drbd",
"StorDriver/internal/AllocationGranularity": "8"
},
"static_traits": {
"Provisioning": "Fat",
"SupportsSnapshots": "true"
},
"free_capacity": 4304483389,
"total_capacity": 9361686528,
"free_space_mgr_name": "pve2;drbd_disk",
"uuid": "cb73c1f4-baae-4a59-90c1-9cfa8ff9a934",
"supports_snapshots": true,
"external_locking": false
},
{
"storage_pool_name": "drbd_disk",
"node_name": "pve3",
"provider_kind": "ZFS",
"props": {
"StorDriver/StorPoolName": "zpool_disk_drbd",
"StorDriver/internal/AllocationGranularity": "8"
},
"static_traits": {
"Provisioning": "Fat",
"SupportsSnapshots": "true"
},
"free_capacity": 4304441461,
"total_capacity": 9361686528,
"free_space_mgr_name": "pve3;drbd_disk",
"uuid": "f0805868-e47d-43dc-a44c-9e9ed3460df2",
"supports_snapshots": true,
"external_locking": false
}
]
]
Your issue is a bit different. Since you upgraded one node, I assume you also have a newer version of ZFS. LINSTOR tries to get the default block size (https://github.com/LINBIT/linstor-server/commit/72ddcb483e).
On newer ZFS versions (>= 2.2.0) this default block size was changed to 16k instead of 8k: https://github.com/openzfs/zfs/commit/72f0521
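You can check what default your ZFS hands out by creating a throwaway zvol and reading its volblocksize (a sketch; the pool and zvol names are examples):

# check the default volblocksize on each node (sketch)
zfs create -V 1M zpool_disk_drbd/blocksize-test
zfs get -H -o value volblocksize zpool_disk_drbd/blocksize-test   # 8K on ZFS <= 2.1.x, 16K on >= 2.2.0
zfs destroy zpool_disk_drbd/blocksize-test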
So there really is a mismatch between these sizes, and because of known bugs in older DRBD versions, LINSTOR will refuse to "mix" those storage pools. You may want to ask on the LINBIT forums how to proceed: https://forums.linbit.com/