linstor-server
Newly created replicas become StandAlone
Hi, today I tried to relocate some replicas and they got stuck in split-brain, which is weird because in some cases the resources were InUse.
I still don't understand what is wrong here, so I tried to collect all possible logs to sort it out.
Case 1
# linstor r l -r pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-09ca42d9-ec12-45ff-9251-ece417fba72f ┊ m1c12 ┊ 7423 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r c --auto-place 2 --replicas-on-same=opennebula-1 pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'm1c6 (thindata)'
SUCCESS:
Successfully set property key(s): StorPoolName
INFO:
Tie breaker resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' created on DfltDisklessStorPool
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Added peer(s) 'm1c6' to resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c10'
SUCCESS:
Added peer(s) 'm1c6' to resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c12'
SUCCESS:
Created resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c6'
ERROR:
Resource did not became ready on node 'm1c6' within reasonable time, check Satellite for errors.
SUCCESS:
Created resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c10'
SUCCESS:
Added peer(s) 'm1c10' to resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c12'
SUCCESS:
Added peer(s) 'm1c10' to resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c6'
SUCCESS:
Description:
Resource 'pvc-09ca42d9-ec12-45ff-9251-ece417fba72f' on 'm1c10' ready
Details:
Auto-placing resource: pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
# linstor r l -r pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-09ca42d9-ec12-45ff-9251-ece417fba72f ┊ m1c12 ┊ 7423 ┊ Unused ┊ StandAlone(m1c6) ┊ UpToDate ┊ ┊
┊ pvc-09ca42d9-ec12-45ff-9251-ece417fba72f ┊ m1c6 ┊ 7423 ┊ Unused ┊ StandAlone(m1c12) ┊ UpToDate ┊ 2021-02-02 10:02:34 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor v l -r pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m1c12 ┊ pvc-09ca42d9-ec12-45ff-9251-ece417fba72f ┊ thindata ┊ 0 ┊ 1332 ┊ /dev/drbd1332 ┊ 1.47 GiB ┊ Unused ┊ Outdated ┊
┊ m1c6 ┊ pvc-09ca42d9-ec12-45ff-9251-ece417fba72f ┊ thindata ┊ 0 ┊ 1332 ┊ /dev/drbd1332 ┊ 1.25 GiB ┊ InUse ┊ UpToDate ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@m1c6:~# drbdadm status pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
pvc-09ca42d9-ec12-45ff-9251-ece417fba72f role:Primary
disk:UpToDate
m1c10 role:Secondary
peer-disk:Diskless
m1c12 connection:StandAlone
root@m1c10:~# drbdadm status pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
pvc-09ca42d9-ec12-45ff-9251-ece417fba72f role:Secondary
disk:Diskless
m1c12 role:Secondary
peer-disk:Outdated
m1c6 role:Primary
peer-disk:UpToDate
root@m1c12:~# drbdadm status pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
pvc-09ca42d9-ec12-45ff-9251-ece417fba72f role:Secondary
disk:Outdated
m1c10 role:Secondary
peer-disk:Diskless
m1c6 connection:StandAlone
root@m1c6:~# cat /proc/drbd
version: 9.0.27-1 (api:2/proto:86-118)
GIT-hash: bea41a056bb2abe4d5dfd2f69863282dfa1b2257 build by @kaniko-h5qtt1, 2021-01-08 10:32:54
Transports (api:16): tcp (9.0.27-1)
root@m1c10:~# cat /proc/drbd
version: 9.0.25-1 (api:2/proto:86-117)
GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by @kaniko-b9y9e3, 2020-10-06 15:43:58
Transports (api:16): tcp (9.0.25-1)
root@m1c12:~# cat /proc/drbd
version: 9.0.25-1 (api:2/proto:86-117)
GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by @kaniko-b9y9e3, 2020-10-06 15:43:58
Transports (api:16): tcp (9.0.25-1)
Case 2
# linstor r l -r pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ m1c10 ┊ 7425 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2020-09-22 22:33:37 ┊
┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ m1c4 ┊ 7425 ┊ InUse ┊ Ok ┊ Diskless ┊ 2021-02-01 16:51:01 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r c --auto-place 2 --replicas-on-same=opennebula-1 pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'pvc-44d7448c-588d-46ff-bd53-b451ae1511d2' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'm1c6 (thindata)'
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Added peer(s) 'm1c6' to resource 'pvc-44d7448c-588d-46ff-bd53-b451ae1511d2' on 'm1c4'
SUCCESS:
Added peer(s) 'm1c6' to resource 'pvc-44d7448c-588d-46ff-bd53-b451ae1511d2' on 'm1c10'
SUCCESS:
Created resource 'pvc-44d7448c-588d-46ff-bd53-b451ae1511d2' on 'm1c6'
ERROR:
Resource did not became ready on node 'm1c6' within reasonable time, check Satellite for errors.
# linstor r l -r pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ m1c10 ┊ 7425 ┊ Unused ┊ StandAlone(m1c6) ┊ UpToDate ┊ 2020-09-22 22:33:37 ┊
┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ m1c4 ┊ 7425 ┊ InUse ┊ StandAlone(m1c6) ┊ Diskless ┊ 2021-02-01 16:51:01 ┊
┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ m1c6 ┊ 7425 ┊ Unused ┊ Connecting(m1c4),StandAlone(m1c10) ┊ UpToDate ┊ 2021-02-02 10:04:28 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor v l -r pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m1c10 ┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ thindata ┊ 0 ┊ 1333 ┊ /dev/drbd1333 ┊ 1.38 GiB ┊ Unused ┊ UpToDate ┊
┊ m1c4 ┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ DfltDisklessStorPool ┊ 0 ┊ 1333 ┊ /dev/drbd1333 ┊ ┊ InUse ┊ Diskless ┊
┊ m1c6 ┊ pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 ┊ thindata ┊ 0 ┊ 1333 ┊ /dev/drbd1333 ┊ 2.05 MiB ┊ Unused ┊ UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@m1c4:~# drbdadm status pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 role:Primary
disk:Diskless
m1c10 role:Secondary
peer-disk:UpToDate
m1c6 connection:StandAlone
root@m1c6:~# drbdadm status pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 role:Secondary
disk:UpToDate
m1c10 connection:StandAlone
m1c4 connection:Connecting
root@m1c10:~# drbdadm status pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 role:Secondary
disk:UpToDate
m1c4 role:Primary
peer-disk:Diskless
m1c6 connection:StandAlone
root@m1c4:~# cat /proc/drbd
version: 9.0.27-1 (api:2/proto:86-118)
GIT-hash: bea41a056bb2abe4d5dfd2f69863282dfa1b2257 build by @kaniko-h5qtt1, 2021-01-08 10:32:54
Transports (api:16): tcp (9.0.27-1)
root@m1c6:~# cat /proc/drbd
version: 9.0.27-1 (api:2/proto:86-118)
GIT-hash: bea41a056bb2abe4d5dfd2f69863282dfa1b2257 build by @kaniko-h5qtt1, 2021-01-08 10:32:54
Transports (api:16): tcp (9.0.27-1)
root@m1c10:~# cat /proc/drbd
version: 9.0.25-1 (api:2/proto:86-117)
GIT-hash: 1053e9f98123e8293e9f2897af654b40cde0d24c build by @kaniko-b9y9e3, 2020-10-06 15:43:58
Transports (api:16): tcp (9.0.25-1)
Thank you for attaching the logs, but I am afraid that they are quite useless :(
The satellite logs only state that the resource was received and is going to be applied.
The controller log only states that an error occurred, pointing to the attached ErrorReport, which in turn only states that something (I'd guess k8s) is trying to delete a resource, which the linstor-controller rejects because that resource is currently in use.
Nothing from the logs has anything to do with the "StandAlone" issue.
Can you first verify whether this is really a split-brain scenario (StandAlone does not necessarily mean split-brain) by checking dmesg for the given resource?
Also, as usual, if this is reproducible, TRACE logs from the satellites would help a lot. The other thing is, can you please also check linstor err list for any ErrorReports from m1c6 from the time of this incident? The satellite might have reported an error, but as the controller waited and ran into the timeout, the client might not have been notified about the satellite error report.
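As a concrete starting point, those checks could look roughly like this (just a sketch, using the Case 1 resource name and the m1c6 node from above; adapt node and resource as needed):
root@m1c6:~# dmesg -T | grep pvc-09ca42d9-ec12-45ff-9251-ece417fba72f
# linstor err list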
@ghernadi you're right, dmesg has no messages about split-brain detection, but I'm worried that both devices are marked as UpToDate while only one contains the correct data.
I've tried to fix the resource with drbdadm disconnect / drbdadm invalidate / drbdadm connect,
but the resource still ends up in an Inconsistent state with Unrelated data found:
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Handshake to peer 0 successful: Agreed network protocol version 117
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Peer authenticated using 20 bytes HMAC
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Starting ack_recv thread (from drbd_r_pvc-44d7 [957116])
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2/0 drbd1333 m1c10: drbd_sync_handshake:
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2/0 drbd1333 m1c10: self 39A82B9F45E33FC6:0000000000000000:23D33860FB2E3AE0:D6FEE85957422E62 bits:2621894 flags:24
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2/0 drbd1333 m1c10: peer 1652810B12A73808:F2C997BFB52C03CE:0000000000000000:0000000000000000 bits:308259 flags:2
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2/0 drbd1333 m1c10: uuid_compare()=unrelated-data by rule 100
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2/0 drbd1333: Unrelated data, aborting!
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: conn( Connecting -> Disconnecting )
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: error receiving P_STATE, e: -5 l: 0!
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: ack_receiver terminated
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Terminating ack_recv thread
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Restarting sender thread
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Connection closed
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: helper command: /sbin/drbdadm disconnected
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: helper command: /sbin/drbdadm disconnected exit code 0
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: conn( Disconnecting -> StandAlone )
[Tue Feb 2 15:55:43 2021] drbd pvc-44d7448c-588d-46ff-bd53-b451ae1511d2 m1c10: Terminating receiver thread
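For reference, the disconnect / invalidate / connect sequence mentioned above would look roughly like this (a sketch only; run on the node holding the replica to be discarded, here presumably m1c6 with the Case 2 resource; drbdadm invalidate discards the local data and requests a full resync from a peer):
root@m1c6:~# drbdadm disconnect pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
root@m1c6:~# drbdadm invalidate pvc-44d7448c-588d-46ff-bd53-b451ae1511d2
root@m1c6:~# drbdadm connect pvc-44d7448c-588d-46ff-bd53-b451ae1511d2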
Here is a reproducible case from another cluster:
# linstor v l -r one-vm-11306-disk-0
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m10c3 ┊ one-vm-11306-disk-0 ┊ thindata ┊ 0 ┊ 3442 ┊ /dev/drbd3442 ┊ 2.53 GiB ┊ InUse ┊ UpToDate ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor rd auto-place one-vm-11306-disk-0
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'one-vm-11306-disk-0' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'm14c22 (thindata)'
SUCCESS:
Added peer(s) 'm14c22' to resource 'one-vm-11306-disk-0' on 'm10c3'
SUCCESS:
Created resource 'one-vm-11306-disk-0' on 'm14c22'
ERROR:
Resource did not became ready on node 'm14c22' within reasonable time, check Satellite for errors.
command terminated with exit code 10
# linstor r l -r one-vm-11306-disk-0
╭─────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-11306-disk-0 ┊ m10c3 ┊ 56529 ┊ InUse ┊ StandAlone(m14c22) ┊ UpToDate ┊ 2021-01-19 20:52:56 ┊
┊ one-vm-11306-disk-0 ┊ m14c22 ┊ 56529 ┊ Unused ┊ StandAlone(m10c3) ┊ UpToDate ┊ 2021-02-03 11:10:27 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor v l -r one-vm-11306-disk-0
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m10c3 ┊ one-vm-11306-disk-0 ┊ thindata ┊ 0 ┊ 3442 ┊ /dev/drbd3442 ┊ 2.53 GiB ┊ InUse ┊ UpToDate ┊
┊ m14c22 ┊ one-vm-11306-disk-0 ┊ thindata ┊ 0 ┊ 3442 ┊ /dev/drbd3442 ┊ 1.54 MiB ┊ Unused ┊ UpToDate ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Starting sender thread (from drbdsetup [1212368])
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0/0 drbd3442: rs_discard_granularity changed to 524288
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: conn( StandAlone -> Unconnected )
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Starting receiver thread (from drbd_w_one-vm-1 [101402])
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: conn( Unconnected -> Connecting )
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Handshake to peer 0 successful: Agreed network protocol version 118
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Peer authenticated using 20 bytes HMAC
[Wed Feb 3 12:10:41 2021] drbd one-vm-11306-disk-0 m14c22: Starting ack_recv thread (from drbd_r_one-vm-1 [1212374])
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Preparing remote state change 120870361
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0/0 drbd3442 m14c22: drbd_sync_handshake:
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0/0 drbd3442 m14c22: self 5291AF7C0D3DB238:74FD0E2362D2B70A:0000000000000000:0000000000000000 bits:654936 flags:20
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0/0 drbd3442 m14c22: peer 184F688ABD7C2EA8:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1020
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0/0 drbd3442 m14c22: uuid_compare()=source-copy-other-bitmap by rule 72
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0/0 drbd3442 m14c22: Copying bitmap of peer node_id=3 (bitmap_index=1)
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Aborting remote state change 120870361
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: conn( Connecting -> Disconnecting )
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: ack_receiver terminated
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Terminating ack_recv thread
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Restarting sender thread
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Connection closed
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: helper command: /sbin/drbdadm disconnected
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: helper command: /sbin/drbdadm disconnected exit code 0
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: conn( Disconnecting -> StandAlone )
[Wed Feb 3 12:10:42 2021] drbd one-vm-11306-disk-0 m14c22: Terminating receiver thread
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0: Starting worker thread (from drbdsetup [232138])
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Starting sender thread (from drbdsetup [232145])
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: meta-data IO uses: blk-bio
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: rs_discard_granularity changed to 524288
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: disk( Diskless -> Attaching )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: Maximum number of peer devices = 7
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0: Method to ensure write ordering: flush
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: drbd_bm_resize called with capacity == 10491632
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: resync bitmap: bits=1311454 words=143444 pages=281
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: size = 5123 MB (5245816 KB)
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: recounting of set bits took additional 0ms
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: disk( Attaching -> UpToDate )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: attached to current UUID: 184F688ABD7C2EA8
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: size = 5123 MB (5245816 KB)
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: conn( StandAlone -> Unconnected )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Starting receiver thread (from drbd_w_one-vm-1 [232140])
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: conn( Unconnected -> Connecting )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Handshake to peer 1 successful: Agreed network protocol version 118
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Peer authenticated using 20 bytes HMAC
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Starting ack_recv thread (from drbd_r_one-vm-1 [232160])
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0: Preparing cluster-wide state change 120870361 (0->1 499/146)
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442 m10c3: drbd_sync_handshake:
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442 m10c3: self 184F688ABD7C2EA8:0000000000000000:74FD0E2362D2B70A:0000000000000000 bits:0 flags:20
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442 m10c3: peer 5291AF7C0D3DB238:74FD0E2362D2B70A:0000000000000000:0000000000000000 bits:654936 flags:1020
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442 m10c3: uuid_compare()=unrelated-data by rule 100
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0/0 drbd3442: Unrelated data, aborting!
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0: Aborting cluster-wide state change 120870361 (40ms) rv = -27
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: conn( Connecting -> Disconnecting )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: ack_receiver terminated
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Terminating ack_recv thread
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Restarting sender thread
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Connection closed
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: helper command: /sbin/drbdadm disconnected
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: helper command: /sbin/drbdadm disconnected exit code 0
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: conn( Disconnecting -> StandAlone )
[Wed Feb 3 12:10:28 2021] drbd one-vm-11306-disk-0 m10c3: Terminating receiver thread
drbd 9.0.27-1 everywhere
ref: https://lists.linbit.com/pipermail/drbd-user/2021-February/025818.html
This issue is starting to get annoying, as it is also reproducible on the production cluster:
# linstor r l -r one-vm-5629-disk-0
╭─────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-5629-disk-0 ┊ m11c4 ┊ 8343 ┊ InUse ┊ Ok ┊ Diskless ┊ 2021-02-23 14:21:27 ┊
┊ one-vm-5629-disk-0 ┊ m13c5 ┊ 8343 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2020-11-30 09:40:13 ┊
╰─────────────────────────────────────────────────────────────────────────────────────╯
# linstor rd auto-place one-vm-5629-disk-0
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'one-vm-5629-disk-0' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'm9c6 (thindata)'
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Added peer(s) 'm9c6' to resource 'one-vm-5629-disk-0' on 'm11c4'
SUCCESS:
Added peer(s) 'm9c6' to resource 'one-vm-5629-disk-0' on 'm13c5'
SUCCESS:
Created resource 'one-vm-5629-disk-0' on 'm9c6'
ERROR:
Resource did not became ready on node 'm9c6' within reasonable time, check Satellite for errors.
# linstor r l -r one-vm-5629-disk-0
╭───────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-5629-disk-0 ┊ m11c4 ┊ 8343 ┊ InUse ┊ StandAlone(m9c6) ┊ Diskless ┊ 2021-02-23 14:21:27 ┊
┊ one-vm-5629-disk-0 ┊ m13c5 ┊ 8343 ┊ Unused ┊ Connecting(m9c6) ┊ UpToDate ┊ 2020-11-30 09:40:13 ┊
┊ one-vm-5629-disk-0 ┊ m9c6 ┊ 8343 ┊ Unused ┊ Connecting(m13c5,m11c4) ┊ Outdated ┊ 2021-02-23 14:24:05 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@m11c4:~# drbdadm status one-vm-5629-disk-0
one-vm-5629-disk-0 role:Primary
disk:Diskless
m13c5 role:Secondary
peer-disk:UpToDate
m9c6 connection:StandAlone
root@m13c5:~# drbdadm status one-vm-5629-disk-0
one-vm-5629-disk-0 role:Secondary
disk:UpToDate
m11c4 role:Primary
peer-disk:Diskless
m9c6 connection:Connecting
root@m9c6:~# drbdadm status one-vm-5629-disk-0
one-vm-5629-disk-0 role:Secondary
disk:Outdated quorum:no
m11c4 connection:Connecting
m13c5 connection:Connecting
drbd v9.0.27 everywhere, satellite logs are clean:
14:24:04.248 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'one-vm-5629-disk-0' created for node 'm11c4'.
14:24:04.248 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'one-vm-5629-disk-0' created for node 'm13c5'.
14:24:04.248 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'one-vm-5629-disk-0' created for node 'm9c6'.
Error reports are missing
Ah, I just got ProtocolError:
root@m9c6:~# drbdadm status one-vm-5629-disk-0
one-vm-5629-disk-0 role:Secondary
disk:Outdated quorum:no
m11c4 connection:Connecting
m13c5 connection:ProtocolError
dmesg:
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: conn( Unconnected -> Connecting )
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: Handshake to peer 0 successful: Agreed network protocol version 118
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: Peer authenticated using 20 bytes HMAC
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: Starting ack_recv thread (from drbd_r_one-vm-5 [4064065])
[Tue Feb 23 15:36:17 2021] drbd one-vm-5629-disk-0 m13c5: Preparing remote state change 2132852538
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: drbd_sync_handshake:
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: self 48E1728DB89F7132:0000000000000000:4C7918E0BFE7C70E:C41276A86EAA38AE bits:20972149 flags:120
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: peer 3937C6E51D256162:C41276A86EAA38AE:48E1728DB89F7132:4C7918E0BFE7C70E bits:20975990 flags:1100
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: uuid_compare()=target-set-bitmap by rule 60
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: Setting and writing one bitmap slot, after drbd_sync_handshake
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: Committing remote state change 2132852538 (primary_nodes=2)
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342: quorum( no -> yes )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: bitmap overflow (e:20975989) while decoding bm RLE packet
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: error receiving P_COMPRESSED_BITMAP, e: -5 l: 7!
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: conn( Connected -> ProtocolError ) peer( Secondary -> Unknown )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342: quorum( yes -> no )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0/0 drbd2342 m13c5: pdsk( UpToDate -> DUnknown ) repl( WFBitMapT -> Off )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: ack_receiver terminated
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: Terminating ack_recv thread
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: Restarting sender thread
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: Connection closed
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: helper command: /sbin/drbdadm disconnected
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: helper command: /sbin/drbdadm disconnected exit code 0
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: conn( ProtocolError -> Unconnected )
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: Restarting receiver thread
[Tue Feb 23 15:36:18 2021] drbd one-vm-5629-disk-0 m13c5: conn( Unconnected -> Connecting )
Hi, any idea what happened here and how to debug this?
# linstor r l -r one-vm-11169-disk-2
╭────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-11169-disk-2 ┊ m10c44 ┊ 55304 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2021-01-28 22:50:48 ┊
┊ one-vm-11169-disk-2 ┊ m13c33 ┊ 55304 ┊ Unused ┊ Ok ┊ Diskless ┊ 2021-01-28 22:51:13 ┊
╰────────────────────────────────────────────────────────────────────────────────────────╯
# linstor rd auto-place one-vm-11169-disk-2
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'one-vm-11169-disk-2' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'm15c15 (thindata)'
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Added peer(s) 'm15c15' to resource 'one-vm-11169-disk-2' on 'm13c33'
SUCCESS:
Added peer(s) 'm15c15' to resource 'one-vm-11169-disk-2' on 'm10c44'
SUCCESS:
Created resource 'one-vm-11169-disk-2' on 'm15c15'
ERROR:
Resource did not became ready on node 'm15c15' within reasonable time, check Satellite for errors.
command terminated with exit code 10
# linstor r l -r one-vm-11169-disk-2
╭─────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-11169-disk-2 ┊ m10c44 ┊ 55304 ┊ InUse ┊ StandAlone(m15c15) ┊ UpToDate ┊ 2021-01-28 22:50:48 ┊
┊ one-vm-11169-disk-2 ┊ m13c33 ┊ 55304 ┊ Unused ┊ Ok ┊ Diskless ┊ 2021-01-28 22:51:13 ┊
┊ one-vm-11169-disk-2 ┊ m15c15 ┊ 55304 ┊ Unused ┊ StandAlone(m10c44) ┊ Outdated ┊ 2021-03-24 15:47:06 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
m10c44:
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Starting sender thread (from drbdsetup [3350253])
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205: rs_discard_granularity changed to 524288
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: conn( StandAlone -> Unconnected )
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Starting receiver thread (from drbd_w_one-vm-1 [2699111])
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: conn( Unconnected -> Connecting )
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Handshake to peer 1 successful: Agreed network protocol version 118
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Peer authenticated using 20 bytes HMAC
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Starting ack_recv thread (from drbd_r_one-vm-1 [3350259])
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m13c33: Preparing remote state change 3775243845
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m13c33: Committing remote state change 3775243845 (primary_nodes=1)
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2: Preparing cluster-wide state change 995215854 (0->1 499/145)
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: drbd_sync_handshake:
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: self B9C629DA9115404B:0000000000000000:0000000000000000:0000000000000000 bits:522941 flags:20
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: peer 8C146BA58DADE3B4:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1020
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: uuid_compare()=source-copy-other-bitmap by rule 72
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: Copying bitmap of peer node_id=3 (bitmap_index=1)
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2: Declined by peer m15c15 (id: 1), see the kernel log there
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2: Aborting cluster-wide state change 995215854 (32ms) rv = -10
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: conn( Connecting -> Disconnecting )
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: ack_receiver terminated
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Terminating ack_recv thread
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Restarting sender thread
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Connection closed
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: helper command: /sbin/drbdadm disconnected
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: helper command: /sbin/drbdadm disconnected exit code 0
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: conn( Disconnecting -> StandAlone )
[Wed Mar 24 16:45:29 2021] drbd one-vm-11169-disk-2 m15c15: Terminating receiver thread
m13c33:
[Wed Mar 24 16:48:03 2021] drbd one-vm-11169-disk-2 m15c15: Starting sender thread (from drbdsetup [1680333])
[Wed Mar 24 16:48:03 2021] drbd one-vm-11169-disk-2/0 drbd1205: Would lose quorum, but using tiebreaker logic to keep
[Wed Mar 24 16:48:03 2021] drbd one-vm-11169-disk-2 m15c15: conn( StandAlone -> Unconnected )
[Wed Mar 24 16:48:03 2021] drbd one-vm-11169-disk-2 m15c15: Starting receiver thread (from drbd_w_one-vm-1 [4074739])
[Wed Mar 24 16:48:03 2021] drbd one-vm-11169-disk-2 m15c15: conn( Unconnected -> Connecting )
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Handshake to peer 1 successful: Agreed network protocol version 118
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Peer authenticated using 20 bytes HMAC
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Starting ack_recv thread (from drbd_r_one-vm-1 [1680338])
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Preparing remote state change 3775243845
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: Committing remote state change 3775243845 (primary_nodes=1)
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m15c15: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2/0 drbd1205 m15c15: pdsk( UpToDate -> Outdated )
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m10c44: Preparing remote state change 995215854
[Wed Mar 24 16:48:04 2021] drbd one-vm-11169-disk-2 m10c44: Aborting remote state change 995215854
m15c15:
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2: Starting worker thread (from drbdsetup [816161])
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m10c44: Starting sender thread (from drbdsetup [816165])
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m13c33: Starting sender thread (from drbdsetup [816168])
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: meta-data IO uses: blk-bio
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: rs_discard_granularity changed to 524288
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: disk( Diskless -> Attaching )
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: Maximum number of peer devices = 7
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2: Method to ensure write ordering: flush
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: drbd_bm_resize called with capacity == 10491632
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: resync bitmap: bits=1311454 words=143444 pages=281
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: size = 5123 MB (5245816 KB)
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: recounting of set bits took additional 0ms
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: disk( Attaching -> UpToDate ) quorum( no -> yes )
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: attached to current UUID: 8C146BA58DADE3B4
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2/0 drbd1205: size = 5123 MB (5245816 KB)
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m10c44: conn( StandAlone -> Unconnected )
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m10c44: Starting receiver thread (from drbd_w_one-vm-1 [816162])
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m10c44: conn( Unconnected -> Connecting )
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m13c33: conn( StandAlone -> Unconnected )
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m13c33: Starting receiver thread (from drbd_w_one-vm-1 [816162])
[Wed Mar 24 16:47:34 2021] drbd one-vm-11169-disk-2 m13c33: conn( Unconnected -> Connecting )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Handshake to peer 0 successful: Agreed network protocol version 118
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Peer authenticated using 20 bytes HMAC
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Starting ack_recv thread (from drbd_r_one-vm-1 [816183])
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m13c33: Handshake to peer 2 successful: Agreed network protocol version 118
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m13c33: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m13c33: Peer authenticated using 20 bytes HMAC
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m13c33: Starting ack_recv thread (from drbd_r_one-vm-1 [816184])
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2: Preparing cluster-wide state change 3775243845 (1->2 499/146)
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2: State change 3775243845: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2: Committing cluster-wide state change 3775243845 (36ms)
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m13c33: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205: disk( UpToDate -> Outdated )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205 m13c33: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Preparing remote state change 995215854
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205 m10c44: drbd_sync_handshake:
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205 m10c44: self 8C146BA58DADE3B4:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:20
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205 m10c44: peer B9C629DA9115404B:0000000000000000:0000000000000000:0000000000000000 bits:522941 flags:1020
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205 m10c44: uuid_compare()=unrelated-data by rule 100
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2/0 drbd1205: Unrelated data, aborting!
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Aborting remote state change 995215854
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: conn( Connecting -> Disconnecting )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: ack_receiver terminated
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Terminating ack_recv thread
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Restarting sender thread
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Connection closed
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: helper command: /sbin/drbdadm disconnected
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: helper command: /sbin/drbdadm disconnected exit code 0
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: conn( Disconnecting -> StandAlone )
[Wed Mar 24 16:47:35 2021] drbd one-vm-11169-disk-2 m10c44: Terminating receiver thread
Hello, sorry for the late response! A few questions:
- is this reproducible?
- Do I correctly assume the storage pool is LVM thin?
- can you show me rd lp one-vm-11169-disk-2?
- what is with the times? I know that such times are never fully synchronous, but you have multiple minutes between them..
- drbdadm --version of all machines
No problem, I understand you might have more important tasks than free community support 😉
- It is reproducible for the specific resources, we have many of them
- yes, we use thin LVM and diskless replicas
-
╭──────────────────────────────────────╮
┊ Key                         ┊ Value  ┊
╞══════════════════════════════════════╡
┊ Aux/one/DISK_ID             ┊ 2      ┊
┊ Aux/one/DS_ID               ┊ 200    ┊
┊ Aux/one/VM_ID               ┊ 11169  ┊
┊ DrbdOptions/Resource/quorum ┊ off    ┊
┊ DrbdPrimarySetOn            ┊ M10C44 ┊
╰──────────────────────────────────────╯
- servers are running the chrony service, so time is synchronized with the NTP server and does not differ by more than a few milliseconds.
- the node has:
DRBDADM_BUILDTAG=GIT-hash:\ fa9b9d3823b6e1792919e711fcf6164cac629290\ build\ by\ buildd@lcy01-amd64-017\,\ 2020-11-05\ 14:09:53
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09001b
DRBD_KERNEL_VERSION=9.0.27
DRBDADM_VERSION_CODE=0x090f01
DRBDADM_VERSION=9.15.1
but linstor-satellite is containerized and uses its own drbdadm version from the container:
DRBDADM_BUILDTAG=GIT-hash:\ a7820b3c14497a34f955ba5ce56cf1bc9d2d353e\ build\ by\ @kaniko-a1e4rg\,\ 2021-02-11\ 14:59:25
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09001b
DRBD_KERNEL_VERSION=9.0.27
DRBDADM_VERSION_CODE=0x091000
DRBDADM_VERSION=9.16.0
Okay...
- In that case I guess we need to figure out what the difference is between those "broken" resources and the "good" resources.. Any ideas? Ideal would be if we can reproduce those circumstances so I can test this locally...
- ok
- Can you also show me the props from the volume-definition(s) of that RD?
- Well then you might have posted wrong / uncorrelated dmesg snippets?
- thanks
-
# linstor vd lp one-vm-11169-disk-2 0
╭───────────────────────────────────────────────────────────────╮
┊ Key                                        ┊ Value            ┊
╞═══════════════════════════════════════════════════════════════╡
┊ DrbdCurrentGi                              ┊ 8C146BA58DADE3B4 ┊
┊ DrbdOptions/Disk/discard-zeroes-if-aligned ┊ yes              ┊
┊ DrbdOptions/Disk/rs-discard-granularity    ┊ 65536            ┊
╰───────────────────────────────────────────────────────────────╯
- this is an interesting note; as I can see, the time shown by dmesg -T is slightly different from the date output. I'll try to investigate this more...
UPD: it seems this is a normal situation for dmesg: https://serverfault.com/a/715975/205043. I will use journalctl -kf instead:
# linstor r l -r one-vm-11169-disk-2
╭─────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-11169-disk-2 ┊ m10c44 ┊ 55304 ┊ InUse ┊ StandAlone(m11c22) ┊ UpToDate ┊ 2021-01-28 22:50:48 ┊
┊ one-vm-11169-disk-2 ┊ m11c22 ┊ 55304 ┊ Unused ┊ StandAlone(m10c44) ┊ Outdated ┊ 2021-03-29 13:44:47 ┊
┊ one-vm-11169-disk-2 ┊ m13c33 ┊ 55304 ┊ Unused ┊ Ok ┊ Diskless ┊ 2021-01-28 22:51:13 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
m10c44
Mar 29 15:48:17 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Starting sender thread (from drbdsetup [540613])
Mar 29 15:48:17 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205: rs_discard_granularity changed to 524288
Mar 29 15:48:17 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: conn( StandAlone -> Unconnected )
Mar 29 15:48:17 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Starting receiver thread (from drbd_w_one-vm-1 [2699111])
Mar 29 15:48:17 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: conn( Unconnected -> Connecting )
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Handshake to peer 1 successful: Agreed network protocol version 118
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Peer authenticated using 20 bytes HMAC
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Starting ack_recv thread (from drbd_r_one-vm-1 [540666])
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2: Preparing cluster-wide state change 1569939058 (0->1 499/145)
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: drbd_sync_handshake:
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: self B9C629DA9115404B:0000000000000000:0000000000000000:0000000000000000 bits:528500 flags:20
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: peer 8C146BA58DADE3B4:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1020
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: uuid_compare()=source-copy-other-bitmap by rule 72
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: Copying bitmap of peer node_id=3 (bitmap_index=1)
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2: Declined by peer m11c22 (id: 1), see the kernel log there
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2: Aborting cluster-wide state change 1569939058 (44ms) rv = -10
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: conn( Connecting -> Disconnecting )
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: ack_receiver terminated
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Terminating ack_recv thread
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m13c33: Preparing remote state change 3997385834
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m13c33: Committing remote state change 3997385834 (primary_nodes=1)
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Restarting sender thread
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Connection closed
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: helper command: /sbin/drbdadm disconnected
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: helper command: /sbin/drbdadm disconnected exit code 0
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: conn( Disconnecting -> StandAlone )
Mar 29 15:48:18 m10c44 kernel: drbd one-vm-11169-disk-2 m11c22: Terminating receiver thread
m11c22
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2: Starting worker thread (from drbdsetup [1475932])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Starting sender thread (from drbdsetup [1475938])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Starting sender thread (from drbdsetup [1475940])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: meta-data IO uses: blk-bio
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: rs_discard_granularity changed to 524288
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: disk( Diskless -> Attaching )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: Maximum number of peer devices = 7
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2: Method to ensure write ordering: flush
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: drbd_bm_resize called with capacity == 10491632
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: resync bitmap: bits=1311454 words=143444 pages=281
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: size = 5123 MB (5245816 KB)
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: recounting of set bits took additional 0ms
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: disk( Attaching -> UpToDate ) quorum( no -> yes )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: attached to current UUID: 8C146BA58DADE3B4
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: size = 5123 MB (5245816 KB)
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: conn( StandAlone -> Unconnected )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Starting receiver thread (from drbd_w_one-vm-1 [1475933])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: conn( Unconnected -> Connecting )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: conn( StandAlone -> Unconnected )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Starting receiver thread (from drbd_w_one-vm-1 [1475933])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: conn( Unconnected -> Connecting )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Handshake to peer 0 successful: Agreed network protocol version 118
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Peer authenticated using 20 bytes HMAC
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Starting ack_recv thread (from drbd_r_one-vm-1 [1476011])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Handshake to peer 2 successful: Agreed network protocol version 118
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Peer authenticated using 20 bytes HMAC
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: Starting ack_recv thread (from drbd_r_one-vm-1 [1476013])
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Preparing remote state change 1569939058
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m10c44: drbd_sync_handshake:
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m10c44: self 8C146BA58DADE3B4:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:20
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m10c44: peer B9C629DA9115404B:0000000000000000:0000000000000000:0000000000000000 bits:528500 flags:1020
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m10c44: uuid_compare()=unrelated-data by rule 100
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: Unrelated data, aborting!
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Aborting remote state change 1569939058
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: conn( Connecting -> Disconnecting )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: ack_receiver terminated
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Terminating ack_recv thread
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2: Preparing cluster-wide state change 3997385834 (1->2 499/146)
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2: State change 3997385834: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFF8
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2: Committing cluster-wide state change 3997385834 (40ms)
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m13c33: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205: disk( UpToDate -> Outdated )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m13c33: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Restarting sender thread
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Connection closed
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: helper command: /sbin/drbdadm disconnected
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: helper command: /sbin/drbdadm disconnected exit code 0
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: conn( Disconnecting -> StandAlone )
Mar 29 15:48:18 m11c22 kernel: drbd one-vm-11169-disk-2 m10c44: Terminating receiver thread
m13c33
Mar 29 15:48:17 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Starting sender thread (from drbdsetup [425141])
Mar 29 15:48:17 m13c33 kernel: drbd one-vm-11169-disk-2/0 drbd1205: Would lose quorum, but using tiebreaker logic to keep
Mar 29 15:48:17 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: conn( StandAlone -> Unconnected )
Mar 29 15:48:17 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Starting receiver thread (from drbd_w_one-vm-1 [4074739])
Mar 29 15:48:17 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: conn( Unconnected -> Connecting )
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Handshake to peer 1 successful: Agreed network protocol version 118
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Peer authenticated using 20 bytes HMAC
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Starting ack_recv thread (from drbd_r_one-vm-1 [425155])
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m10c44: Preparing remote state change 1569939058
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m10c44: Aborting remote state change 1569939058
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Preparing remote state change 3997385834
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: Committing remote state change 3997385834 (primary_nodes=1)
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2 m11c22: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Mar 29 15:48:18 m13c33 kernel: drbd one-vm-11169-disk-2/0 drbd1205 m11c22: pdsk( UpToDate -> Outdated )
Mainly I am confused by this message: Unrelated data, aborting! Could you clarify in which cases it can occur?
I found a possible solution here, but it is a good question why the metadata can get corrupted and how to verify that: https://support.sciencelogic.com/s/article/3518
That is a very good question, which is also why I was asking for vd lp ...
A bit of theory: LINSTOR initially sets a GI (generation ID) in DRBD. If that peer modifies the data, a new GI is generated (not for every write request obviously, but in a cleverer way; explaining that would go into too much detail). When LINSTOR creates a new peer, it sets the same original GI on the new peer, so that the first DRBD peer can find it in its own history of GIs and calculate a partial sync (instead of a full sync, which would be quite annoying, especially with thin pools).
However, in your case the first peer has a completely different GI and additionally also no history GIs. I am not quite sure how you got into this situation...
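Presumably this can be checked roughly like this (an illustration, not an official procedure): the GI that LINSTOR originally set is stored in the volume-definition properties, and the GIs DRBD actually negotiated show up in the drbd_sync_handshake lines of the kernel log (self/peer, formatted as current:bitmap:history1:history2), so the two can be compared:
# linstor vd lp one-vm-11169-disk-2 0   (look for the DrbdCurrentGi entry)
root@m10c44:~# dmesg -T | grep -A 3 drbd_sync_handshake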
Anything special with those resources? Restored from a snapshot, or did you already have troubles with those resources (node crash, had to re-setup things, ...)? As I said, I would like to find out what is different about these resources compared to the others, where everything seems to be working quite well.
Thank you for the detailed explanation!
your case the first peer has a completely different GI and additionally also no history GIs
May I ask how you figured that out?
Anything special with those resources? Restored from snapshot, or did you already have troubles with those resources (node crash, had to re-setup things, ...)?
Well, we had some problems with a broken bitmap on older DRBD versions, especially on v9.0.19. Some resources were not fully synchronized but were marked as UpToDate. We were forced to update the kernel module ASAP and to run
dd if=/dev/drbd1461 of=/dev/drbd1461 status=progress bs=65536 conv=notrunc,sparse iflag=direct oflag=direct
for each of them to avoid such situations.
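A minimal sketch of how such a rewrite could be applied to each affected device (the device list here is just an example):
for dev in /dev/drbd1461 /dev/drbd1462; do
    # rewrite every block in place so DRBD replicates the real data to the peers again
    dd if="$dev" of="$dev" status=progress bs=65536 conv=notrunc,sparse iflag=direct oflag=direct
done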
But I have a few other resources with the same problem that were created after all our nodes had already been updated to 9.0.27, e.g. the last one:
# linstor r l -r one-vm-11374-disk-0
╭─────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-11374-disk-0 ┊ m10c45 ┊ 57066 ┊ Unused ┊ StandAlone(m13c8) ┊ UpToDate ┊ 2021-01-31 14:53:47 ┊
┊ one-vm-11374-disk-0 ┊ m11c30 ┊ 57066 ┊ InUse ┊ Ok ┊ Diskless ┊ 2021-02-03 08:50:33 ┊
┊ one-vm-11374-disk-0 ┊ m13c8 ┊ 57066 ┊ Unused ┊ StandAlone(m10c45) ┊ Outdated ┊ 2021-02-03 09:41:24 ┊
┊ one-vm-11374-disk-0 ┊ m14c12 ┊ 57066 ┊ Unused ┊ Ok ┊ Diskless ┊ 2021-01-31 14:53:54 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────╯
At 2021-01-31 14:53:47 we had drbd v9.0.27 and linstor v1.11.0 installed on all our nodes. On February 23 we updated linstor to v1.11.1.
I can also add that our broken-bitmap case could be solved with the following workaround: for the new replica, without waiting, run:
drbdadm down
drbdadm invalidate
drbdadm up
which triggers a full resync and forces copying all the data from the old diskful replica to the new one.
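For reference, a minimal sketch of that sequence with the resource name spelled out and each step commented (<resource> is a placeholder):
# run on the node holding the new, suspect replica
drbdadm down <resource>          # take the resource down locally
drbdadm invalidate <resource>    # mark the local data as Inconsistent
drbdadm up <resource>            # bring it back up; DRBD then does a full resync from the UpToDate peer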
In the current case I see the same error, and the resources become StandAlone:
[6396089.725914] drbd one-vm-11169-disk-2/0 drbd1205 m10c44: uuid_compare()=unrelated-data by rule 100
[6396089.725916] drbd one-vm-11169-disk-2/0 drbd1205: Unrelated data, aborting!
Can you please upgrade to the newest DRBD version? When I tried to reproduce your issue, we discovered a bug that caused such an unrelated-data. That should be fixed with the newest DRBD version.
@ghernadi thank you, I'll take a look. Could you please clarify two points:
- The bug was in v9.0.19 and should be fixed in v9.1.2?
- Will upgrading the DRBD kernel module fix the broken volumes, and if not, how do we fix them?
- As far as I remember, the bug was in v9.0.27. It should be fixed in v9.1.2 as well as in v9.0.29.
- Yes, that should fix the broken volumes.
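A quick way to double-check the running versions after such an upgrade (standard drbd-utils / procfs interfaces):
# version of the loaded DRBD kernel module
cat /proc/drbd
# userspace (drbd-utils) and kernel module versions as drbdadm reports them
drbdadm --version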
Hello @ghernadi @kvaps. I think I am having the same issue. I have two nodes, and when I create a second replica it always causes both replicas to end up in StandAlone status.
My nodes have these DRBD versions:
DRBDADM_BUILDTAG=GIT-hash:\ bd7a08c4ec9cad113c4e5ad448a15c8900a67b68\ build\ by\ @buildsystem\,\ 2021-08-02\ 11:21:41
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09001d
DRBD_KERNEL_VERSION=9.0.29
DRBDADM_VERSION_CODE=0x091202
DRBDADM_VERSION=9.18.2
and
DRBDADM_BUILDTAG=GIT-hash:\ bd7a08c4ec9cad113c4e5ad448a15c8900a67b68\ build\ by\ @buildsystem\,\ 2021-08-02\ 11:21:41
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x09001e
DRBD_KERNEL_VERSION=9.0.30
DRBDADM_VERSION_CODE=0x091202
DRBDADM_VERSION=9.18.2
So these versions should not have the bug, right? Could this version difference be causing it?
Also tried
drbdadm down
drbdadm invalidate
drbdadm up
on the second replica. It just makes it go to Inconsistent.
There is no LINSTOR error report; the replica just does not become ready in time during creation. Any suggestions on what I can do to debug this?
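For what it is worth, a minimal sketch of the data points that turned out to be useful earlier in this thread (<resource> is a placeholder):
# look for DRBD handshake messages such as "Unrelated data, aborting!"
dmesg | grep -i drbd | tail -n 50
# compare the GI history on both replicas
drbdadm get-gi <resource>
# the GI that LINSTOR set at creation time
linstor vd lp <resource> 0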
Same issue with: drbd v9.2.1 and linstor v1.20.3
I have one replica on master0; when I create the second one on node0, it becomes StandAlone:
linstor r l:
| pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | master0 | 7114 | Unused | StandAlone(node0) | UpToDate | 2023-02-14 02:58:45 |
| pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | node0 | 7114 | Unused | StandAlone(master0) | UpToDate | 2023-02-15 09:54:01 |
linstor v l:
| master0 | pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | store | 0 | 1114 | /dev/drbd1114 | 4.88 GiB | Unused | UpToDate |
| node0 | pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | store | 0 | 1114 | /dev/drbd1114 | 2.05 MiB | Unused | UpToDate |
kernel log:
[86502.664301] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: drbd_sync_handshake:
[86502.664307] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: self 71C46F244B98AD50:0000000000000000:9074A21F98EFA566:9216DCDD10088160 bits:0 flags:20
[86502.664311] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: peer 2A4FF6268079D3F4:9216DCDD10088160:A3B9CD8117548DDA:A21A3758EE513654 bits:1273745 flags:1020
[86502.664315] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: uuid_compare()=unrelated-data by rule=history-both
[86502.664317] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: Unrelated data, aborting!
The drbdadm invalidate also doesn't help
driver: LVMThin
Just found a workaround to fix this resource:
on the node which contains the most consistent data (in my case master0):
drbdadm get-gi pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd
2A4FF6268079D3F4:9216DCDD10088160:A3B9CD8117548DDA:A21A3758EE513654:1:1:0:0:0:0:0:0:0:0:0:0
then delete failed replicas, and do:
# save old DrbdCurrentGi somewhere (just in case)
linstor vd lp pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd 0
# set new DrbdCurrentGi
linstor vd sp pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd 0 DrbdCurrentGi 2A4FF6268079D3F4
Create new replicas; now the sync should go well. After the resync, check the allocated size of the volumes:
# linstor v l -r pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd
+------------------------------------------------------------------------------------------------------------------------------------+
| Node | Resource | StoragePool | VolNr | MinorNr | DeviceName | Allocated | InUse | State |
|====================================================================================================================================|
| master0 | pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | store | 0 | 1114 | /dev/drbd1114 | 4.88 GiB | Unused | UpToDate |
| node0 | pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd | store | 0 | 1114 | /dev/drbd1114 | 4.71 GiB | Unused | UpToDate |
+------------------------------------------------------------------------------------------------------------------------------------+
This is indeed quite strange. New resources should have almost only 0s in their GID history:
[86502.664307] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: self 71C46F244B98AD50:0000000000000000:9074A21F98EFA566:9216DCDD10088160 bits:0 flags:20
[86502.664311] drbd pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd/0 drbd1114 master0: peer 2A4FF6268079D3F4:9216DCDD10088160:A3B9CD8117548DDA:A21A3758EE513654 bits:1273745 flags:1020
master0 (in the logs master0: self ...) does have reasons to have a history of GIDs, since the resource was there for quite some time (that is what I assume at least).
node0 on the other hand (in the logs master0: peer ...) should only have the GID that Linstor set initially (the original value of the property DrbdCurrentGi), and the rest of the entries should have been 0s.
# save old DrbdCurrentGi somewhere (just in case)
Can you confirm that 9216DCDD10088160 was the previously set DrbdCurrentGi ?
Anyways, I would not have expected GIDs on node0's side. Can you please collect all logs you find regarding this resource? Ideally not just from these two nodes, but from all nodes where the resource was deployed.
I do not fully know how, but if we can find traces that, for example, node0 never deleted the actual LV, but Linstor just somehow forgot the resource is there (i.e. with node lost), and now "found" the once orphaned LV again... that would at least explain this behavior. The next question would be how Linstor forgot about the resource. But let's take it one step at a time.
If this guess ^ is correct, it should also work (next time this happens) if you do not change/alter the DrbdCurrentGi but simply re-create the wrongly created resource. The thinking behind this is that by deleting the wrong resource Linstor should really delete the LV (which you could double-check). The point is that during a node lost command the LV cannot be removed properly, since the satellite has to be offline (otherwise Linstor rejects the command); Linstor assumes that the satellite is already dead, so nothing cleans up the LV and other left-overs.
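A minimal sketch of how such a left-over could be double-checked on the suspect node (the volume group name vg_store is only a guess; use the VG backing the 'store' storage pool):
# LVs that still exist locally for this resource
lvs --noheadings -o lv_name,lv_size,pool_lv vg_store | grep pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd
# what LINSTOR believes is deployed on that node
linstor r l -n node0 -r pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd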
Can you confirm that 9216DCDD10088160 was the previously set DrbdCurrentGi ?
Hi @ghernadi, previously it was:
# linstor vd lp pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd 0
+---------------------------------------------------------------+
| Key | Value |
|===============================================================|
| DrbdCurrentGi | 71C46F244B98AD50 |
| DrbdOptions/Disk/discard-zeroes-if-aligned | yes |
| DrbdOptions/Disk/rs-discard-granularity | 262144 |
+---------------------------------------------------------------+
BTW, I have a DRBD metadata dump from the original device, in case it helps to investigate the problem:
root@master0:/# drbdadm apply-al pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd
root@master0:/# drbdadm dump-md pvc-bfe182ba-68ab-4586-b274-3d4c324c8ebd
output: drbd-metadata.txt
Anyways, I would not have expected GIDs on node0's side. Can you please collect all logs you find regarding this resource? Ideally not just from these two nodes, but from all nodes where the resource was deployed.
I do not fully know how, but if we can find traces that for example node0 never deleted the actual LV, but Linstor just somehow forgot the resource is there (i.e. with node lost), and now "found" the once orphaned LV again.... that would at least explain this behavior. The next question would be how Linstor forgot about the resource. But lets make one step after the other.
If this guess ^ is correct, it should also work (next time this happens), if you do not change / alter the DrbdCurrentGi, but simply re-create the wrongly created resource. The thinking behind this is that by deleting the wrong resources Linstor should really delete the LV (which you could double-check). The point is, that during a node lost command the LV cannot be removed properly since the satellite has to be offline (otherwise Linstor rejects the command), so Linstor assumed that the satellite is already dead -> nothing cleaning up the LV and other left-overs.
That is indeed what happened:
Today, when I was removing replicas, I saw No active connection to satellite again (see: https://github.com/LINBIT/linstor-server/issues/331).
But I was able to remove them (see: https://github.com/LINBIT/linstor-server/issues/330).
So I was forced to log in on the node and remove the LV manually; then I restarted all the satellite and controller services.
But the problem had already occurred earlier, when node eviction was invoked.
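As a side note, a minimal sketch for inspecting the auto-eviction settings that may have kicked in here (property names can differ between LINSTOR versions, so treat this as an assumption):
# controller-wide auto-eviction related properties
linstor controller list-properties | grep -i evict
# per-node overrides, if any
linstor node list-properties node1 | grep -i evict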
Kernel log from nodes:
You can check master0 and node1; today (Feb 15 10:02) I did these simple actions with it:
- I had a single replica on master0
- I created a second diskful replica on node1
- Got the split-brain due to Unrelated data
- Removed the replica from node1
@ghernadi Today I fixed a few more resources using the workaround above. So I am just wondering: shouldn't linstor update DrbdCurrentGi when going from 1 to 2 replicas?
Same here:
# linstor v l -r pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ worker-0 ┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ thindata ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ 1.56 GiB ┊ Unused ┊ UpToDate ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor rd ap --place-count 2 pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
Resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' successfully autoplaced on 1 nodes
Details:
Used nodes (storage pool name): 'worker-1 (thindata)'
SUCCESS:
Successfully set property key(s): StorPoolName
INFO:
Tie breaker resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' created on DfltDisklessStorPool
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'suspend-io' by auto-quorum
SUCCESS:
Added peer(s) 'worker-1' to resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'system-0'
SUCCESS:
Added peer(s) 'worker-1' to resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'worker-0'
SUCCESS:
Created resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'worker-1'
SUCCESS:
Description:
Resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'worker-1' ready
Details:
Auto-placing resource: pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
SUCCESS:
Created resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'system-0'
SUCCESS:
Added peer(s) 'system-0' to resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'worker-0'
SUCCESS:
Added peer(s) 'system-0' to resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'worker-1'
SUCCESS:
Description:
Resource 'pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618' on 'system-0' ready
Details:
Auto-placing resource: pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
# linstor v l -r pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ system-0 ┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ DfltDisklessStorPool ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ ┊ Unused ┊ TieBreaker ┊
┊ worker-0 ┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ thindata ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ 1.56 GiB ┊ Unused ┊ UpToDate ┊
┊ worker-1 ┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ thindata ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ 1.02 MiB ┊ Unused ┊ UpToDate ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l -r pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ system-0 ┊ 7002 ┊ Unused ┊ Connecting(worker-0) ┊ TieBreaker ┊ 2023-05-10 16:08:10 ┊
┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ worker-0 ┊ 7002 ┊ InUse ┊ Connecting(worker-1,system-0) ┊ UpToDate ┊ 2023-05-10 14:40:45 ┊
┊ pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 ┊ worker-1 ┊ 7002 ┊ Unused ┊ Connecting(worker-0) ┊ UpToDate ┊ 2023-05-10 16:08:11 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@worker-0:/# drbdadm get-gi pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
7A27F06E04392EFE:CEB45FDF8F4207B8:3E8EDE8BE26A9AA8:90E710EE0B3F6430:1:1:1:1:0:0:0:0:0:0:0:0
7A27F06E04392EFE:CEB45FDF8F4207B8:3E8EDE8BE26A9AA8:90E710EE0B3F6430:1:1:1:1:0:0:0:0:0:0:0:0
root@worker-1:/# drbdadm get-gi pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618
2C72807F6D509092:0000000000000000:0000000000000000:0000000000000000:1:1:0:0:0:0:0:0:0:0:0:0
2C72807F6D509092:0000000000000000:0000000000000000:0000000000000000:1:1:0:0:0:0:0:0:0:0:0:0
# linstor vd lp pvc-0da20f46-1f08-4125-8ecc-b3f5e736e618 0
╭───────────────────────────────────────────────────────────────╮
┊ Key ┊ Value ┊
╞═══════════════════════════════════════════════════════════════╡
┊ DrbdCurrentGi ┊ 2C72807F6D509093 ┊
┊ DrbdOptions/Disk/discard-zeroes-if-aligned ┊ yes ┊
┊ DrbdOptions/Disk/rs-discard-granularity ┊ 65536 ┊
╰───────────────────────────────────────────────────────────────╯
The DRBD documentation says:
Note that the current implementation uses the lowest bit to encode the role of the node (Primary/Secondary). Therefore, the lowest bit might be different on distinct nodes even if they are considered to have the same data generation.
Isn't that the case here?
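For illustration, a minimal sketch showing that the stored DrbdCurrentGi and the GI reported on worker-1 match once that lowest (role) bit is masked off:
# 0x...93 and 0x...92 differ only in bit 0, the role bit
printf '%X\n' $(( 0x2C72807F6D509093 & ~1 ))   # -> 2C72807F6D509092
printf '%X\n' $(( 0x2C72807F6D509092 & ~1 ))   # -> 2C72807F6D509092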
Ahh, the last report was wrong due to missing connectivity between the nodes; I was confused by the UpToDate state on both of them.