
Bug in drbd 9.1.5 on CentOS 7

Open · izyk opened this issue · 8 comments

Hello. I'm not sure it's a DRBD problem, but after I upgraded the packages kmod-drbd90-9.1.4 -> kmod-drbd90-9.1.5 from elrepo, I get an error in my messages on an md raid.

My block stack is: mdraid -> lvm -> drbd -> vdo -> lvm

I only have trouble with raid devices that use chunks (usually 512K size), i.e. raid0 and raid10. With raid1 there is no problem. Please, could you give me a hint where the error could be?

Feb 11 02:48:58 arh kernel: md/raid10:md124: make_request bug: can't convert block across chunks or bigger than 512k 2755544 32
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: disk( UpToDate -> Failed )
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: Local IO failed in drbd_request_endio. Detaching...
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: local READ IO error sector 2752472+64 on dm-3
Feb 11 02:48:58 arh kernel: drbd r1/0 drbd2: sending new current UUID: 3E82544B6FC832F1
Feb 11 02:48:59 arh kernel: drbd r1/0 drbd2: disk( Failed -> Diskless )
Feb 11 02:48:59 arh kernel: drbd r1/0 drbd2: Should have called drbd_al_complete_io(, 4294724168, 4096), but my Disk seems to have failed :(

After this, the primary works in diskless mode. If the primary is on raid1, everything works normally and the secondary stays UpToDate, even if the secondary is on raid0.

drbd90-utils-9.19.1-1.el7.elrepo.x86_64
kmod-drbd90-9.1.5-1.el7_9.elrepo.x86_64

I haven't tried reverting to kmod 9.1.4 yet, but with the previous kernel and 9.1.5 I get the same.
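In case it helps narrow this down: the raid10 message itself says the request crossed a chunk boundary or exceeded 512k, so it may be worth comparing the array's chunk size with the request size limits advertised on each layer of the stack. A rough check (device names taken from the log above; the sysfs paths are the standard block-layer attributes):

# md chunk size in bytes (524288 = 512K)
cat /sys/block/md124/md/chunk_size
# maximum request sizes allowed on the array, on the backing LV, and on the DRBD device
cat /sys/block/md124/queue/max_sectors_kb /sys/block/md124/queue/max_hw_sectors_kb
cat /sys/block/dm-3/queue/max_sectors_kb
cat /sys/block/drbd2/queue/max_sectors_kb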

izyk commented Feb 11 '22

With 9.1.4 everything is fine, as before: "Online verify done" without errors.

uname -a
Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

modinfo drbd
filename:       /lib/modules/3.10.0-1160.49.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias:          block-major-147-*
license:        GPL
version:        9.1.4
description:    drbd - Distributed Replicated Block Device v9.1.4
author:         Philipp Reisner [email protected], Lars Ellenberg [email protected]
retpoline:      Y
rhelversion:    7.9
srcversion:     DC4A3A79803F1566C1F7ABE
depends:        libcrc32c
vermagic:       3.10.0-1160.el7.x86_64 SMP mod_unload modversions
signer:         The ELRepo Project (http://elrepo.org): ELRepo.org Secure Boot Key
sig_key:        F3:65:AD:34:81:A7:B2:0E:34:27:B6:1B:2A:26:63:5B:83:FE:42:7B
sig_hashalgo:   sha256
parm:           enable_faults:int
parm:           fault_rate:int
parm:           fault_count:int
parm:           fault_devs:int
parm:           disable_sendpage:bool
parm:           allow_oos:DONT USE! (bool)
parm:           minor_count:Approximate number of drbd devices (1-255) (uint)
parm:           usermode_helper:string
parm:           protocol_version_min:drbd_protocol_version
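For anyone who needs to stay on 9.1.4 for now, roughly how the downgrade and pinning can be done with yum (a sketch; it assumes the older build is still available in the elrepo repository, otherwise the rpm has to be fetched from the elrepo archive, and versionlock is optional):

# go back to the working kmod
yum downgrade kmod-drbd90-9.1.4
# keep yum from pulling 9.1.5 back in until this is resolved
yum install yum-plugin-versionlock
yum versionlock add kmod-drbd90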

izyk commented Feb 14 '22

I can confirm this problem with kmod-drbd90-9.1.5-1.el7_9.elrepo.x86_64 on both kernel-3.10.0-1160.53.1 and kernel-3.10.0-1160.49.1, and also that it occurs only for raid10 mdraid devices. The previous 9.1.4 kmod works with both of those kernels plus the latest kernel-3.10.0-1160.59.1. No errors are logged for the raid10 device when using the 9.1.4 kmod, only when using the 9.1.5 version.

Jaybus2 commented Mar 08 '22

Does this issue also occur with 9.1.6? Does it occur when you build DRBD from a release tarball instead of using the elrepo packages?
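For reference, a rough sketch of building the module from a release tarball on CentOS 7 (it assumes the tarball has already been downloaded and that kernel-devel matches the running kernel; exact make targets may vary slightly between releases):

yum install -y gcc make kernel-devel-$(uname -r)
tar xzf drbd-9.1.6.tar.gz && cd drbd-9.1.6
make KDIR=/lib/modules/$(uname -r)/build       # build against the running kernel
make install && depmod -a
# with all DRBD resources down, swap the loaded module for the freshly built one
modprobe -r drbd_transport_tcp drbd; modprobe drbd
modinfo drbd | grep -E '^(filename|version)'   # confirm which drbd.ko is actually loaded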

JoelColledge commented Apr 07 '22

TL;DR: DRBD 9.1.7 is a no-go on CentOS 7 when using raid10 md arrays for the underlying DRBD disk.

Just want to state that I have the exact same issue. My cluster failed after upgrading from 9.0.x (I don't know the exact version) to 9.1.7, running on the 3.10.0-1160.66.1.el7.x86_64 kernel. I get this in syslog when using a raid10 md array for DRBD (it works with raid1 or linear, and also works when running directly on the disks, i.e. no md raid):

[ 1996.269915] drbd storage: Committing cluster-wide state change 2875711901 (0ms)
[ 1996.269930] drbd storage: role( Secondary -> Primary )
[ 1996.269933] drbd storage/0 drbd0: disk( Inconsistent -> UpToDate )
[ 1996.270004] drbd storage/0 drbd0: size = 32 GB (33532892 KB)
[ 1996.286479] drbd storage: Forced to consider local data as UpToDate!
[ 1996.288057] drbd storage/0 drbd0: new current UUID: 2BDAB23564612AE9 weak: FFFFFFFFFFFFFFFD
[ 2010.790278] md/raid10:md200: make_request bug: can't convert block across chunks or bigger than 512k 33530432 256
[ 2010.790307] drbd storage/0 drbd0: disk( UpToDate -> Failed )
[ 2010.790359] drbd storage/0 drbd0: Local IO failed in drbd_request_endio. Detaching...
[ 2010.790455] drbd storage/0 drbd0: local WRITE IO error sector 33530432+512 on md200
[ 2010.792848] drbd storage/0 drbd0: disk( Failed -> Diskless )
[ 2277.261791] drbd storage: Preparing cluster-wide state change 391032618 (1->-1 3/2)

I also tried compiling it from sources, same issue.

I tried on Rocky Linux 8 and it works like a charm. I could not find a way to sign the kernel module on Rocky 8, so I disabled UEFI Secure Boot signing as described here: https://askubuntu.com/questions/762254/why-do-i-get-required-key-not-available-when-install-3rd-party-kernel-modules

Writing this in case anyone encounters the same issues as me, in the hope that they will not lose 20 hours of debugging like I did.
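In case it saves someone else the Secure Boot detour: on EL8 a self-built module can usually be signed with a locally enrolled MOK instead of disabling signature checking. A sketch (the key names and module path are examples; adjust to wherever drbd.ko actually landed, and sign the drbd_transport_*.ko modules the same way):

# generate a local signing key pair
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
    -subj "/CN=Local DRBD module signing/" \
    -keyout MOK.priv -outform DER -out MOK.der
# enroll the public key; set a one-time password, then confirm in the MOK manager on next reboot
mokutil --import MOK.der
# sign the built module with the kernel's sign-file helper
/usr/src/kernels/$(uname -r)/scripts/sign-file sha256 MOK.priv MOK.der \
    /lib/modules/$(uname -r)/extra/drbd/drbd.ko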

mitzone commented Jun 07 '22

I'm still getting the "make_request bug: can't convert block across chunks or bigger than 512k" error with the CentOS 7 kernel 3.10.0-1160.71.1, using the kernel module from DRBD 9.1.7, when the backing store is an LVM logical volume whose PV is an md RAID device. The 9.1.7 kmod shows the same error with several previous versions of the CentOS kernel. By contrast, the 9.1.4 kmod works with all CentOS kernels since at least 3.10.0-1160.49.1.

Btw, other LVs in that same VG (also on the same md RAID10 PV) have no issues. Only the LVs being used as DRBD backing storage are affected, and only with DRBD > 9.1.4. Something in the newer DRBD kmods breaks on md devices. I see no LVM messages, only the md error, and of course that leads to the DRBD messages about moving from UpToDate to Failed to Diskless.
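If it helps isolate the difference, the request limits on a DRBD-backed LV can be compared against a plain LV in the same VG (a sketch; the dm-* names have to be mapped first):

ls -l /dev/vg_b/                                # maps each LV to its dm-* kernel name
lsblk -t                                        # I/O topology for the whole md -> LVM -> DRBD stack
grep . /sys/block/dm-*/queue/max_sectors_kb     # maximum request size the kernel allows on each LV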

Jaybus2 commented Jul 27 '22

Update: this issue still persists in 9.1.12. It always starts with an md raid10 error: md/raid10:md200: make_request bug: can't convert block across chunks or bigger than 512k 33530432 256. It does not happen when the storage for the DRBD device is on md raid1, only on md raid10. I am not set up to test any other raid levels.
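For what it's worth, other raid levels could probably be tested without touching production storage by building a throwaway array on loop devices. A sketch (file sizes, paths and the array name are arbitrary):

# create four backing files and attach them to free loop devices
for i in 0 1 2 3; do truncate -s 1G /var/tmp/r10_$i.img; done
loops=$(for i in 0 1 2 3; do losetup -f --show /var/tmp/r10_$i.img; done)
# build a scratch raid10 array with a 512K chunk, matching the failing setup
mdadm --create /dev/md/test --level=10 --raid-devices=4 --chunk=512 $loops
# then layer an LV and a small DRBD resource on top of /dev/md/test and promote it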

Info on the DRBD device (and underlying LVM and md raid10 devices) causing the issue for me is below. Note that other LVs on this same raid10 PV that are locally mounted or used for iSCSI (i.e. not used for DRBD backing storage) work just fine. Also note that the raid10 device chunk size is 512K.

[root@cnode3 drbd.d]# uname -r
3.10.0-1160.71.1.el7.x86_64

[root@cnode3 drbd.d]# cat r13_access_home.res
resource drbd_access_home {
    meta-disk internal;
    on cnode3 {
        node-id 0;
        device /dev/drbd13 minor 13;
        disk /dev/vg_b/lv_access_home;
        address ipv4 10.0.99.3:7801;
    }
    on cnode2 {
        node-id 1;
        device /dev/drbd13 minor 13;
        disk /dev/vg_b/lv_access_home;
        address ipv4 10.0.99.2:7801;
    }
}

[root@cnode3 ~]# lvdisplay /dev/vg_b/lv_access_home
  --- Logical volume ---
  LV Path                /dev/vg_b/lv_access_home
  LV Name                lv_access_home
  VG Name                vg_b
  LV UUID                jiTZLD-CGmp-x9W3-AxcF-kcjH-GWgW-DsDJ0I
  LV Write Access        read/write
  LV Creation host, time cnode3, 2017-09-06 14:33:06 -0400
  LV Status              available
  # open                 2
  LV Size                350.00 GiB
  Current LE             89600
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           253:18

[root@cnode3 ~]# pvdisplay /dev/md125
  --- Physical volume ---
  PV Name               /dev/md125
  VG Name               vg_b
  PV Size               <3.64 TiB / not usable 4.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              953799
  Free PE               159175
  Allocated PE          794624
  PV UUID               CY3lVe-5E4f-fIS5-RmMv-HYI1-nH0Z-Iil0Ig

[root@cnode3 ~]# mdadm -D /dev/md125
/dev/md125:
           Version : 1.2
     Creation Time : Wed Jan 13 12:12:53 2016
        Raid Level : raid10
        Array Size : 3906764800 (3.64 TiB 4.00 TB)
     Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Dec 21 10:13:45 2022
             State : active, checking
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

      Check Status : 33% complete

              Name : cnode3:3  (local to host cnode3)
              UUID : 1204deeb:5393b7c0:7630ffc9:b6f7d835
            Events : 921962

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync set-A   /dev/sdb1
       1       8        1        1      active sync set-B   /dev/sda1
       2       8       33        2      active sync set-A   /dev/sdc1
       3       8       49        3      active sync set-B   /dev/sdd1

Jaybus2 commented Dec 21 '22

This bug still persists in 9.1.13, with a caveat. As of 9.1.13 it works with an md raid10 backing device as long as the DRBD device is secondary, and resync works at startup. However, when the DRBD device is made primary, the same errors persist. Tested on CentOS 7 with the latest kernel, 3.10.0-1160.83.1.el7. Kernel log messages:

Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: Preparing cluster-wide state change 3028602266 (1->-1 3/1)
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: State change 3028602266: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: Committing cluster-wide state change 3028602266 (0ms)
Mar 1 08:43:12 cnode2 kernel: drbd drbd_access_home: role( Secondary -> Primary )
Mar 1 08:43:39 cnode2 kernel: md/raid10:md127: make_request bug: can't convert block across chunks or bigger than 256k 448794880 132
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: disk( UpToDate -> Failed )
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: Local IO failed in drbd_request_endio. Detaching...
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: local READ IO error sector 29362432+264 on ffff9fcff9a389c0
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: sending new current UUID: 9C66E258C0F9F361
Mar 1 08:43:39 cnode2 kernel: drbd drbd_access_home/0 drbd13: disk( Failed -> Diskless )
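For what it's worth, the failure seems to be provoked by any sufficiently large I/O once the device is primary, so something like the following should presumably reproduce it on demand (a sketch using the device from the resource file above; the dd parameters are arbitrary):

drbdadm primary drbd_access_home
# large direct reads through the DRBD device, big enough to span a raid10 chunk
dd if=/dev/drbd13 of=/dev/null bs=1M count=64 iflag=direct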

Jaybus2 commented Mar 01 '23

I faced a similar issue with 9.1.16. The backing device is an md raid0 device in my case. After I detached the device from the primary node and attached it again, the same error occurred.
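The detach/attach cycle in question is just the standard drbdadm sequence (a sketch; the resource name r0 is a placeholder):

drbdadm detach r0      # drop the backing disk on the primary, device goes Diskless
drbdadm attach r0      # re-attach the md raid0 backing disk; this is when the error shows up again
drbdadm status r0      # check the disk state afterwards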

josedev-union commented Sep 27 '23