
Disk corruption on 6.12.y

Open folkertvanheusden opened this issue 1 year ago • 10 comments

Describe the bug

Revision 521f2baed818c04981fd61b275c996a8ef03b833 of branch rpi-6.12.y causes massive disk corruption when the realtime kernel is enabled.

Steps to reproduce the behaviour

Configure kernel for real-time.

Device(s)

Raspberry Pi Zero 2 W, Raspberry Pi 3 Mod. B

System

raspbian 64bit lite

Logs

Not directly available. If this problem is of interest, I can see if I can reproduce it. Errors I saw were related to directory-nodes being full.

Additional context

No response

folkertvanheusden avatar Nov 25 '24 21:11 folkertvanheusden

Realtime support is new to 6.12:

  1. Have you had a non-corrupting 6.12 realtime kernel before the indicated commit?
  2. Does a non-realtime build of 6.12 work for you?
  3. Have you had a non-corrupting 6.11 realtime (with patches) kernel?
  4. Have you tried realtime kernels before?

pelwell avatar Nov 25 '24 21:11 pelwell

  5. On which medium are you seeing corruptions - SD card or some external storage?

pelwell avatar Nov 25 '24 21:11 pelwell

  1. no, it was the first I tried
  2. seems to fail as well. see log below.
  3. no
  4. no. the RT patches were only merged in 6.12 I believe.
  5. sd cards (I tried 3 different cards, all 32 GB in size)

Both the realtime and non-realtime 6.12 kernel fail:

[  103.850530] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  103.850566] EXT4-fs error (device mmcblk0p2): htree_dirblock_to_tree:1083: inode #55844: block 2: comm apt: Directory block failed checksum
[  103.850777] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  103.850791] EXT4-fs error (device mmcblk0p2): htree_dirblock_to_tree:1083: inode #55844: block 2: comm apt: Directory block failed checksum
[  104.173859] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  104.173892] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm apt: Directory block failed checksum
[  104.174135] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  104.174149] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm apt: Directory block failed checksum
[  104.236603] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  104.236680] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm apt: Directory block failed checksum
[  104.237061] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm apt: No space for directory leaf checksum. Please run e2fsck -D.
[  104.237075] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm apt: Directory block failed checksum
[  104.303747] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm http: No space for directory leaf checksum. Please run e2fsck -D.
[  104.303791] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm http: Directory block failed checksum
[  104.309206] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm http: No space for directory leaf checksum. Please run e2fsck -D.
[  104.309240] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm http: Directory block failed checksum
[  104.316966] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm http: No space for directory leaf checksum. Please run e2fsck -D.
[  104.317007] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm http: Directory block failed checksum
[  104.323162] EXT4-fs warning (device mmcblk0p2): ext4_dirblock_csum_verify:406: inode #55844: comm http: No space for directory leaf checksum. Please run e2fsck -D.
[  104.323196] EXT4-fs error (device mmcblk0p2): ext4_dx_find_entry:1753: inode #55844: block 2: comm http: Directory block failed checksum
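
A log like the one above is easier to compare between kernels when it is condensed. The following is a minimal sketch, assuming the message layout shown in these lines (level, device, function name, inode number); the function name `summarize_ext4` and the script name in the usage note are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
# [  103.850566] EXT4-fs error (device mmcblk0p2): htree_dirblock_to_tree:1083: inode #55844: ...
# The layout is taken from the log above; other EXT4 messages may look different.
EXT4_LINE = re.compile(
    r"EXT4-fs (?P<level>warning|error) \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):\d+: inode #(?P<inode>\d+):"
)


def summarize_ext4(log_text):
    """Count EXT4-fs warnings/errors per (level, device, function, inode)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = EXT4_LINE.search(line)
        if m:
            key = (m["level"], m["device"], m["func"], int(m["inode"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from stdin and prints one line per distinct message.
    for key, n in sorted(summarize_ext4(sys.stdin.read()).items()):
        print(n, *key)
```

It could be fed with something like `dmesg | python3 ext4_summary.py` (hypothetical file name). If every hit names the same inode, as here, a single damaged directory is more likely than card-wide corruption.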

folkertvanheusden avatar Nov 26 '24 08:11 folkertvanheusden

Both the realtime and non-realtime 6.12 kernel fail:

I think you'll need to back off to a known good point, confirm that is reliable then just update the kernel.

Start with a clean install of RPiOS Bookworm 64-bit lite. Update it (sudo apt update && sudo apt full-upgrade).

Confirm sdcard is reliable with no complaints in dmesg when running your normal workloads.

Now update to our build of the 6.12 kernel (sudo rpi-update next), reboot, and report whether there are any sdcard corruption issues.

popcornmix avatar Nov 27 '24 12:11 popcornmix

That build of the kernel works fine. What would be the procedure for getting a kernel with CONFIG_PREEMPT_RT instead?

folkertvanheusden avatar Nov 27 '24 13:11 folkertvanheusden

You said

Both the realtime and non-realtime 6.12 kernel fail:

So I think the first issue to resolve is if the default 6.12 (non-realtime) kernel has an issue with corrupting the sdcard. Once we have an answer to that, we can consider enabling RT. But you don't want to be changing too many things at once.

popcornmix avatar Nov 27 '24 14:11 popcornmix

You said

Both the realtime and non-realtime 6.12 kernel fail:

So I think the first issue to resolve is if the default 6.12 (non-realtime) kernel has an issue with corrupting the sdcard. Once we have an answer to that, we can consider enabling RT. But you don't want to be changing too many things at once.

Yes, but not the one from rpi-update next. I did a diff of the rpi-update next version of .config and the one I came up with, and I see quite a few possible reasons for my version to fail. So I'm going to try changing only the scheduling parameters in the rpi-update next version and see if that helps.
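
For anyone repeating that comparison, here is a rough sketch of a .config diff that ignores ordering and comments. It assumes the usual kernel .config syntax (`CONFIG_X=value` and `# CONFIG_X is not set`) and, as a simplification, treats an option that is absent from a file as `n`. The names `parse_config` and `diff_configs` are made up for this example; the kernel tree's `scripts/diffconfig` does a similar job.

```python
import re
import sys

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; explicitly unset options become 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_configs(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    Assumption: an option missing from one file is treated as 'n'.
    """
    diff = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "n"), b.get(key, "n")
        if va != vb:
            diff[key] = (va, vb)
    return diff


if __name__ == "__main__":
    # Usage: python3 config_diff.py known-good.config own.config
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        changes = diff_configs(parse_config(fa.read()), parse_config(fb.read()))
    for name, (va, vb) in changes.items():
        print(f"{name}: {va} -> {vb}")
```

Starting from the known-good config and applying the listed differences a few at a time keeps the number of changed variables small, which is the approach suggested earlier in the thread.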

folkertvanheusden avatar Nov 27 '24 14:11 folkertvanheusden

@folkertvanheusden were you able to reproduce the errors with a more recent 6.12 version? Haven't seen such errors on PREEMPT RT enabled 6.12 builds yet.

Tested our tree (with some vendor patches, mostly targeting dt for our hardware platform) on CM5 with eMMC, pi500 with the stock rpi sd card and rpi2w. So far none of these has reported any ext4 warnings - even under high (io) load tests. Are there any other warnings / errors in kernel log? There are still some quirks with rt on rpi (dwc_otg for example needs patching or switch to dwc2)

nbuchwitz avatar Jan 03 '25 10:01 nbuchwitz

Hi,

I haven't tried it further, maybe later.

folkertvanheusden avatar Jan 03 '25 10:01 folkertvanheusden

Hi, I also had disk corruption issues, which occasionally disappeared after reboots. With a console open, I saw bus errors several times, which I started to examine. They occurred sporadically, without any apparent reason.

I deactivated all overclocking and the errors disappeared. Now all the Pi Zero 2 W raspis work as expected. The 6.12.x kernels seem to be sensitive to overclocking.

Hope this helps someone :)
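
A quick way to check whether a board is overclocked is to look for the relevant settings in config.txt. This is a small sketch, assuming the file lives at /boot/firmware/config.txt (older images use /boot/config.txt) and that the settings of interest are the ones in `OVERCLOCK_KEYS`; that set is a selection made for this example, not a complete list, and the function name is made up.

```python
import sys

# Firmware options commonly used for overclocking (illustrative selection).
OVERCLOCK_KEYS = {
    "arm_freq", "core_freq", "gpu_freq",
    "sdram_freq", "over_voltage", "force_turbo",
}


def find_overclock_settings(config_text):
    """Return (line_number, key, value) for each active overclock setting."""
    found = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue  # blank lines and section filters such as [all]
        key, value = (part.strip() for part in line.split("=", 1))
        if key in OVERCLOCK_KEYS:
            found.append((lineno, key, value))
    return found


if __name__ == "__main__":
    # Path is an assumption; pass another one as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/boot/firmware/config.txt"
    with open(path) as f:
        for lineno, key, value in find_overclock_settings(f.read()):
            print(f"{path}:{lineno}: {key}={value}")
```

Commenting out whatever it reports and rebooting returns the board to stock clocks, which is a cheap thing to rule out before suspecting the kernel.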

axeljerabek avatar Apr 03 '25 16:04 axeljerabek