Raspberry Pi 4 : USB3 SSD Connected via USB3 Hub : BOOT Files Corrupted Following Reboot

Open reraikes opened this issue 3 years ago • 84 comments
Describe the bug

Files written to the FAT32 BOOT filesystem of a USB3 SSD connected via a USB3 hub sporadically get corrupted when followed immediately by a reboot. This issue does not occur when the SSD is connected directly to one of the Raspberry Pi 4 USB3 ports. This issue only occurs on the FAT32 BOOT filesystem, never on the EXT4 ROOT filesystem. A 'sync' command prior to the reboot does not eliminate the issue. This issue does not occur if the hub is connected to one of the Raspberry Pi 4 USB2 ports.

Steps to reproduce the behaviour

Test environment:

  1. Raspberry Pi 4B 4GB (revision 1.1 or 1.2)

  2. SSD: Samsung EVO 860 or Corsair Neutron

  3. SATA to USB3 adapter: Asmedia-based (from 5 different manufacturers) [Bus 002 Device 003: ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge] or Seagate [Bus 002 Device 003: ID 0bc2:50a0 Seagate RSS LLC FA GoFlex Desk]

  4. Powered USB3 Hub: Realtek-based [Bus 002 Device 003: ID 0bda:0411 Realtek Semiconductor Corp. Hub / Bus 001 Device 004: ID 0bda:5411 Realtek Semiconductor Corp. RTS5411 Hub] or VIA Labs-based [Bus 002 Device 004: ID 2109:0817 VIA Labs, Inc. USB3.0 Hub / Bus 001 Device 005: ID 2109:2817 VIA Labs, Inc. USB2.0 Hub]

  5. Raspberry Pi OS Bullseye booted from the SSD (not running on an SD card)

The attached 'test' script performs the following actions:

  1. Copy the entire file structure from /boot to /SAVED-BOOT
  2. Copy the entire file structure from /SAVED-BOOT to /boot
  3. Run 'diff -r /SAVED-BOOT /boot' to ensure /boot is identical to /SAVED-BOOT
  4. Reboot

Following the reboot, executing a 'diff -r /SAVED-BOOT /boot' sporadically reveals that one or more /boot files are corrupt (frequency and degree of corruption varies). test.zip

Device (s)

Raspberry Pi 4 Mod. B

System

Raspberry Pi reference 2021-10-30 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 288b21fc27e128ea6b330777aca68e0061ebf4fe, stage2

Jan 20 2022 13:56:48 Copyright (c) 2012 Broadcom version bd88f66f8952d34e4e0613a85c7a6d3da49e13e2 (clean) (release) (start)

Linux raspberrypi 5.10.92-v7l+ #1514 SMP Mon Jan 17 17:38:03 GMT 2022 armv7l GNU/Linux

Logs

No response

Additional context

No response

reraikes avatar Jan 25 '22 04:01 reraikes

This issue also occurs using a Seagate BarraCuda HD (ST4000LM024), but with less frequency than an SSD.

reraikes avatar Jan 27 '22 09:01 reraikes

The firmware isn't writing to the boot partition, and the hardware doesn't know the difference between the boot partition and the rest of the drive, so the problem is likely to be in the kernel or higher.

I would start by putting an explicit sync in your test script after the copy. You can also use sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" to cause any cached files to be invalidated, meaning that your diff will actually be checking the drive contents.
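A minimal sketch of that verification step, assuming a POSIX shell. The function name `verify_from_disk` and its arguments are hypothetical and not taken from the attached script; dropping caches needs root, so the sketch skips it when the control file is not writable (in which case the comparison may still be served from cache).

```shell
#!/bin/sh
# Hypothetical helper: compare two directory trees after flushing dirty
# pages and dropping the page cache, so reads come from the drive.
verify_from_disk() {
    src=$1
    dst=$2
    # Write all dirty pages out to the device.
    sync
    # Drop page cache, dentries and inodes (requires root).
    if [ -w /proc/sys/vm/drop_caches ]; then
        { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
    # Exit status 0 means the trees are identical.
    diff -r "$src" "$dst"
}
```

Called as `verify_from_disk /SAVED-BOOT /boot` (run as root), a non-zero exit status would indicate that what is on the drive differs from the saved copy.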

pelwell avatar Jan 27 '22 09:01 pelwell

@pelwell was this comment intended for https://github.com/raspberrypi/linux/pull/4812

popcornmix avatar Jan 27 '22 12:01 popcornmix

Yes.

pelwell avatar Jan 27 '22 13:01 pelwell

The problem occurs when a user writes files to the FAT32 filesystem in the BOOT partition (/boot) and then immediately reboots.

As originally stated, 'sync' commands have no effect on the problem (a reboot should implicitly flush all buffers).

I will try:

sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

prior to the reboot and report whether it eliminates the file corruption that shows up following the reboot.

reraikes avatar Jan 27 '22 18:01 reraikes

The point is to empty the cache before the verification, to ensure that the files are being read from the drive.

pelwell avatar Jan 27 '22 19:01 pelwell

The point is to empty the cache before the verification, to ensure that the files are being read from the drive.

Adding the 'drop_caches' allows the problem to be observed without the need for a reboot and eliminates reboots as the source of the problem. The problem appears to simply be intermittent disk write failures to the FAT32 filesystem when a USB3 hub is used.

The problem occurs when a USB3 SSD or HD is connected via a powered USB3 hub (Realtek- or VIA Labs-based) but never occurs when it's connected directly to one of the Raspberry Pi 4 USB ports.

The updated 'test' script attached to this comment contains the added 'drop_caches' and reliably displays the intermittent failures without the need for a reboot and manual 'diff' afterward. test.zip

reraikes avatar Jan 28 '22 03:01 reraikes

May or may not be relevant: https://forums.raspberrypi.com/viewtopic.php?t=244948

XECDesign avatar Jan 28 '22 09:01 XECDesign

As was asked on that thread:

What does hdparm -W report? Do you get the same corruption if you disable the drive write cache with hdparm -W0?

pelwell avatar Jan 28 '22 09:01 pelwell

What does hdparm -W report?

write-caching = 1 (on)

Do you get the same corruption if you disable the drive write cache with hdparm -W0?

Yes. The problem is unchanged with 'write-caching = 0 (off)'.
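For anyone repeating this check in a script, here is a small sketch that extracts the write-cache state from `hdparm -W` output. The function name `parse_write_cache` is hypothetical, and the device name in the usage comments is an assumption.

```shell
#!/bin/sh
# Read `hdparm -W <device>` output on stdin and print 1 (on) or 0 (off).
parse_write_cache() {
    sed -n 's/.*write-caching *= *\([01]\).*/\1/p'
}

# Usage on real hardware (device name is an assumption):
#   sudo hdparm -W /dev/sda | parse_write_cache
#   sudo hdparm -W0 /dev/sda    # disable the drive write cache, then re-test
```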

reraikes avatar Jan 28 '22 19:01 reraikes

Does disabling LPM on the device make a difference?

From lsusb find the vid:pid pair for the device, for example:

$ lsusb
Bus 002 Device 003: ID 05e3:0743 Genesys Logic, Inc. SDXC and microSDXC CardReader
Bus 002 Device 004: ID 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet
Bus 002 Device 002: ID 05e3:0626 Genesys Logic, Inc. USB3.1 Hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 05e3:0610 Genesys Logic, Inc. Hub
Bus 001 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

The mass-storage device in question is 05e3:0743

Add the following parameter to /boot/cmdline.txt: usbcore.quirks=05e3:0743:k

And reboot, then run your test.
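The steps above can be sketched as a small helper that turns one line of `lsusb` output into the kernel parameter. The function name `quirk_param_from_lsusb` is hypothetical; the `k` flag is simply the one suggested in this comment.

```shell
#!/bin/sh
# Build a usbcore.quirks parameter from a single lsusb line such as
# "Bus 002 Device 003: ID 05e3:0743 Genesys Logic, Inc. ...".
quirk_param_from_lsusb() {
    # Extract the vid:pid pair that follows "ID".
    id=$(printf '%s\n' "$1" |
        sed -n 's/.* ID \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\).*/\1/p')
    # Fail if the line did not contain a vid:pid pair.
    [ -n "$id" ] || return 1
    printf 'usbcore.quirks=%s:k\n' "$id"
}
```

The resulting string is appended to the existing single line in /boot/cmdline.txt, separated from the other parameters by a space.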

P33M avatar Feb 02 '22 14:02 P33M

Bus 002 Device 002: ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ...

Adding 'usbcore.quirks=174c:55aa:k' to /boot/cmdline.txt has no effect on the problem using the attached 'test' script:

Ok to run test (y/n)? y

Saving BOOT partition

Restoring BOOT partition

Syncing + Dropping Caches

Comparing BOOT partition
Binary files /SAVED-BOOT/overlays/spi5-1cs.dtbo and /boot/overlays/spi5-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi5-2cs.dtbo and /boot/overlays/spi5-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-1cs.dtbo and /boot/overlays/spi6-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-2cs.dtbo and /boot/overlays/spi6-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio35-39.dtbo and /boot/overlays/spi-gpio35-39.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio40-45.dtbo and /boot/overlays/spi-gpio40-45.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-rtc.dtbo and /boot/overlays/spi-rtc.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306.dtbo and /boot/overlays/ssd1306.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306-spi.dtbo and /boot/overlays/ssd1306-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1331-spi.dtbo and /boot/overlays/ssd1331-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1351-spi.dtbo and /boot/overlays/ssd1351-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/superaudioboard.dtbo and /boot/overlays/superaudioboard.dtbo differ
Binary files /SAVED-BOOT/overlays/sx150x.dtbo and /boot/overlays/sx150x.dtbo differ

Restored BOOT partition is corrupt

test.zip

reraikes avatar Feb 02 '22 19:02 reraikes

Can you put a sample of those corrupted files and the originals in one or two zip files and attach them? I'm curious to see the nature of the corruption.

pelwell avatar Feb 02 '22 19:02 pelwell

Ok to run test (y/n)? y

Saving BOOT partition

Restoring BOOT partition

Syncing + Dropping Caches

Comparing BOOT partition
Binary files /SAVED-BOOT/overlays/spi5-1cs.dtbo and /boot/overlays/spi5-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi5-2cs.dtbo and /boot/overlays/spi5-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-1cs.dtbo and /boot/overlays/spi6-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-2cs.dtbo and /boot/overlays/spi6-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio35-39.dtbo and /boot/overlays/spi-gpio35-39.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio40-45.dtbo and /boot/overlays/spi-gpio40-45.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-rtc.dtbo and /boot/overlays/spi-rtc.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306.dtbo and /boot/overlays/ssd1306.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306-spi.dtbo and /boot/overlays/ssd1306-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1331-spi.dtbo and /boot/overlays/ssd1331-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1351-spi.dtbo and /boot/overlays/ssd1351-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/superaudioboard.dtbo and /boot/overlays/superaudioboard.dtbo differ

Restored BOOT partition is corrupt

files.zip

reraikes avatar Feb 02 '22 21:02 reraikes

Interesting - the corrupted files appear to contain valid chunks of overlays in a scrambled order. Running md5sum on individual sectors shows some whole sectors migrating between files. For example, the first sector of the BAD spi5-2cs.dtbo is identical to the third sector of the GOOD spi5-2cs.dtbo and spi6-2cs.dtbo. There would probably be more evidence of duplication if not for the fact that the short files mean a large percentage of the sectors are not used in their entirety.
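A rough sketch of the per-sector comparison described here, assuming 512-byte sectors (an assumption, not stated in the thread) and a hypothetical function name `sector_md5`. A short final sector is hashed as-is.

```shell
#!/bin/sh
# Print "<sector index> <md5>" for every 512-byte sector of a file.
sector_md5() {
    file=$1
    size=$(wc -c < "$file")
    # Round up so a partial last sector is included.
    count=$(( (size + 511) / 512 ))
    i=0
    while [ "$i" -lt "$count" ]; do
        sum=$(dd if="$file" bs=512 skip="$i" count=1 2>/dev/null |
            md5sum | cut -d' ' -f1)
        printf '%s %s\n' "$i" "$sum"
        i=$((i + 1))
    done
}
```

Running it on a good file and on its corrupted counterpart and comparing the two hash lists shows which sectors match and at which positions.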

The fact that the ext4 partition never shows this problem suggests this is the result of an interaction between the FAT filesystem and the SSD. Does sudo fsck.vfat /dev/sda1 (or whatever the device name is) show any problems after the copy?

pelwell avatar Feb 03 '22 10:02 pelwell

Run from an SD card:

root@raspberrypi:~# fsck -f /dev/sda1
fsck from util-linux 2.36.1
fsck.fat 4.2 (2021-01-31)
/dev/sda1: 283 files, 61039/516190 clusters

reraikes avatar Feb 03 '22 18:02 reraikes

This issue is definitely related to writing files to a FAT32 filesystem if the USB3 SSD/HD is connected via a USB3 hub (the problem does not occur if the USB3 SSD/HD is connected directly to a Raspberry Pi 4 USB3 port):

  1. The problem occurs with any FAT32 filesystem on any partition (for example, a /dev/sda3 partition formatted as FAT32), not just the BOOT partition (/dev/sda1).

  2. The problem goes away if the FAT32 filesystem is reformatted to EXT4.

It appears there is a bug in the Raspberry Pi 4 USB3 driver when a USB3 hub is used in conjunction with a FAT32 filesystem.

reraikes avatar Feb 05 '22 22:02 reraikes

Does an explicit unmount of the target filesystem before reboot prevent corruption from happening? What happens if you mount the fat32 partition with "-o flush" and do the test?
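One way to set up the second experiment is sketched below. The helper names `run` and `remount_with_flush` are hypothetical, and the device and mount point in the usage line are assumptions. By default the sketch only prints the commands; set `DRY_RUN=0` and run as root to execute them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Unmount a FAT32 partition and mount it again with the vfat "flush" option.
remount_with_flush() {
    dev=$1
    mnt=$2
    run umount "$mnt"
    run mount -t vfat -o flush "$dev" "$mnt"
}
```

For example, `remount_with_flush /dev/sda1 /boot` followed by the test script. For the first experiment, a plain `umount /boot` before rebooting is enough.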

P33M avatar Feb 07 '22 10:02 P33M

I'm using your second test script on a fresh (apt updated) Bullseye install. It's on a 2GB Pi 4 but that shouldn't matter. I'm using an SSD with a standard Raspberry Pi OS partition layout - /dev/sda1 is a 512M fat32 partition - and the fat32 partition is mounted on e.g. /home/pi/fat32 in place of the /boot location.

I don't see any filesystem corruption. Can you clarify:

  • Where the fat32 FS is being mounted
  • Whether you have overlayFS enabled or not

P33M avatar Feb 07 '22 16:02 P33M

There is no mounting of anything nor any rebooting involved. There is no overlayFS involved.

Just a virgin Raspberry Pi OS loaded/booted/running on a USB3 SSD which is connected to a USB3 powered hub which is connected to a Raspberry Pi 4 USB3 port:

echo "Saving BOOT partition"
rm -r /SAVED-BOOT &> /dev/null
mkdir /SAVED-BOOT
cp -p -r -T /boot /SAVED-BOOT

echo ""
echo "Restoring BOOT partition"
cp --preserve=timestamps -r -T /SAVED-BOOT /boot

echo ""
echo "Syncing + Dropping Caches"
sync
echo 3 > /proc/sys/vm/drop_caches

echo ""
echo "Comparing BOOT partition"
diff -r /SAVED-BOOT /boot
if [ $? -ne 0 ]; then
  errexit "Restored BOOT partition is corrupt"
fi

echo ""
echo "Restored BOOT partition is intact"
echo ""

reraikes avatar Feb 07 '22 18:02 reraikes

Please post the output of mount.

P33M avatar Feb 07 '22 19:02 P33M

root@raspberrypi:~# mount
/dev/sda2 on / type ext4 (rw,noatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=1799320k,nr_inodes=449830,mode=755)
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=786008k,nr_inodes=819200,mode=755)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=393000k,nr_inodes=98250,mode=700)

reraikes avatar Feb 07 '22 19:02 reraikes

I can get corruption of the sort you describe, to varying degrees of brokenness, if the adapter I am using has a known interop issue with UAS. The adapter can reliably boot, but using the test script has a high chance of causing transfer timeouts if there is a hub in the path. After running your script and getting a bad result, please post the full output of dmesg.

If there are errors relating to UAS, then follow the guide here and re-test:

https://forums.raspberrypi.com/viewtopic.php?t=245931

P33M avatar Feb 08 '22 11:02 P33M

The 'test' script failed 6 times in a row before I was able to get a run that didn't fail. The only thing that showed up in dmesg was:

[35096.662229] test (1893): drop_caches: 3
[35098.417446] test (1893): drop_caches: 3
[35100.206815] test (1893): drop_caches: 3
[35102.131658] test (1893): drop_caches: 3
[35104.506642] test (1893): drop_caches: 3
[35106.296305] test (1893): drop_caches: 3
[35108.065804] test (1893): drop_caches: 3

I added 'usb-storage.quirks=hhhh:hhhh:u' anyway, but it did not have any effect on this issue.

I have Asmedia-based SATA-to-USB adapters from 6 different manufacturers. All of them support UAS properly, all of them exhibit this issue, and 'usb-storage.quirks' has no effect on the issue with any of them. I had tested all of them and ruled out UAS as the source of the problem before opening this ticket. They all work 100% reliably when connected directly to the Raspberry Pi 4B (no USB3 hub in the loop). I don't own any adapters with UAS problems.

This issue occurs on all MSD's, including a new Samsung EVO 860 SSD, a Corsair Neutron SSD, and a Seagate BarraCuda HD (ST4000LM024). It even occurs on faster USB3 flash drives such as a 64 GB Sandisk Extreme (but with less frequency) when a USB3 hub is used.

I have a hard time believing this is a UAS problem since only FAT32 filesystems are affected.

The attached dmesg.txt covers 9 failures in a row before I was able to get a run that didn't fail.

dmesg.zip

Another test encountered 19 retries in a row before the copy succeeded without any corrupted files. This was the result in dmesg:

[ 127.459858] test (728): drop_caches: 3
[ 129.241392] test (728): drop_caches: 3
[ 130.998022] test (728): drop_caches: 3
[ 132.699318] usb 2-1.3: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[ 132.958122] test (728): drop_caches: 3
[ 134.735403] test (728): drop_caches: 3
[ 136.591473] test (728): drop_caches: 3
[ 138.356371] test (728): drop_caches: 3
[ 140.208650] test (728): drop_caches: 3
[ 142.648625] test (728): drop_caches: 3
[ 144.413531] test (728): drop_caches: 3
[ 146.194291] test (728): drop_caches: 3
[ 147.994893] test (728): drop_caches: 3
[ 149.771858] test (728): drop_caches: 3
[ 151.556305] test (728): drop_caches: 3
[ 153.395313] test (728): drop_caches: 3
[ 155.180204] test (728): drop_caches: 3
[ 156.965902] test (728): drop_caches: 3
[ 158.707804] usb 2-1.3: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
[ 158.970973] test (728): drop_caches: 3
[ 160.752028] test (728): drop_caches: 3
[ 162.571091] test (728): drop_caches: 3
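To compare captures like the one above, a short sketch that counts copy attempts and device resets in dmesg text. The function name `summarise_dmesg` is hypothetical, and it assumes that each `drop_caches: 3` line corresponds to one copy attempt of the test script.

```shell
#!/bin/sh
# Read dmesg text on stdin; count test attempts and USB device resets.
summarise_dmesg() {
    awk '
        /drop_caches: 3/   { runs++ }
        /reset SuperSpeed/ { resets++ }
        END { printf "runs=%d resets=%d\n", runs, resets }
    '
}
```

Typical use would be `dmesg | summarise_dmesg`, or feeding it one of the attached text files. A run count well above the reset count supports the observation that corruption is seen without an intervening reset.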

reraikes avatar Feb 08 '22 18:02 reraikes

Attached is an enhanced version of the 'test' script you've been using to reproduce the FAT32 file corruption issue.

The attached 'test-retry' script performs the same operations as the 'test' script, but by default, suppresses the output of 'diff' and retries the restoration copy until it succeeds, displaying the retry count along the way:


root@raspberrypi:~# ./test-retry

Ok to run test (y/n)? y

Saving BOOT partition

Restoring BOOT partition [ Failed / Retrying: 9 ]

BOOT partition successfully restored


This is a more convenient way to illustrate the severity of the problem and ensures you're not left with corrupted files at the conclusion of the test.
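The retry behaviour can be sketched roughly as follows. This is not the attached script: the function name `restore_with_retry`, its arguments, and the cap on attempts are all assumptions (the attached script retries until the copy succeeds).

```shell
#!/bin/sh
# Copy SRC onto DST and verify; retry until both trees match or MAX
# attempts have been used. DST must already exist.
restore_with_retry() {
    src=$1
    dst=$2
    max=${3:-20}
    attempt=0
    while [ "$attempt" -lt "$max" ]; do
        cp -r "$src/." "$dst/"
        sync
        # Drop caches (root only) so the comparison reads from the drive.
        if [ -w /proc/sys/vm/drop_caches ]; then
            { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
        fi
        if diff -r "$src" "$dst" > /dev/null; then
            echo "restored after $attempt retries"
            return 0
        fi
        attempt=$((attempt + 1))
        echo "failed / retrying: $attempt" >&2
    done
    return 1
}
```

For example, `restore_with_retry /SAVED-BOOT /boot` as root. The cap keeps the loop from running forever on a drive that never produces a clean copy.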

If the 'test-retry' script is run with a '--noretry' option, it functions identically to the 'test' script you've been using:


root@raspberrypi:~# ./test-retry --noretry

Ok to run test (y/n)? y

Saving BOOT partition

Restoring BOOT partition
Binary files /SAVED-BOOT/overlays/spi5-1cs.dtbo and /boot/overlays/spi5-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi5-2cs.dtbo and /boot/overlays/spi5-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-1cs.dtbo and /boot/overlays/spi6-1cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi6-2cs.dtbo and /boot/overlays/spi6-2cs.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio35-39.dtbo and /boot/overlays/spi-gpio35-39.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-gpio40-45.dtbo and /boot/overlays/spi-gpio40-45.dtbo differ
Binary files /SAVED-BOOT/overlays/spi-rtc.dtbo and /boot/overlays/spi-rtc.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306.dtbo and /boot/overlays/ssd1306.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1306-spi.dtbo and /boot/overlays/ssd1306-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1331-spi.dtbo and /boot/overlays/ssd1331-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/ssd1351-spi.dtbo and /boot/overlays/ssd1351-spi.dtbo differ
Binary files /SAVED-BOOT/overlays/superaudioboard.dtbo and /boot/overlays/superaudioboard.dtbo differ

Restored BOOT partition is corrupt


test-retry.zip

reraikes avatar Feb 09 '22 02:02 reraikes

I can run your script dozens of times (with UAS quirk applied) without a broken filesystem happening.

Whatever the issue is, it must be down to your hardware. Booting from USB SSDs is a very common use-case, and it is recommended to use a powered hub as SSDs can draw in excess of 1.5A - causing brownouts on the Pi's USB ports. If there was a software bug, then every time people ran rpi-update or updated the apt package that pulled a new kernel in, we would have thousands of reports of corrupt boot filesystems.

[ 130.998022] test (728): drop_caches: 3
[ 132.699318] usb 2-1.3: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd

This line implies that Linux is unhappy with the device for some reason. Resetting the device will obviously cause corruption if there are pending writes.

Please attach the output of the raspinfo script.

P33M avatar Feb 09 '22 10:02 P33M

I can run your script dozens of times (with UAS quirk applied) without a broken filesystem happening.

So you're testing on a system that requires a UAS quirk to be applied for normal operation? Do you not have an adapter that properly supports UAS? I can't test using an adapter that requires a UAS quirk as I don't own one.

Whatever the issue is, it must be down to your hardware. Booting from USB SSDs is a very common use-case, and it is recommended to use a powered hub as SSDs can draw in excess of 1.5A - causing brownouts on the Pi's USB ports.

But I have absolutely zero problems when no powered hub is used. It's a powered hub (multiple brands) that's causing the problem, not eliminating it. I can use any combination of multiple Raspberry Pi 4B's, multiple adapters, multiple hubs, multiple MSD's, and the problem only occurs on a partition that is formatted FAT32, which can be located anywhere on the drive, not just /dev/sda1, and only when a hub is used.

This line implies that Linux is unhappy with the device for some reason. Resetting the device will obviously cause corruption if there are pending writes.

I agree the resets look out of place, but as many test runs illustrate, the file corruption often occurs without any intervening resets.

raspinfo.zip

reraikes avatar Feb 09 '22 16:02 reraikes

As further evidence this issue is not related to a particular Raspberry Pi 4B nor a particular USB3 MSD nor a particular USB3 SATA-to-USB adapter nor a particular USB3 hub, I ran tests using a 64G USB3 Sandisk Extreme flash drive.

The first tests were with no hub plugged into the Raspberry Pi 4B (rev 1.2). The USB3 flash drive was plugged directly into one of the Pi's USB3 ports (both USB3 ports exhibit the problem and no other USB ports were in use). I ran 12 tests with no corrupted files encountered (corrupted files never occur with a USB3 MSD plugged directly into one of the Pi's USB3 ports):

[ 148.134491] test (700): drop_caches: 3
[ 183.063953] test (708): drop_caches: 3
[ 190.340006] test (716): drop_caches: 3
[ 203.408303] test (724): drop_caches: 3
[ 216.643790] test (732): drop_caches: 3
[ 231.370521] test (740): drop_caches: 3
[ 241.080061] test (749): drop_caches: 3
[ 248.930747] test (757): drop_caches: 3
[ 270.774901] test (765): drop_caches: 3
[ 274.984036] test (773): drop_caches: 3
[ 280.912935] test (781): drop_caches: 3
[ 304.480693] test (789): drop_caches: 3

See the attached dmesg-no-hub.txt.

Then I ran the same tests with the same USB3 flash drive plugged into a USB3 hub (VIA Labs-based) which was plugged into the same Pi's same USB3 port. I ran the same 12 tests with no corrupted files encountered until the very last one:

[ 40.038678] test (713): drop_caches: 3
[ 51.059094] test (721): drop_caches: 3
[ 63.064206] test (731): drop_caches: 3
[ 103.491772] usb 2-1.4: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
[ 104.133090] test (739): drop_caches: 3
[ 115.591212] test (748): drop_caches: 3
[ 123.479040] test (756): drop_caches: 3
[ 160.838845] usb 2-1.4: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
[ 163.515253] test (764): drop_caches: 3
[ 176.261146] test (772): drop_caches: 3
[ 202.890249] test (780): drop_caches: 3
[ 211.017506] test (788): drop_caches: 3
[ 219.464898] test (796): drop_caches: 3
[ 227.152182] usb 2-1.4: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
[ 227.803767] test (804): drop_caches: 3
[ 229.869840] test (804): drop_caches: 3

See the attached dmesg-with-hub.txt.

I then substituted a Realtek-based USB3 hub for the VIA Labs-based USB3 hub used in the previous tests, with no change.

I then substituted another Raspberry Pi 4B (rev 1.1) for the Raspberry Pi 4B (rev 1.2) used in the previous tests, with no change.

The problem appears to be related to the Raspberry Pi 4B handling of MSD's connected to USB3 hubs when writing large quantities of files to FAT32 filesystems.

dmesg-more.zip

reraikes avatar Feb 09 '22 20:02 reraikes

There is a bug here. I swapped to using a mass-storage-only drive caddy, but on a default install it still didn't register corruption. I noticed that you are using the v8 kernel (64bit) and I do get a failing test in this case. It happens on both 5.10 and 5.15.

A protocol analyser shows that the device is being reset because the previous write command failed and the endpoint stalled. The write fails because the amount of data transferred to the OUT endpoint exceeds the length in the CDB. This appears to happen only with a large contiguous transfer: in my case it fails on a 710KiB write, and an excess 8KiB of data was sent.
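To put those figures in the units a write command uses, here is a small worked calculation. It assumes 512-byte logical blocks, which is an assumption and not stated in the thread; `kib_to_blocks` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a size in KiB to a count of 512-byte logical blocks.
kib_to_blocks() {
    echo $(( $1 * 1024 / 512 ))
}

expected=$(kib_to_blocks 710)        # length announced in the CDB
sent=$(kib_to_blocks $((710 + 8)))   # data actually sent to the OUT endpoint
echo "CDB length: $expected blocks, sent: $sent blocks, excess: $((sent - expected)) blocks"
# prints: CDB length: 1420 blocks, sent: 1436 blocks, excess: 16 blocks
```

Under that assumption the device was told to expect 1420 blocks and received 16 more, which is consistent with the stall described above.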

So to trigger the bug we need at least:

  • A hub in the datapath
  • large contiguous blocks of data queued for a write
  • Possibly with multiple outstanding writes
  • With an AARCH64 kernel?

Can you repeat your test with the 32-bit kernel? It's possible that whatever set of circumstances is required to trigger the bug is simply less likely on ARMv7 (and I'd like to eliminate it as a contributing factor).
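A quick way to record which kernel flavour a test run used is to classify the machine name reported by `uname -m`. The function name `kernel_bitness` is hypothetical; the mapping covers only the values expected on a Raspberry Pi.

```shell
#!/bin/sh
# Map a `uname -m` value to the kernel word size.
kernel_bitness() {
    case "$1" in
        aarch64)        echo 64 ;;
        armv7l | armv6l) echo 32 ;;
        *)              echo unknown ;;
    esac
}

# Usage: kernel_bitness "$(uname -m)"
```

On Raspberry Pi OS the kernel flavour is normally selected with the `arm_64bit` setting in config.txt, so a 32-bit run can be confirmed by checking that this reports 32 after the change and a reboot.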

P33M avatar Feb 10 '22 17:02 P33M

Both 32- and 64-bit versions of both Buster and Bullseye exhibit the problem. The problem is equally severe here in all 4 environments. This bug has been present for a very very long time.

reraikes avatar Feb 10 '22 17:02 reraikes