linux icon indicating copy to clipboard operation
linux copied to clipboard

USB SSD disconnects randomly: xHCI host not responding to stop endpoint command

Open tomikaa87 opened this issue 2 years ago • 97 comments

Describe the bug

I have intermittent system crashes with the following setup:

  • Raspberry Pi 4B, rev 1.1, 4 GB RAM
  • USB case: Axagon EE25-XA6 (ASMedia ASM225CM, enabled UASP and TRIM, USB 3.0)
  • SSDs tried: Samsung 840 (non-EVO or PRO) 240 GB, Corsair Force LS 120 GB
  • Official Pi 4 power supply
  • Tried with multiple, high quality USB 3.0 A-to-microB cables
  • The SoC is not overclocked, the Pi has an actively cooled Aluminium Armour case, temps are kept below 55 Celsius
  • The OS is up-to-date

The system boots fine, but crashes randomly after 6-24 hours. dmesg (logs captured via serial port) shows that the USB disk disconnects.

I've ran SSD speeds tests with dd, and successfully read the whole SSD with 300+ MB/s, also written 20+ GB into a test file without problems.

I'm running a few services in Docker, like PiHole, UniFi Controller, Prometheus + Grafana, Blynk local server, Mosquitto MQTT server.

The SSD has an extra partition for ZFS in addition to the standard boot and rootfs partitions. That ZFS volume stores the Docker containers and their data.

My other Pi 4 configuration (8 GB, rev 1.5) uses the same USB 3.0 case, but it runs stable, never had a crash like that. However, it writes less frequently to the SSD since it doesn't run Docker containers.

Steps to reproduce the behaviour

Unfortunately, there are no fix reproduction steps. I've tried disabling USB AutoSuspend via cmdline.txt, but it didn't help. I've also tried a different power supply without luck. However, the system doesn't crash if I plug the USB disk into a USB 2.0 port.

Device (s)

Raspberry Pi 4 Mod. B

System

/etc/rpi-issue

Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 27a8050c3c06e567c794620394a8c2d74262a516, stage2

vcgencmd version

Mar 24 2022 13:19:26
Copyright (c) 2012 Broadcom
version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)

uname -a

Linux services 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 GNU/Linux

sudo rpi-eeprom-update

BOOTLOADER: up to date
   CURRENT: Tue Apr 26 10:24:28 UTC 2022 (1650968668)
    LATEST: Tue Apr 26 10:24:28 UTC 2022 (1650968668)
   RELEASE: default (/lib/firmware/raspberrypi/bootloader/default)
            Use raspi-config to change the release.

  VL805_FW: Dedicated VL805 EEPROM
     VL805: up to date
   CURRENT: 000138a1
    LATEST: 000138a1

lsusb -tv

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
        ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        ID 2109:3431 VIA Labs, Inc. Hub

sudo lsusb -vd 174c:55aa

Bus 002 Device 002: ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.10
  bDeviceClass            0
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0         9
  idVendor           0x174c ASMedia Technology Inc.
  idProduct          0x55aa ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
  bcdDevice            1.00
  iManufacturer           2 AXAGON USB3.0 External HDD
  iProduct                3 AXAGON USB3.0 External HDD
  iSerial                 1 AAAABBBBCCCC0003
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0079
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       1
      bNumEndpoints           4
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     98
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
        MaxStreams             32
        Data-in pipe (0x03)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
        MaxStreams             32
        Data-out pipe (0x04)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x83  EP 3 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst              15
        MaxStreams             32
        Status pipe (0x02)
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x04  EP 4 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               0
        Command pipe (0x01)
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength       0x0016
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType        16
    bDevCapabilityType      2
    bmAttributes   0x0000f41e
      BESL Link Power Management (LPM) Supported
    BESL value     1024 us
    Deep BESL value    61440 us
  SuperSpeed USB Device Capability:
    bLength                10
    bDescriptorType        16
    bDevCapabilityType      3
    bmAttributes         0x00
    wSpeedsSupported   0x000e
      Device can operate at Full Speed (12Mbps)
      Device can operate at High Speed (480Mbps)
      Device can operate at SuperSpeed (5Gbps)
    bFunctionalitySupport   1
      Lowest fully-functional device speed is Full Speed (12Mbps)
    bU1DevExitLat          10 micro seconds
    bU2DevExitLat        2047 micro seconds
can't get debug descriptor: Resource temporarily unavailable
Device Status:     0x000d
  Self Powered
  U1 Enabled
  U2 Enabled

sudo lshw

services
    description: Computer
    product: Raspberry Pi 4 Model B Rev 1.1
    serial: 100000004c4deb14
    width: 64 bits
    capabilities: smp cp15_barrier setend swp tagged_addr_disabled
  *-core
       description: Motherboard
       physical id: 0
     *-cpu:0
          description: CPU
          product: cpu
          physical id: 1
          bus info: cpu@0
          size: 1500MHz
          capacity: 1500MHz
          capabilities: fp asimd evtstrm crc32 cpuid cpufreq
     *-cpu:1
          description: CPU
          product: cpu
          physical id: 2
          bus info: cpu@1
          size: 1500MHz
          capacity: 1500MHz
          capabilities: fp asimd evtstrm crc32 cpuid cpufreq
     *-cpu:2
          description: CPU
          product: cpu
          physical id: 3
          bus info: cpu@2
          size: 1500MHz
          capacity: 1500MHz
          capabilities: fp asimd evtstrm crc32 cpuid cpufreq
     *-cpu:3
          description: CPU
          product: cpu
          physical id: 4
          bus info: cpu@3
          size: 1500MHz
          capacity: 1500MHz
          capabilities: fp asimd evtstrm crc32 cpuid cpufreq
     *-memory
          description: System memory
          physical id: 5
          size: 3793MiB
     *-pci
          description: PCI bridge
          product: BCM2711 PCIe Bridge
          vendor: Broadcom Inc. and subsidiaries
          physical id: 0
          bus info: pci@0000:00:00.0
          version: 10
          width: 32 bits
          clock: 33MHz
          capabilities: pci pm pciexpress normal_decode bus_master cap_list
          resources: memory:600000000-6000fffff
        *-usb
             description: USB controller
             product: VL805 USB 3.0 Host Controller
             vendor: VIA Technologies, Inc.
             physical id: 0
             bus info: pci@0000:01:00.0
             version: 01
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi pciexpress xhci bus_master cap_list
             configuration: driver=xhci_hcd latency=0
             resources: irq:46 memory:600000000-600000fff
           *-usbhost:0
                product: xHCI Host Controller
                vendor: Linux 5.15.32-v8+ xhci-hcd
                physical id: 0
                bus info: usb@1
                logical name: usb1
                version: 5.15
                capabilities: usb-2.00
                configuration: driver=hub slots=1 speed=480Mbit/s
              *-usb
                   description: USB hub
                   product: USB2.0 Hub
                   vendor: VIA Labs, Inc.
                   physical id: 1
                   bus info: usb@1:1
                   version: 4.21
                   capabilities: usb-2.10
                   configuration: driver=hub maxpower=100mA slots=4 speed=480Mbit/s
           *-usbhost:1
                product: xHCI Host Controller
                vendor: Linux 5.15.32-v8+ xhci-hcd
                physical id: 1
                bus info: usb@2
                logical name: usb2
                version: 5.15
                capabilities: usb-3.00
                configuration: driver=hub slots=4 speed=5000Mbit/s
              *-usb
                   description: Mass storage device
                   product: AXAGON USB3.0 External HDD
                   vendor: AXAGON USB3.0 External HDD
                   physical id: 1
                   bus info: usb@2:1
                   logical name: scsi0
                   version: 1.00
                   serial: AAAABBBBCCCC0003
                   capabilities: usb-3.10 scsi
                   configuration: driver=uas speed=5000Mbit/s
                 *-disk
                      description: SCSI Disk
                      product: Force LS SSD
                      vendor: Corsair
                      physical id: 0.0.0
                      bus info: scsi@0:0.0.0
                      logical name: /dev/sda
                      version: 0
                      serial: 3000CCCCBBBBAAAA
                      size: 111GiB (120GB)
                      capabilities: gpt-1.00 partitioned partitioned:gpt
                      configuration: ansiversion=6 guid=91639514-c964-49a3-bb8c-90f7d0a1c96e logicalsectorsize=512 sectorsize=512
                    *-volume:0
                         description: Windows FAT volume
                         vendor: mkfs.fat
                         physical id: 1
                         bus info: scsi@0:0.0.0,1
                         logical name: /dev/sda1
                         logical name: /boot
                         version: FAT16
                         serial: 048b-0011
                         size: 253MiB
                         capacity: 255MiB
                         capabilities: fat initialized
                         configuration: FATs=2 filesystem=fat label=boot mount.fstype=vfat mount.options=rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,flush,errors=remount-ro name=Microsoft basic data state=mounted
                    *-volume:1
                         description: EXT4 volume
                         vendor: Linux
                         physical id: 2
                         bus info: scsi@0:0.0.0,2
                         logical name: /dev/sda2
                         version: 1.0
                         serial: b79fd5d8-0ff9-47f0-9d99-33e8c30786af
                         size: 39GiB
                         capacity: 39GiB
                         capabilities: journaled extended_attributes large_files huge_files dir_nlink recover 64bit extents ext4 ext2 initialized
                         configuration: created=2022-06-06 19:25:35 filesystem=ext4 label=rootfs lastmountpoint=/ modified=2022-06-07 22:28:48 mounted=2022-06-10 18:17:01 name=Linux filesystem state=clean
                    *-volume:2
                         description: EFI partition
                         physical id: 3
                         bus info: scsi@0:0.0.0,3
                         logical name: /dev/sda3
                         serial: 30f1e2a0-24bb-4f21-929c-60f405bd6719
                         capacity: 71GiB
                         configuration: name=Linux filesystem
  *-network:0
       description: Ethernet interface
       physical id: 1
       logical name: eth0
       serial: dc:a6:32:31:02:13
       size: 1Gbit/s
       capacity: 1Gbit/s
       capabilities: ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=bcmgenet driverversion=5.15.32-v8+ duplex=full ip=10.80.0.2 link=yes multicast=yes port=twisted pair speed=1Gbit/s
  *-network:1
       description: Ethernet interface
       physical id: 2
       logical name: br-482150e1dbb6
       serial: 02:42:bc:9e:88:86
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.20.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:2
       description: Ethernet interface
       physical id: 3
       logical name: docker0
       serial: 02:42:67:67:d3:22
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.17.0.1 link=no multicast=yes
  *-network:3
       description: Ethernet interface
       physical id: 4
       logical name: br-7777bfeb3730
       serial: 02:42:50:bc:ca:c7
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.21.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:4
       description: Ethernet interface
       physical id: 5
       logical name: br-d9dca40fffc1
       serial: 02:42:c2:c6:5e:d0
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.24.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:5
       description: Ethernet interface
       physical id: 6
       logical name: br-e50cc143ca3d
       serial: 02:42:81:f8:5d:2f
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.23.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:6
       description: Ethernet interface
       physical id: 7
       logical name: br-e59a5e6c415e
       serial: 02:42:dd:20:96:9a
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.19.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:7
       description: Ethernet interface
       physical id: 8
       logical name: br-1d1c33aac596
       serial: 02:42:77:1a:ef:6e
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=bridge driverversion=2.3 firmware=N/A ip=172.18.0.1 link=yes multicast=yes speed=10Gbit/s
  *-network:8
       description: Ethernet interface
       physical id: 9
       logical name: veth27303d5
       serial: 36:84:56:77:44:4e
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.240.41 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:9
       description: Ethernet interface
       physical id: a
       logical name: veth419a0fd
       serial: 96:b5:bb:a7:d5:1f
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.69.105 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:10
       description: Ethernet interface
       physical id: b
       logical name: vethd0e141a
       serial: 62:5b:95:41:fa:67
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.237.119 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:11
       description: Ethernet interface
       physical id: c
       logical name: vethd824777
       serial: 82:78:97:db:21:f7
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.2.78 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:12
       description: Ethernet interface
       physical id: d
       logical name: veth7d5d161
       serial: 7a:48:fb:cf:0a:9c
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.55.254 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:13
       description: Ethernet interface
       physical id: e
       logical name: vethaed77d8
       serial: be:5d:df:01:26:45
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.91.13 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:14
       description: Ethernet interface
       physical id: f
       logical name: vethdc19914
       serial: be:59:17:46:01:d7
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.194.23 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:15
       description: Ethernet interface
       physical id: 10
       logical name: veth9c36494
       serial: 4e:af:25:e0:26:77
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.20.104 link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:16
       description: Ethernet interface
       physical id: 11
       logical name: veth2e1e7c9
       serial: 06:6a:7d:f7:07:39
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=169.254.216.214 link=yes multicast=yes port=twisted pair speed=10Gbit/s

sudo gdisk -l /dev/sda

GPT fdisk (gdisk) version 1.0.6

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 234441648 sectors, 111.8 GiB
Model: Force LS SSD
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 91639514-C964-49A3-BB8C-90F7D0A1C96E
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 6108 sectors (3.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          524288   255.0 MiB   0700  Microsoft basic data
   2          526336        83886080   39.7 GiB    8300  Linux filesystem
   3        83888128       234441614   71.8 GiB    8300  Linux filesystem

Logs

... (normal kernel messages after booting)
[   49.775023] br-7777bfeb3730: port 1(veth0489e80) entered disabled state
[   49.776023] eth0: renamed from veth9d11393
[   49.816458] br-7777bfeb3730: port 1(veth0489e80) entered blocking state
[   49.816483] br-7777bfeb3730: port 1(veth0489e80) entered forwarding state
[164061.642655] sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD OUT 
[164061.650878] sd 0:0:0:0: [sda] tag#3 CDB: opcode=0x2a 2a 00 05 d8 e5 80 00 00 10 00
... (messages like these repeating for 16 seconds)
[164077.002902] sd 0:0:0:0: [sda] tag#18 uas_eh_abort_handler 0 uas-tag 19 inflight: CMD OUT 
[164077.011301] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 0a 4b 9d 60 00 00 90 00
[164077.046927] scsi host0: uas_eh_device_reset_handler start
[164077.046927] scsi host0: uas_eh_device_reset_handler start
[164077.187758] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[164077.187758] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[164077.215063] scsi host0: uas_eh_device_reset_handler success
[164077.215063] scsi host0: uas_eh_device_reset_handler success
[164084.137487] xhci_hcd 0000:01:00.0: WARNING: Host System Error
[164084.137487] xhci_hcd 0000:01:00.0: WARNING: Host System Error
[164089.279106] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
[164089.287150] xhci_hcd 0000:01:00.0: USBSTS: HCHalted HSE EINT
[164089.293004] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[164089.300916] xhci_hcd 0000:01:00.0: HC died; cleaning up
[164089.306352] usb 1-1: USB disconnect, device number 2
[164089.279106] [164089.312578] usb 2-1: USB disconnect, device number 2
xhci_hcd[164089.318189] sd 0:0:0:0: [sda] tag#17 uas_zap_pending 0 uas-tag 2 inflight: CMD 
[164089.326243] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x2a 2a 00 05 d8 e5 80 00 00 10 00
[164089.334097] sd 0:0:0:0: [sda] tag#18 uas_zap_pending 0 uas-tag 3 inflight: CMD 
[164089.334113] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=57s
[164089.341594] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 05 d8 e5 90 00 00 10 00
[164089.341601] sd 0:0:0:0: [sda] tag#19 uas_zap_pending 0 uas-tag 4 inflight: CMD 
[164089.351572] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x2a 2a 00 05 d8 e5 80 00 00 10 00
[164089.359410] sd 0:0:0:0: [sda] tag#19 CDB: opcode=0x2a 2a 00 02 4e 27 68 00 00 90 00
[164089.359416] sd 0:0:0:0: [sda] tag#20 uas_zap_pending 0 uas-tag 5 inflight: CMD 
[164089.359420] sd 0:0:0:0: [sda] tag#20 CDB: opcode=0x2a 2a 00 0a 4b a2 c0 00 00 90 00
[164089.359428] sd 0:0:0:0: [sda] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=56s
[164089.359437] sd 0:0:0:0: [sda] tag#19 CDB: opcode=0x2a 2a 00 02 4e 27 68 00 00 90 00
[164089.359444] blk_update_request: I/O error, dev sda, sector 38676328 op 0x1:(WRITE) flags 0x800 phys_seg 18 prio class 0
[164089.359552] Aborting journal on device sda2-8.
[164089.366921] blk_update_request: I/O error, dev sda, sector 98100608 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
[164089.374762] sd 0:0:0:0: [sda] tag#21 uas_zap_pending 0 uas-tag 6 inflight: CMD 
[164089.374777] sd 0:0:0:0: [sda] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=54s
[164089.374786] sd 0:0:0:0: [sda] tag#20 CDB: opcode=0x2a 2a 00 0a 4b a2 c0 00 00 90 00
[164089.374792] blk_update_request: I/O error, dev sda, sector 172729024 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
[164089.374805] zio pool=tank vdev=/dev/sda3 error=5 type=2 offset=45486538752 size=73728 flags=180880
[164089.382610] zio pool=tank vdev=/dev/sda3 error=5 type=2 offset=7276789760 size=8192 flags=184880
[164089.390099] sd 0:0:0:0: [sda] tag#21 CDB: opcode=0x2a 2a 00 04 51 e7 38 00 00 08 00
[164089.390108] sd 0:0:0:0: [sda] tag#22 uas_zap_pending 0 uas-tag 7 inflight: CMD 
[164089.512187] sd 0:0:0:0: [sda] tag#22 CDB: opcode=0x2a 2a 00 04 51 e7 20 00 00 08 00
[164089.520039] sd 0:0:0:0: [sda] tag#23 uas_zap_pending 0 uas-tag 8 inflight: CMD 
[164089.527535] sd 0:0:0:0: [sda] tag#23 CDB: opcode=0x2a 2a 00 04 51 e6 f8 00 00 08 00

Additional context

Based on the posts under https://github.com/raspberrypi/linux/issues/4930, the two issues maybe connected.

tomikaa87 avatar Jun 10 '22 18:06 tomikaa87

I've made another try with quirks to disable UAS for the case with usb-storage.quirks=174c:55aa:u, but it still crashes.

The kernel log is a bit different this time, I can't see the message about the unresponsive xHCI host:

[   33.758428] cam-dummy-reg: disabling
[   40.680442] br-7777bfeb3730: port 1(veth7334205) entered disabled state
[   40.681434] eth0: renamed from veth495c423
[   40.703761] br-7777bfeb3730: port 1(veth7334205) entered blocking state
[   40.703787] br-7777bfeb3730: port 1(veth7334205) entered forwarding state
[76306.695335] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[76306.716668] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x03 driverbyte=DRIVER_OK cmd_age=31s
[76306.716696] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 0c 28 e6 c0 00 00 30 00
[76306.716709] blk_update_request: I/O error, dev sda, sector 204007104 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 0
[76306.716736] zio pool=tank vdev=/dev/sda3 error=5 type=2 offset=61500915712 size=24576 flags=40080c80
[76487.372365] sd 0:0:0:0: [sda] tag#0 timing out command, waited 180s
[76487.372414] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=180s
[76487.372432] sd 0:0:0:0: [sda] tag#0 Sense Key : 0x2 [current] 
[76487.372448] sd 0:0:0:0: [sda] tag#0 ASC=0x4 ASCQ=0x1 
[76487.372465] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 1a c2 40 00 00 80 00
[76487.372479] blk_update_request: I/O error, dev sda, sector 1753664 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
[76488.829547] INFO: task jbd2/sda2-8:98 blocked for more than 120 seconds.
[76488.829579]       Tainted: P         C O      5.15.32-v8+ #1538
[76488.829590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[76488.829599] task:jbd2/sda2-8     state:D stack:    0 pid:   98 ppid:     2 flags:0x00000008
[76488.829623] Call trace:
[76488.829630]  __switch_to+0x114/0x1c8
[76488.829657]  __schedule+0x30c/0x8e8
[76488.829674]  schedule+0x48/0x100
[76488.829688]  io_schedule+0x24/0xa8
[76488.829703]  blk_mq_get_tag+0x1b8/0x310
[76488.829715]  __blk_mq_alloc_request+0x68/0x158
[76488.829731]  blk_mq_submit_bio+0x12c/0x5a0
[76488.829741]  __submit_bio+0x22c/0x280
[76488.829752]  submit_bio_noacct+0x1ec/0x228
[76488.829763]  submit_bio+0x3c/0x138
[76488.829773]  submit_bh_wbc+0x174/0x1f0
[76488.829789]  submit_bh+0x20/0x30
[76488.829803]  jbd2_journal_commit_transaction+0xc68/0x1bd0
[76488.829817]  kjournald2+0xd4/0x260
[76488.829831]  kthread+0x140/0x158
[76488.829842]  ret_from_fork+0x10/0x20

What's strange that the last two crash occurred after around 77000 seconds of runtime.

This is the log for the previous crash (before disabling UAS):

[   33.758593] cam-dummy-reg: disabling
[   40.099279] br-7777bfeb3730: port 1(vethed89085) entered disabled state
[   40.100249] eth0: renamed from veth44025ea
[   40.124528] br-7777bfeb3730: port 1(vethed89085) entered blocking state
[   40.124560] br-7777bfeb3730: port 1(vethed89085) entered forwarding state
[77254.805315] sd 0:0:0:0: [sda] tag#22 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD OUT 
[77254.805335] sd 0:0:0:0: [sda] tag#22 CDB: opcode=0x2a 2a 00 00 22 f9 b8 00 00 80 00
[77254.805695] sd 0:0:0:0: [sda] tag#19 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT 
[77254.805701] sd 0:0:0:0: [sda] tag#19 CDB: opcode=0x2a 2a 00 02 52 12 98 00 00 a0 00
[77254.805990] sd 0:0:0:0: [sda] tag#18 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT 
[77254.805996] sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 0a 61 0f f0 00 00 50 00
[77254.806292] sd 0:0:0:0: [sda] tag#17 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT 
[77254.806298] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x2a 2a 00 08 6b c7 e0 00 00 10 00
[77254.806598] sd 0:0:0:0: [sda] tag#16 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT 
[77254.806604] sd 0:0:0:0: [sda] tag#16 CDB: opcode=0x2a 2a 00 08 80 2d 00 00 00 20 00
[77258.133358] sd 0:0:0:0: [sda] tag#23 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT 
[77258.133379] sd 0:0:0:0: [sda] tag#23 CDB: opcode=0x2a 2a 00 0a 61 0c d0 00 00 50 00
[77258.185364] sd 0:0:0:0: [sda] tag#20 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT 
[77258.185382] sd 0:0:0:0: [sda] tag#20 CDB: opcode=0x2a 2a 00 04 21 9c a0 00 00 08 00
[77258.201373] scsi host0: uas_eh_device_reset_handler start
[77258.330222] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[77258.353607] scsi host0: uas_eh_device_reset_handler success
[77265.267212] xhci_hcd 0000:01:00.0: WARNING: Host System Error
[77270.4[77270.415926] EXT4-fs (sda2): Delayed block allocation failed for inode 1968773 at logical offset 14997 with max blocks 1 with error 30
13573] x[77270.416517] EXT4-fs error (device sda2) in ext4_reserve_inode_write:5692: Journal has aborted
[77270.416705] EXT4-fs error (device sda2): ext4_dirty_inode:5888: inode #2104291: comm rs:main Q:Reg: mark_inode_dirty error
hci_hcd [77270.416846] EXT4-fs error (device sda2) in ext4_dirty_inode:5889: Journal has aborted
0000:01:[77270.419676] EXT4-fs error (device sda2) in __ext4_new_inode:1093: Journal has aborted
[77270.419953] EXT4-fs error (device sda2) in add_dirent_to_buf:2118: Journal has aborted
[77270.420105] EXT4-fs error (device sda2): ext4_check_bdev_write_error:218: comm containerd-shim: Error while async write back metadata
[77270.420269] EXT4-fs error (device sda2) in ext4_create:2759: Journal has aborted
00.0: xH[77270.420408] EXT4-fs error (device sda2) in ext4_orphan_add:188: Journal has aborted
CI host [77270.420539] EXT4-fs error (device sda2) in ext4_create:2759: Journal has aborted
not resp[77270.420673] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm containerd-shim: Detected aborted journal
onding to stop endpoint command.
[77270.413599] xhci_hcd 0000:01:00.0: USBSTS: HCHalted HSE EINT
[77270.413639] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[77270.413707] xhci_hcd 0000:01:00.0: HC died; cleaning up

Probably not a coincidence and I'd assume it depends on the amount of data written during that time.

tomikaa87 avatar Jun 11 '22 20:06 tomikaa87

Another test, another crash. I've tried a different case, it's an Axagon EE25-XA3, which has an ASMedia ASM1153 (174c:1153). It doesn't support UASP, so I've used BOT without enabling quirks. The result is the same, usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd. I didn't see the xHCI host controller not responding, assume dead message in dmesg however. The system ran for about 170000 seconds, more than double compared to the previous runs.

I'd like to help more by not just looking for dmesg output. How should I progress further?

I'm doing another test at the moment and replaced the SSD with an HDD in the XA6 case to see if it changes anything. I've also requested a firmware update from the case vendor, but since another version of the case has the same issue, I don't think it'll behave differently.

tomikaa87 avatar Jun 14 '22 05:06 tomikaa87

What happens if you use a powered USB3 hub between the Pi and the SSD?

P33M avatar Jun 14 '22 08:06 P33M

What happens if you use a powered USB3 hub between the Pi and the SSD?

I haven't tried it yet, this will be the next after the HDD test.

After that, I could try disabling ASPM for the USB host controller via sudo setpci -s 01:00.0 0xd4.B=0x40.

Another possibility could be disabling LPM via echo 0 | sudo tee /sys/bus/usb/devices/usb2/2-0\:1.0/usb2-port1/usb3_lpm_permit. It's currently in u1_u2 mode, but I can try u1 and u2 only too.

Edit: the HDD test is still running, it's still stable after 1 day and 12 hours. I'm going to let it run for a while to make sure it stays stable.

Edit2: HDD test finished, it ran stable for 4 days. I'm running the external hub test with the SSD now. The hub in question is a RaidSonic Icy Box IB-AC6113.

Output of lsusb -tv looks like this now:

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        ID 2109:0812 VIA Labs, Inc. VL812 Hub
        |__ Port 1: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M
            ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
        |__ Port 4: Dev 4, If 0, Class=Hub, Driver=hub/4p, 5000M
            ID 2109:0812 VIA Labs, Inc. VL812 Hub
            |__ Port 4: Dev 5, If 0, Class=Hub, Driver=hub/4p, 5000M
                ID 2109:0812 VIA Labs, Inc. VL812 Hub
                |__ Port 4: Dev 6, If 0, Class=Hub, Driver=hub/4p, 5000M
                    ID 2109:0812 VIA Labs, Inc. VL812 Hub
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        ID 2109:3431 VIA Labs, Inc. Hub
        |__ Port 1: Dev 3, If 0, Class=Hub, Driver=hub/4p, 480M
            ID 2109:2812 VIA Labs, Inc. VL812 Hub
            |__ Port 4: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
                ID 2109:2812 VIA Labs, Inc. VL812 Hub
                |__ Port 4: Dev 5, If 0, Class=Hub, Driver=hub/4p, 480M
                    ID 2109:2812 VIA Labs, Inc. VL812 Hub
                    |__ Port 4: Dev 6, If 0, Class=Hub, Driver=hub/4p, 480M
                        ID 2109:2812 VIA Labs, Inc. VL812 Hub

@P33M It died after 20 hours. The external hub doesn't seem to make a difference.

Kernel log:

[   53.348114] br-7777bfeb3730: port 1(vethe5db6c3) entered forwarding state
[71777.088275] sd 0:0:0:0: [sda] tag#2 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD OUT
[71777.088295] sd 0:0:0:0: [sda] tag#2 CDB: opcode=0x2a 2a 00 02 4d 63 20 00 00 b8 00
[71777.144283] sd 0:0:0:0: [sda] tag#27 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT
[71777.144303] sd 0:0:0:0: [sda] tag#27 CDB: opcode=0x2a 2a 00 01 36 00 f0 00 00 08 00
[71777.144612] sd 0:0:0:0: [sda] tag#26 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT
[71777.144619] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x2a 2a 00 01 36 00 d8 00 00 08 00
[71777.144929] sd 0:0:0:0: [sda] tag#25 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[71777.144934] sd 0:0:0:0: [sda] tag#25 CDB: opcode=0x2a 2a 00 01 36 00 90 00 00 28 00
[71777.145253] sd 0:0:0:0: [sda] tag#24 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
[71777.145258] sd 0:0:0:0: [sda] tag#24 CDB: opcode=0x2a 2a 00 01 8e 88 80 00 00 08 00
[71777.145566] sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[71777.145571] sd 0:0:0:0: [sda] tag#3 CDB: opcode=0x2a 2a 00 0b 67 2a f0 00 01 00 00
[71777.145890] sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT
[71777.145895] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 07 70 11 60 00 00 10 00
[71787.584476] sd 0:0:0:0: [sda] tag#1 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT
[71787.584507] sd 0:0:0:0: [sda] tag#1 CDB: opcode=0x2a 2a 00 01 1d b4 18 00 00 08 00
[71804.992697] sd 0:0:0:0: [sda] tag#15 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD OUT
[71804.992718] sd 0:0:0:0: [sda] tag#15 CDB: opcode=0x2a 2a 00 08 d0 58 80 00 00 10 00
[71804.993064] sd 0:0:0:0: [sda] tag#14 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD OUT
[71804.993070] sd 0:0:0:0: [sda] tag#14 CDB: opcode=0x2a 2a 00 08 d0 57 80 00 00 20 00
[71804.993372] sd 0:0:0:0: [sda] tag#13 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT
[71804.993377] sd 0:0:0:0: [sda] tag#13 CDB: opcode=0x2a 2a 00 08 d4 91 d0 00 00 50 00
[71806.564727] sd 0:0:0:0: [sda] tag#12 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD OUT
[71806.564746] sd 0:0:0:0: [sda] tag#12 CDB: opcode=0x2a 2a 00 08 d4 90 e0 00 00 50 00
[71806.588747] scsi host0: uas_eh_device_reset_handler start
[71806.677615] usb 2-1.1: reset SuperSpeed USB device number 3 using xhci_hcd
[71806.701724] scsi host0: uas_eh_device_reset_handler success
[71813.628015] xhci_hcd 0000:01:00.0: WARNING: Host System Error
[71818.804946] xhci_hcd [71818.807043] EXT4-fs error (device sda2) in ext4_reserve_inode_write:5692: Journal has aborted
0000:01:[71818.807517] EXT4-fs error (device sda2) in ext4_reserve_inode_write:5692: Journal has aborted
00.0: xH[71818.807720] EXT4-fs error (device sda2) in ext4_reserve_inode_write:5692: Journal has aborted
CI host [71818.807934] EXT4-fs error (device sda2): ext4_dirty_inode:5888: inode #922716: comm lighttpd: mark_inode_dirty error
[71818.808074] EXT4-fs error (device sda2): ext4_dirty_inode:5888: inode #1966103: comm systemd-journal: mark_inode_dirty error
[71818.808297] EXT4-fs error (device sda2) in __ext4_new_inode:1093: Journal has aborted
[71818.808437] EXT4-fs error (device sda2) in ext4_dirty_inode:5889: Journal has aborted
not resp[71818.808584] EXT4-fs error (device sda2) in ext4_dirty_inode:5889: Journal has aborted
onding t[71818.808734] EXT4-fs error (device sda2) in __ext4_new_inode:1093: Journal has aborted
o stop e[71818.808869] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm lighttpd: Detected aborted journal
ndpoint [71818.812992] Read-error on swap-device (8:0:65177736)
command.[71818.813018] Read-error on swap-device (8:0:65177744)

[71818[71818.813031] Read-error on swap-device (8:0:65177752)
.804970][71818.813042] Read-error on swap-device (8:0:65177760)
 xhci_hc[71818.813053] Read-error on swap-device (8:0:65177768)
d 0000:0[71818.813064] Read-error on swap-device (8:0:65177776)
1:00.0: [71818.813075] Read-error on swap-device (8:0:65177784)
USBSTS: [71818.813086] Read-error on swap-device (8:0:65177792)
HCHalted HSE EINT
[71818.805011] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[71818.805108] xhci_hcd 0000:01:00.0: HC died; cleaning up
[71818.805215] usb 1-1: USB disconnect, device number 2
[71818.805235] usb 1-1.1: USB disconnect, device number 3
[71818.805252] usb 1-1.1.4: USB disconnect, device number 4
[71818.805268] usb 1-1.1.4.4: USB disconnect, device number 5
[71818.805506] usb 1-1.1.4.4.4: USB disconnect, device number 6
[71818.806330] usb 2-1: USB disconnect, device number 2
[71818.806353] usb 2-1.1: USB disconnect, device number 3
[71818.806642] sd 0:0:0:0: [sda] tag#23 uas_zap_pending 0 uas-tag 2 inflight: CMD
[71818.806659] sd 0:0:0:0: [sda] tag#23 CDB: opcode=0x2a 2a 00 02 4d 63 20 00 00 b8 00
[71818.806674] sd 0:0:0:0: [sda] tag#20 [71819.017600] Read-error on swap-device (8:0:65177672)

I'm now trying with disabled LPM.

Edit3: it died with LPM disabled. Right now the only working alternative seems to be using an HDD in the case instead of an SSD.

@P33M what should I look at next?

tomikaa87 avatar Jun 14 '22 08:06 tomikaa87

It's really a strange issue. One would think that loading both the USB-SATA bridge and the USB controller should lead to a controller crash, but unfortunately not.

I've tried fio to stress the SSD with random 4K reads, but it didn't crash after a couple of hours. I used the following command line:

fio --name TEST --filename=temp.file --rw=randread --size=10g --io_size=100g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=64 --direct=1 --numjobs=32 --group_reporting

It doesn't just run fine, but easily produces 34k IOPS which seems reasonable for this configuration.

It's a significantly higher load that the Docker containers put on the system, yet it doesn't hang the xHCI controller.

The system also seems to be stable during a random 4K read-write test, but it's running only for a few minutes at the time I'm writing this. I use the following command:

fio --name TEST --filename=temp.file --rw=randrw --size=10g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=64 --group_reporting

Edit: R/W test didn't crash after 1:30 hours.

Edit2: I'm running an xHCI trace overnight, it might help if the system crashes.

tomikaa87 avatar Jun 30 '22 16:06 tomikaa87

I've finished the xHCI tracing. It crashed after 2 days.

I saw a few CPU:<cpu number> [LOST EVENTS] lines.

When the file system first detected the error, these event were logged:

z_wr_int-1516174 [003] d..4. 146840.479095: xhci_queue_trb: STREAM: Buffer 000000044d948000 length 4096 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-1516174 [003] d..4. 146840.479096: xhci_inc_enq: STREAM 00000000522b5f79: enq 0x0000000441b66d20(0x0000000441b66000) deq 0x0000000441b66ce0(0x0000000441b66000) segs 2 stream 2 free_trbs 497 bounce 1024 cycle 1
        z_wr_int-1516174 [003] d..4. 146840.479097: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 Stream 2
        z_wr_int-1516174 [003] d..3. 146840.479101: xhci_urb_enqueue: ep4out-bulk: urb 00000000f1504c87 pipe 3221357056 slot 2 length 0/32 sgs 0/0 stream 0 flags 00010100
        z_wr_int-1516174 [003] d..4. 146840.479102: xhci_queue_trb: BULK: Buffer 00000004a3219400 length 32 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-1516174 [003] d..4. 146840.479103: xhci_inc_enq: BULK 000000008743c171: enq 0x00000004419a7370(0x00000004419a7000) deq 0x00000004419a7360(0x00000004419a7000) segs 2 stream 0 free_trbs 500 bounce 1024 cycle 0
        z_wr_int-1516174 [003] d..4. 146840.479103: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 ep4out
        z_wr_int-410     [001] d..3. 146840.479510: xhci_urb_enqueue: ep3in-bulk: urb 000000004646c083 pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 1 flags 00010300
        z_wr_int-410     [001] d..4. 146840.479513: xhci_queue_trb: STREAM: Buffer 000000045f595500 length 112 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c
        z_wr_int-410     [001] d..4. 146840.479514: xhci_inc_enq: STREAM 0000000057d78eec: enq 0x000000044199e630(0x000000044199e000) deq 0x000000044199e620(0x000000044199e000) segs 2 stream 1 free_trbs 500 bounce 1024 cycle 1
        z_wr_int-410     [001] d..4. 146840.479515: xhci_ring_ep_doorbell: Ring doorbell for Slot  Stream 1
        z_wr_int-410     [001] d..3. 146840.479526: xhci_urb_enqueue: ep2out-bulk: urb 0000000007ed215d pipe 3221291520 slot 2 length 0/8192 sgs 2/2 stream 1 flags 00040000
        z_wr_int-410     [001] d..4. 146840.479528: xhci_queue_trb: STREAM: Buffer 0000000438cf4000 length 4096 TD size 4 intr 0 type 'Normal' flags b:i:i:C:s:i:e:c
        z_wr_int-410     [001] d..4. 146840.479528: xhci_inc_enq: STREAM 000000005820a4c6: enq 0x0000000441b63400(0x0000000441b63000) deq 0x0000000441b633f0(0x0000000441b63000) segs 2 stream 1 free_trbs 500 bounce 1024 cycle 1
        z_wr_int-410     [001] d..4. 146840.479529: xhci_queue_trb: STREAM: Buffer 0000000438cf5000 length 4096 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-410     [001] d..4. 146840.479530: xhci_inc_enq: STREAM 000000005820a4c6: enq 0x0000000441b63410(0x0000000441b63000) deq 0x0000000441b633f0(0x0000000441b63000) segs 2 stream 1 free_trbs 499 bounce 1024 cycle 1
        z_wr_int-410     [001] d..4. 146840.479531: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 Stream 1
        z_wr_int-410     [001] d..3. 146840.479533: xhci_urb_enqueue: ep4out-bulk: urb 00000000622f87c6 pipe 3221357056 slot 2 length 0/32 sgs 0/0 stream 0 flags 00010100
        z_wr_int-[146953.369538] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm kworker/u8:1: Detected aborted journal
410     [146953.369756] EXT4-fs (sda2): Delayed block allocation failed for inode 1968381 at logical offset 5790 with max blocks 1 with error 30
[146953.369775] EXT4-fs (sda2): This should not happen!! Data will be lost
[146953.369775] 
[146953.369801] EXT4-fs error (device sda2) in ext4_writepages:2792: Journal has aborted
[001] d.[146953.370397] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
.4. 1468[146953.370504] EXT4-fs (sda2): Remounting filesystem read-only
40.47953[146953.377427] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775628 pages, ino 1966211; err -30
4: xhci_[146953.383819] EXT4-fs error (device sda2): ext4_wait_block_bitmap:533: comm in:imklog: Cannot read block bitmap - block_group = 233, block_bitmap = 7340041
queue_tr[146953.383893] EXT4-fs error (device sda2): ext4_discard_preallocations:5000: comm in:imklog: Error -5 loading buddy information for 233
b: BULK:[146953.401929] EXT4-fs (sda2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 2120360, error -30)
 Buffer [146953.401989] EXT4-fs (sda2): failed to convert unwritten extents to written extents -- potential data loss!  (inode 2120362, error -30)
000000045f595680 length 32 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-410     [001] d..4. 146840.479535: xhci_inc_enq: BULK 000000008743c171: enq 0x00000004419a7380(0x00000004419a7000) deq 0x00000004419a7370(0x00000004419a7000) segs 2 stream 0 free_trbs 500 bounce 1024 cycle 0
        z_wr_int-410     [001] d..4. 146840.479537: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 ep4out
        z_wr_int-410     [001] d..3. 146840.480696: xhci_urb_enqueue: ep3in-bulk:[146953.535132] EXT4-fs error (device sda2): __ext4_get_inode_loc_noinmem:4366: inode #2116: block 1194: comm (md-udevd): unable to read itable block
 urb 0000000050138434 pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 2 flags 00010300
        z_wr_int-410     [001] d..4. 146840.480699: xhci_queue_trb: STREAM: Buffer 000000045f595280 length 112 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
        z_wr_int-410     [001] d..4. 146840.480700: xhci_inc_enq: STREAM 000000002b5d9d30: enq 0x000000044199caa0(0x000000044199c000) deq 0x000000044199ca90(0x000000044199c000) segs 2 stream 2 free_trbs 500 bounce 1024 cycle 0
        z_wr_int-410     [001] d..4. 146840.480701: xhci_ring_ep_doorbell: Ring doorbell for Slot  Stream 2
        z_wr_int-410   [146953.602895] EXT4-fs error (device sda2): __ext4_get_inode_loc_noinmem:4366: inode #2116: block 1194: comm (md-udevd): unable to read itable block
  [001] d..3. 146840.480710: xhci_urb_enqueue: ep2out-bulk: urb 0000000051b044ad pipe 3221291520 slot 2 length 0/8192 sgs 2/2 stream 2 flags 00040000
        z_wr_int-410     [001] d..4. 146840.480712: xhci_queue_trb: STREAM: Buffer 0000000438cf6000 length 4096 TD size 4 intr 0 type 'Normal' fl[146953.641957] EXT4-fs error (device sda2): __ext4_get_inode_loc_noinmem:4366: inode #2366516: block 9437667: comm (zed): unable to read itable block
ags b:i:i:C:s:i:e:c
        z_wr_int-410     [001] d..4. 146840.480713: xhci_inc_enq: STREAM 0000000052[146953.664436] EXT4-fs error (device sda2): __ext4_find_entry:1614: inode #2115: comm (md-udevd): reading directory lblock 0
2b5f79: enq 0x0000000441b66d30(0x0000000441b66000) deq 0x0000000441b66d20(0x0000000441b66000) segs 2 stream 2 free_trbs 500 bounce 1024 cycle 1
        z_wr_int-410     [001] d..4. 146840.480713: xhci_queue_trb: STREAM: Buffer 0000000438cf7000 length 4096 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-410     [001] d..4. 146840.480714: xhci_inc_enq: STREAM 00000000522b5f79: enq 0x0000000441b66d40(0x0000000441b66000) deq 0x0000000441b66d20(0x0000000441b66000) segs 2 stream 2 free_trbs 499 bounce 1024 cycle 1
        z_wr_[146953.725207] EXT4-fs error (device sda2): __ext4_find_entry:1614: inode #2115: comm (md-udevd): reading directory lblock 0
int-410     [001] d..4. 146840.480715: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 Stream 2
        z_wr_int-410     [001] d..3. 146840.480717: xhci_urb_enqueue: ep4out-bulk: urb 000000006a675781 pipe 3221357056 slot 2 length 0/32 sgs 0/0 stream 0 flags 00010100
        z_wr_int-410     [001] d..4. 146840.480720: xhci_queue_trb: BULK: Buffer 000000045f595700 length 32 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
        z_wr_int-410     [001] d..4. 146840.480721: xhci_inc_enq: BULK 000000008743c171: enq 0x00000004419a7390(0x00000004419a7000) deq 0x00000004419a7380(0x00000004419a7000) segs 2 stream 0 free_trbs 500 bounce 1024 cycle 0
        z_wr_int-410     [001] d..4. 146840.480721: xhci_ring_ep_doorbell: Ring doorbell for Slot 2 ep4out

After that, just before the controller died, I saw these:

          <idle>-0       [000] dNs2. 146953.345182: xhci_urb_giveback: ep3in-bulk: urb 0000000051b044ad pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 26 flags 00010300
          <idle>-0       [000] dNs2. 146953.345184: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 6, stream 27
          <idle>-0       [000] dNs2. 146953.345187: xhci_urb_giveback: ep3in-bulk: urb 00000000622f87c6 pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 27 flags 00010300
          <idle>-0       [000] dNs2. 146953.345188: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 6, stream 28
          <idle>-0       [000] dNs2. 146953.345191: xhci_urb_giveback: ep3in-bulk: urb 000000003a3b1c3e pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 28 flags 00010300
          <idle>-0       [000] dNs2. 146953.345192: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 6, stream 29
          <idle>-0       [000] dNs2. 146953.345197: xhci_urb_giveback: ep3in-bulk: urb 00000000cef798b4 pipe 3221324416 slot 2 length 0/112 sgs 0/0 stream 29 flags 00010300
          <idle>-0       [000] dNs2. 146953.345198: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 6, stream 30
          <idle>-0       [000] dNs2. 146953.345200: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 6, stream 31
          <idle>-0       [000] dNs2. 146953.345203: xhci_dbg_cancel_urb: Killing URBs for slot ID 2, ep index 7
          <idle>-0       [000] .Ns1. 146953.345238: xhci_dbg_cancel_urb: xHCI host controller is dead.
     kworker/0:1-1481649 [000] ..... 146953.347009: xhci_free_dev: RS 00000 high-speed Hub Ctx Entries 3 MEL 0 us Port# 1/4 [TT Slot 0 Port# 0 TTT 3 Intr 0] Addr 1 State configured
     kworker/0:1-1481649 [000] ..... 146953.347338: xhci_free_virt_device: vdev 0000000066dfecce ctx 44197f000 | 44197d000 fake_port 1 real_port 1 current_mel 0
     kworker/0:1-1481649 [000] ..... 146953.347341: xhci_ring_free: CTRL 0000000078f1104e: enq 0x0000000441980390(0x0000000441980000) deq 0x0000000441980390(0x0000000441980000) segs 2 stream 0 free_trbs 501 bounce 0 cycle 1
     kworker/0:1-1481649 [000] ..... 146953.347393: xhci_ring_free: INTR 000000004a8179e5: enq 0x0000000441989010(0x0000000441989000) deq 0x0000000441989010(0x0000000441989000) segs 2 stream 0 free_trbs 501 bounce 1 cycle 1
     kworker/0:1-1481649 [000] ..... 146954.125784: xhci_free_dev: RS 00000 super-speed Ctx Entries 8 MEL 2050 us Port# 2/0 [TT Slot 0 Port# 0 TTT 0 Intr 0] Addr 2 State configured
     kworker/0:1-1481649 [000] ..... 146954.126118: xhci_free_virt_device: vdev 000000008dedafd3 ctx 441983000 | 441982000 fake_port 1 real_port 2 current_mel 2050
     kworker/0:1-1481649 [000] ..... 146954.126120: xhci_ring_free: CTRL 0000000080657522: enq 0x0000000441984620(0x0000000441984000) deq 0x0000000441984620(0x0000000441984000) segs 2 stream 0 free_trbs 501 bounce 0 cycle 1
     kworker/0:1-1481649 [000] ..... 146954.126154: xhci_ring_free: BULK 000000001cf50f3b: enq 0x0000000441be0000(0x0000000441be0000) deq 0x0000000441be0000(0x0000000441be0000) segs 2 stream 0 free_trbs 501 bounce 1024 cycle 1
     kworker/0:1-1481649 [000] ..... 146954.126159: xhci_ring_free: STREAM 00000000f909bbb5: enq 0x0000000441b58000(0x0000000441b58000) deq 0x0000000441b58000(0x0000000441b58000) segs 2 stream 1 free_trbs 501 bounce 1024 cycle 1
     kworker/0:1-1481649 [000] ..... 146954.126170: xhci_ring_free: STREAM 00000000bcab0624: enq 0x0000000441b56000(0x0000000441b56000) deq 0x0000000441b56000(0x0000000441b56000) segs 2 stream 2 free_trbs 501 bounce 1024 cycle 1
     kworker/0:1-1481649 [000] ..... 146954.126179: xhci_ring_free: STREAM 00000000af07a6e6: enq 0x0000000441b53000(0x0000000441b53000) deq 0x0000000441b53000(0x0000000441b53000) segs 2 stream 3 free_trbs 501 bounce 1024 cycle 1

I've uploaded the full log file (1.3 GiB) zipped: https://www.icloud.com/iclouddrive/0fePUI0rqF_J_XGW_2BgkwORg#putty @P33M if you have time, can you please check it to see if there is anything suspicious?

tomikaa87 avatar Jul 02 '22 08:07 tomikaa87

It's very likely that this issue has to do something with the one that has been reported more than 2 years ago and still not resolved: https://github.com/raspberrypi/linux/issues/3404

tomikaa87 avatar Jul 04 '22 10:07 tomikaa87

Just wanted to chime in and say I'm having the exact same issue (so I won't post lenghy details). It seems we're using the exact same ASMedia chip as well.

Raspberry Pi 4B 8GB with external disk case connected to USB 3.0 port using a Kingston SV300S37A240G SSD.

# lsusb -tv
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
        ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge, ASM1153E SATA 6Gb/s bridge
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        ID 2109:3431 VIA Labs, Inc. Hub

After the error happens, unplugging/plugging the disk makes no difference (the disk is gone and nothing is detected or shows up in dmesg). I have to reboot the device to get the disk detected again.

gtirloni avatar Jul 08 '22 10:07 gtirloni

@gtirloni Can you try it with a powered USB 3.0 hub just to be sure it's not power related?

tomikaa87 avatar Jul 08 '22 12:07 tomikaa87

@tomikaa87 sure, just connected them with a powered USB 3.0 hub. It seems to be working and I'm sync'ing about 100GB to it. I'll report back soon.

gtirloni avatar Jul 08 '22 13:07 gtirloni

@gtirloni thanks. If it's really the same issue, it should not crash when you copy a large data set. It seems to be sensitive to short I/O operations issued during a longer period of time (1-2 days in my case). I couldn't make it crash by copying 200+ GBytes of data.

tomikaa87 avatar Jul 08 '22 15:07 tomikaa87

I did a fresh install of the previous LTS version of Ubuntu Server (20.04), which has a 5.4.0-1065-raspi #75-Ubuntu kernel, but the issue didn't go away. The system is now on an SD card.

journalctl has the following errors:

Jul 14 08:53:01 services kernel: xhci_hcd 0000:01:00.0: WARNING: Host System Error
Jul 14 08:53:06 services kernel: xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
Jul 14 08:53:06 services kernel: xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
Jul 14 08:53:06 services kernel: xhci_hcd 0000:01:00.0: HC died; cleaning up
Jul 14 08:53:06 services kernel: blk_update_request: I/O error, dev sda, sector 104459696 op 0x1:(WRITE) flags 0x700 phys_seg 32 prio class 0
Jul 14 08:53:06 services kernel: zio pool=tank vdev=/dev/sda3 error=5 type=2 offset=10532642816 size=131072 flags=184880

I can't see the drive even if I unplug and re-plug it.

tomikaa87 avatar Jul 14 '22 07:07 tomikaa87

I'm seeing similar error on my Raspberry pi 4B 8GB, connected to Seagate External HDD via Orico USB Hub(externally powered, uses chip VL817)

[ 1.422691] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1 [ 1.424799] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.15 [ 1.424838] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1 [ 1.424866] usb usb1: Product: xHCI Host Controller [ 1.424889] usb usb1: Manufacturer: Linux 5.15.32-v8+ xhci-hcd [ 1.424911] usb usb1: SerialNumber: 0000:01:00.0 [ 1.425645] hub 1-0:1.0: USB hub found [ 1.426519] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2 [ 1.426557] xhci_hcd 0000:01:00.0: Host supports USB 3.0 SuperSpeed [ 1.427021] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.15 [ 1.427057] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 [ 1.427085] usb usb2: Product: xHCI Host Controller [ 1.427107] usb usb2: Manufacturer: Linux 5.15.32-v8+ xhci-hcd [ 1.427130] usb usb2: SerialNumber: 0000:01:00.0 [ 1.427811] hub 2-0:1.0: USB hub found [ 1.430461] usbcore: registered new interface driver uas [ 1.430571] usbcore: registered new interface driver usb-storage [ 1.441388] usbcore: registered new interface driver usbhid [ 1.441412] usbhid: USB HID core driver [ 1.678436] usb 1-1: new high-speed USB device number 2 using xhci_hcd [ 1.829335] usb 1-1: New USB device found, idVendor=2109, idProduct=3431, bcdDevice= 4.21 [ 1.829394] usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0 [ 1.829422] usb 1-1: Product: USB2.0 Hub [ 1.831271] hub 1-1:1.0: USB hub found [ 1.954928] usb 2-1: new SuperSpeed USB device number 2 using xhci_hcd [ 1.976975] usb 2-1: New USB device found, idVendor=2109, idProduct=0817, bcdDevice= 2.14 [ 1.977034] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [ 1.977062] usb 2-1: Product: USB3.0 Hub
[ 1.977085] usb 2-1: Manufacturer: VIA Labs, Inc.
[ 1.980732] hub 2-1:1.0: USB hub found [ 2.126410] usb 1-1.1: new high-speed USB device number 3 using xhci_hcd [ 2.232502] usb 1-1.1: New USB device found, idVendor=2109, idProduct=2817, bcdDevice= 2.14 [ 2.232564] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [ 2.232593] usb 1-1.1: Product: USB2.0 Hub
[ 2.232617] usb 1-1.1: Manufacturer: VIA Labs, Inc.
[ 2.234472] hub 1-1.1:1.0: USB hub found [ 2.554785] usb 2-1.3: new SuperSpeed USB device number 3 using xhci_hcd [ 2.576506] usb 2-1.3: New USB device found, idVendor=0bc2, idProduct=231a, bcdDevice= 7.08 [ 2.576566] usb 2-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 2.576595] usb 2-1.3: Product: Expansion [ 2.576618] usb 2-1.3: Manufacturer: Seagate [ 2.576640] usb 2-1.3: SerialNumber: NAA3CXYQ [ 6.752599] usbcore: registered new interface driver brcmfmac [ 37.991141] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 83.930946] xhci_hcd 0000:01:00.0: USBSTS: HCHalted HSE EINT [ 83.931102] usb 1-1: USB disconnect, device number 2 [ 83.931111] usb 1-1.1: USB disconnect, device number 3 [ 83.931159] usb 2-1: USB disconnect, device number 2 [ 83.931166] usb 2-1.3: USB disconnect, device number 3

brian@parmesan:/ $ lsusb -tv /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M ID 1d6b:0003 Linux Foundation 3.0 root hub /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/1p, 480M ID 1d6b:0002 Linux Foundation 2.0 root hub

p.s. Its been 2 weeks I've been trying to connect Raspberry pi with my 4TB External drive, without the USB hub it works for little while but ultimately I see wifi dropping both 2.4GHz and 5GHz. With USB hub, this is what I see -_-

lucidpuppy avatar Jul 14 '22 18:07 lucidpuppy

@you-are-invited thanks for the info.

I tried to find images of the controller board your HDD has and probably found one that might match, and I see an ASMedia controller, but it's possible that your drive has a totally different type (JMicron for instance). Can you somehow check which one you have? Unfortunately the only way to be sure is to look at the chip on the controller board. If you can and want to do this, you should look at the biggest IC with the most legs/solder pads and write down the model number as well (a string that looks like ASM225CM or JMS567 etc.)

The strange thing is that I didn't have any issues when I put an HDD in the case, only with SSDs, but I can imagine that another controller might have problems with all kinds of disks.

tomikaa87 avatar Jul 14 '22 20:07 tomikaa87

Anyways, I followed this sticky:

https://forums.raspberrypi.com/viewtopic.php?f=28&t=245931 It doesn't crashes anymore. But I do see very frequent "reset SuperSpeed USB device number 3 using xhci_hcd"

also indepth post I found related to the same problem: https://forums.raspberrypi.com/viewtopic.php?t=258242&start=25

[ 1.453858] usbhid: USB HID core driver [ 1.690736] usb 1-1: new high-speed USB device number 2 using xhci_hcd [ 1.841618] usb 1-1: New USB device found, idVendor=2109, idProduct=3431, bcdDevice= 4.21 [ 1.841679] usb 1-1: New USB device strings: Mfr=0, Product=1, SerialNumber=0 [ 1.841706] usb 1-1: Product: USB2.0 Hub [ 1.843432] hub 1-1:1.0: USB hub found [ 1.967202] usb 2-1: new SuperSpeed USB device number 2 using xhci_hcd [ 1.989280] usb 2-1: New USB device found, idVendor=2109, idProduct=0817, bcdDevice= 2.14 [ 1.989350] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [ 1.989376] usb 2-1: Product: USB3.0 Hub
[ 1.989399] usb 2-1: Manufacturer: VIA Labs, Inc.
[ 1.992990] hub 2-1:1.0: USB hub found [ 2.138751] usb 1-1.1: new high-speed USB device number 3 using xhci_hcd [ 2.240809] usb 1-1.1: New USB device found, idVendor=2109, idProduct=2817, bcdDevice= 2.14 [ 2.240867] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [ 2.240896] usb 1-1.1: Product: USB2.0 Hub
[ 2.240920] usb 1-1.1: Manufacturer: VIA Labs, Inc.
[ 2.242800] hub 1-1.1:1.0: USB hub found [ 2.551104] usb 2-1.3: new SuperSpeed USB device number 3 using xhci_hcd [ 2.572853] usb 2-1.3: New USB device found, idVendor=0bc2, idProduct=231a, bcdDevice= 7.08 [ 2.572911] usb 2-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 2.572940] usb 2-1.3: Product: Expansion [ 2.572963] usb 2-1.3: Manufacturer: Seagate [ 2.572985] usb 2-1.3: SerialNumber: NAA3CXYQ [ 2.588884] usb 2-1.3: UAS is ignored for this device, using usb-storage instead [ 2.589042] usb 2-1.3: UAS is ignored for this device, using usb-storage instead [ 2.589074] usb-storage 2-1.3:1.0: USB Mass Storage device detected [ 2.589631] usb-storage 2-1.3:1.0: Quirks match for vid 0bc2 pid 231a: 800000 [ 2.589865] scsi host0: usb-storage 2-1.3:1.0 [ 7.715047] usbcore: registered new interface driver brcmfmac [ 8.959644] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 9.947690] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 10.359678] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 11.115280] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 11.563308] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 11.987289] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 12.423308] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 12.871338] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 13.287319] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 13.471323] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 13.891286] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 14.319326] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 14.775714] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 15.191309] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd [ 15.643347] usb 2-1.3: reset SuperSpeed USB device number 3 using xhci_hcd

Update1: I let it soak overnight, had some backup's and downloads running.

brian@parmesan:/media/jamun $ ls ls: reading directory '.': Input/output error

Doesn't look well!!

[ 1009.106954] usb 2-1.3: USB disconnect, device number 3 [ 1009.563120] usb 2-1.3: device not accepting address 3, error -22 [ 1010.059117] usb 2-1.3: device not accepting address 3, error -22 [ 1010.555119] usb 2-1.3: device not accepting address 3, error -22 [ 1011.051123] usb 2-1.3: device not accepting address 3, error -22 [ 1012.063745] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd [ 1012.639416] usb 2-1.3: new SuperSpeed USB device number 4 using xhci_hcd [ 1012.660716] usb 2-1.3: New USB device found, idVendor=0bc2, idProduct=231a, bcdDevice= 7.08 [ 1012.660731] usb 2-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 1012.660737] usb 2-1.3: Product: Expansion [ 1012.660742] usb 2-1.3: Manufacturer: Seagate [ 1012.660746] usb 2-1.3: SerialNumber: NAA3CXYQ [ 1012.675651] usb 2-1.3: UAS is ignored for this device, using usb-storage instead [ 1012.675701] usb 2-1.3: UAS is ignored for this device, using usb-storage instead [ 1012.675707] usb-storage 2-1.3:1.0: USB Mass Storage device detected [ 1012.675958] usb-storage 2-1.3:1.0: Quirks match for vid 0bc2 pid 231a: 800000 [ 1012.676034] scsi host1: usb-storage 2-1.3:1.0 [ 1013.995733] usb 2-1.3: reset SuperSpeed USB device number 4 using xhci_hcd [ 1014.543722] usb 2-1.3: reset SuperSpeed USB device number 4 using xhci_hcd [ 1014.951731] usb 2-1.3: reset SuperSpeed USB device number 4 using xhci_hcd

lucidpuppy avatar Jul 15 '22 04:07 lucidpuppy

@you-are-invited UAS quirks (essentially falling back to BOT protocol) doesn’t seem to help. I’ve tried it multiple times in various configurations, but at the end the xHCI controller died anyway.

tomikaa87 avatar Jul 15 '22 07:07 tomikaa87

One possibility is a malformed packet over PCIe is causing target or completer aborts (which the xhci controller generally can't recover from). One way in which this can happen is marginal silicon that needs more volts to be stable.

What happens if you specify over_voltage=2 in /boot/config.txt?

P33M avatar Jul 15 '22 14:07 P33M

What happens if you specify over_voltage=2 in /boot/config.txt?

I've tried that just forgot to mention, it didn't change anything other than the base temperature of the SoC. I also tried with forced turbo mode to see if CPU clock switching causes it, but nothing.

@P33M is there a possibility that this is a bug in the VL805 firmware? If there is, do you have a way to debug it and report it to VIA Labs so they can release an updated version?

tomikaa87 avatar Jul 15 '22 14:07 tomikaa87

I suppose the acid test would be to swap the Pi4 board for another, keeping all other components exactly the same (and the same VLI firmware version), and seeing if the crash still happens.

P33M avatar Jul 15 '22 14:07 P33M

That's doable, I have another Pi4 (rev 1.5, 8GB), but @you-are-invited ran into this problem with that same exact board version.

tomikaa87 avatar Jul 15 '22 15:07 tomikaa87

Just wanted to report that my RPI 4B 8GB has been running stable for days after connecting the SSD to a powered USB 3.0 hub.

gtirloni avatar Jul 15 '22 22:07 gtirloni

@P33M tried the 8GB Pi4, but no difference. Same VL805 firmware.

tomikaa87 avatar Jul 16 '22 10:07 tomikaa87

Well at least that should eliminate a hardware fault within the Pi as a cause. "Crashes randomly after 1-2 days" isn't going to be reproducible here.

One thing I note is that in the trace, both reads and writes are simultaneously queued after a period of infrequent activity. Are there FIO settings that cause large UAS queue depths for both read and write (32 tags for each), and does this make things crash faster?

P33M avatar Jul 18 '22 09:07 P33M

I'm trying with this, but nothing yet:

fio --name TEST --filename=temp.file --rw=randrw --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1024 --direct=1 --numjobs=16 --group_reporting

I/O depth is 1024 and I'd assume for randrw test it means both read and write queues.

@P33M unfortunately I didn't find a way to crash the system with FIO. I tried various IO depths ranging from 32 to 1024. Also experimented with numjobs, but the system stayed stable even with 128 jobs. I was absolutely hammering the disk but not a single UAS error was in dmesg. The Docker containers I use put a different load on it. The heaviest one is Grafana with Prometheus. It writes at 400 KB/s in average because Prometheus collects the data samples every 15 seconds. So it's not a constant load, but small bursts. I also tested the system by moving Prometheus's data to a network share and the USB controller still crashed.

tomikaa87 avatar Jul 18 '22 11:07 tomikaa87

I have met same problems. I connected a 3.0 hub with two 3.0 udisks to my broad and tried "dd if=/run/media/sdb1/b.log of=/run/media/sda1/11.log bs=2MB count=1024". And then problem happend. But I replace the 3.0 hub with a 2.0 hub, it works well.

yangxiao989 avatar Jul 19 '22 07:07 yangxiao989

here is the log:

[13186.058161] LOG xhci_urb_dequeue 1549
[13186.062877] LOG xhci_urb_dequeue 1549
[13191.165960] LOG xhci_stop_endpoint_command_watchdog 961
[13191.166040] LOG xhci_stop_endpoint_command_watchdog 961
[13191.174195] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
[13191.174537] hub 2-1:1.0: hub_ext_port_status failed (err = -22)
[13191.174591] usb 2-1-port3: cannot reset (err = -22)
[13191.174635] usb 2-1-port3: cannot reset (err = -22)
[13191.174676] usb 2-1-port3: cannot reset (err = -22)
[13191.174714] usb 2-1-port3: cannot reset (err = -22)
[13191.174752] usb 2-1-port3: cannot reset (err = -22)
[13191.174771] usb 2-1-port3: Cannot enable. Maybe the USB cable is bad?
[13191.174814] usb 2-1-port3: cannot disable (err = -22)
[13191.174876] hub 2-1:1.0: hub_ext_port_status failed (err = -22)
[13191.174916] usb 2-1-port3: cannot reset (err = -22)
[13191.174953] usb 2-1-port3: cannot reset (err = -22)
[13191.174990] usb 2-1-port3: cannot reset (err = -22)
[13191.175025] usb 2-1-port3: cannot reset (err = -22)
[13191.175063] usb 2-1-port3: cannot reset (err = -22)
[13191.175082] usb 2-1-port3: Cannot enable. Maybe the USB cable is bad?
[13191.175103] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
[13191.175121] usb 2-1-port3: cannot disable (err = -22)
[13191.175183] hub 2-1:1.0: hub_ext_port_status failed (err = -22)
[13191.294601] print_req_error: I/O error, dev sdb, sector 18902824
[13191.338621] print_req_error: I/O error, dev sda, sector 56040064
[13191.348613] FAT-fs (sda1): FAT read failed (blocknr 13956)
[13191.352478] print_req_error: I/O error, dev sda, sector 56042112
[13191.424874] FAT-fs (sda1): FAT read failed (blocknr 13956)
[13191.438964] FAT-fs (sdb1): FAT read failed (blocknr 8050)
[13191.444686] FAT-fs (sdb1): FAT read failed (blocknr 8050)
[13191.450351] FAT-fs (sdb1): FAT read failed (blocknr 8050)
[13191.455816] FAT-fs (sdb1): FAT read failed (blocknr 8050)
[13191.461251] FAT-fs (sdb1): FAT read failed (blocknr 8050)
[13191.466712] FAT-fs (sdb1): FAT read failed (blocknr 8050)

yangxiao989 avatar Jul 19 '22 07:07 yangxiao989

@yangxiao989 can you show us the output of lsusb -tv, sudo rpi-eeprom-update, uname -a, lshw and usb-devices commands?

tomikaa87 avatar Jul 19 '22 08:07 tomikaa87

@yangxiao989 can you show us the output of lsusb -tv, sudo rpi-eeprom-update, uname -a, lshw and usb-devices commands?

Sry, I don't have a raspberrypi. I just google for the similiar err and then found this page. But when I tried 'lsusb' after xhci died, only usb controller can be seen while hub or udisks were dismissed.

yangxiao989 avatar Jul 20 '22 09:07 yangxiao989

@yangxiao989 can you show us the output of lsusb -tv, sudo rpi-eeprom-update, uname -a, lshw and usb-devices commands?

Sry, I don't have a raspberrypi. I just google for the similiar err and then found this page. But when I tried 'lsusb' after xhci died, only usb controller can be seen while hub or udisks were dismissed.

This repo and tracker is specifically for Raspberry Pi problems only.

JamesH65 avatar Jul 20 '22 09:07 JamesH65

But in this case it is interesting to see similar error messages on a different system.

pelwell avatar Jul 20 '22 09:07 pelwell