Unable to see nvme drive Identify Controller failed
Bug Report
Unable to see nvme drive when applying config:
talosctl -n <ip> disks --insecure
--blank--
Description
As per the logs below, the node boots, then hangs for 60s, is unable to discover the drive and then proceeds. Ironically the node has just booted from the drive it is unable to discover.
I have tried this with three different drives and have a summary below. I upgraded the firmware of the problematic drive to no avail.
Each drive was flashed with https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4-arm64.img.xz according to the instructions at https://www.talos.dev/v1.1/talos-guides/install/single-board-computers/rockpi_4/.
The node always boots from the same SPI u-boot boot loader.
- Drive A, product SA2000M8/1000G, revision 9907307-008.A01G, firmware S5Z42105: works
- Drive B, product SA2000M8/1000G, revision 9907307-008.A01G, firmware S5Z42105: occasional issues
- Drive B, product SA2000M8/1000G, revision 9907307-008.A01G, firmware S5Z42109: occasional issues
- Drive C, product SA2000M8/1000G, revision 9907307-008.A02G, firmware S5Z44106: works
Logs
[ 51.984169] random: crng init done
[ 64.480306] nvme nvme0: I/O 0 QID 0 timeout, disable controller
[ 64.588115] nvme nvme0: Device shutdown incomplete: abort shutdown
[ 64.589513] nvme nvme0: Identify Controller failed (-4)
[ 64.590150] nvme nvme0: Removing after probe failure status: -5
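
For reference, the negative numbers in the last two log lines are Linux errno values (4 is EINTR, 5 is EIO). Below is a minimal sketch, assuming the kernel log has been saved as plain text, that pulls out the nvme lines and decodes those codes. The function name `nvme_events` and the two-entry errno table are illustrative only and are not part of Talos or the kernel.

```python
import re

# Matches kernel lines such as:
#   [ 64.589513] nvme nvme0: Identify Controller failed (-4)
NVME_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme\s+(?P<dev>nvme\d+):\s+(?P<msg>.*)"
)
# A trailing negative number, optionally followed by ")", is treated as an errno.
TRAILING_ERRNO = re.compile(r"(-\d+)\)?\s*$")

# Only the two codes seen in this report; extend as needed.
ERRNO_NAMES = {4: "EINTR", 5: "EIO"}


def nvme_events(log_text):
    """Return one dict per nvme kernel log line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = NVME_LINE.search(line)
        if not match:
            continue  # not an nvme line
        code = TRAILING_ERRNO.search(match.group("msg"))
        errno_value = -int(code.group(1)) if code else None
        events.append(
            {
                "timestamp": float(match.group("ts")),
                "device": match.group("dev"),
                "message": match.group("msg").strip(),
                "errno": errno_value,
                "name": ERRNO_NAMES.get(errno_value),
            }
        )
    return events


if __name__ == "__main__":
    sample = """\
[ 51.984169] random: crng init done
[ 64.589513] nvme nvme0: Identify Controller failed (-4)
[ 64.590150] nvme nvme0: Removing after probe failure status: -5
"""
    for event in nvme_events(sample):
        print(event["timestamp"], event["device"], event["name"], event["message"])
```

Running this over logs from a good boot and a bad boot of the same node would make it easier to compare when the probe fails relative to boot time.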

Environment
- Talos version: 1.1.0, https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4-arm64.img.xz
- Kubernetes version: N/A
- Platform: Rock Pi 4A https://www.talos.dev/v1.1/talos-guides/install/single-board-computers/rockpi_4/
Quick question:
How are you powering the Rock Pi? Some NVMe devices need enough power, and a low-power situation might prevent the drive from initializing correctly. I have a Rock Pi connected to a Samsung EVO SSD and powered by the Rock Pi PoE HAT.
No worries. I'm also very open to suggestions that the drive or board is defective. I have a new SSD arriving next week to test with.
The curious thing, though, is that I have 5 nodes with identical hardware models (SSD revisions vary between nodes), and it seems that only this SSD installed in this node reproduces the problem.
If I install the problematic SSD in another node, it works fine most of the time. Using another SSD in the problematic node also works fine most of the time. I say most of the time because the issue still occasionally reproduces in these configurations.
Each node is as follows:
- Rock Pi 4A v1.4 4GB with SPI u-boot flashed as per the Talos instructions
- Kingston A2000 SSD
- 20W USB-C PD (HEYMIX 20W PD Charger)
- Rock Pi 4 heatsink and M.2 adapter board