
nvme0: controller is down (additional log, >1 year worth) now runs hot

Open graemev opened this issue 11 months ago • 8 comments

Describe the bug

I raised this via Debian's reportbug(1); it was returned as "Closing as this is not a Debian system but running on a derivative."

This is simply an attempt to provide more logs of what is likely a 3 year old problem.

Two points of interest: 1: it ran without issues for over a year (same hardware, limited use, apt-get update on most uses); 2: without a power cycle (but with a reboot) it produces different errors [dates noted in the attached syslog].

I'd anticipate this gets merged with an existing bug (just to add the logs)

FYI: some reading I did around this suggests that Microsoft, in Windows 10, only used the deepest NVMe power-saving mode while the system was suspended. It appears the RPi allows this mode while running normally (i.e. it allows a latency larger than it can actually accept during normal running), so I'm guessing the hardware gets little testing in these modes.

bug-report-sysinfo.txt

reportbug-linux-image-6.6.62+rpt-rpi-2712-20250208114647-yanddw3j.txt

Since I've moved so much to external text files (thank goodness), I'll just add the key log lines:

Jan 28 14:50:33 argon kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jan 28 14:50:33 argon kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jan 28 14:50:33 argon kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 28 14:50:33 argon kernel: nvme 0000:01:00.0: enabling device (0000 -> 0002)
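
For anyone wanting to check whether the two parameters the kernel suggests above are already in effect, here is a minimal sketch. It only looks for the exact strings from the log message on the running kernel's command line (/proc/cmdline); the helper name `missing_workarounds` is made up for illustration, and nothing here says whether the parameters actually cure the problem.

```python
# Sketch: report which of the kernel's suggested workaround parameters are
# absent from a kernel command line. The parameter strings are copied from
# the log message above; the helper name is hypothetical.
SUGGESTED = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def missing_workarounds(cmdline: str) -> list[str]:
    """Return the suggested parameters that do not appear in cmdline."""
    present = set(cmdline.split())
    return [param for param in SUGGESTED if param not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running Linux kernel.
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print("missing:", missing_workarounds(current))
```

An empty list means both parameters are already on the command line.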

I'm sure you've seen this a lot already; of interest are the dates and the behaviour of BOOT vs POWER CYCLE.

graemev avatar Feb 10 '25 13:02 graemev

When the template says "Describe the bug", you are meant to describe the bug. Much TL, so DR. Have you heard of pastebin et al?

pelwell avatar Feb 10 '25 14:02 pelwell

I was pointed here as a location to file a reportbug(1) report. The support lines in the Raspberry Pi variant of Debian say:

HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

This is clearly not correct, as Debian rejects these reports with:

"Closing as this is not a Debian system but running on a derivative."

So the support URL, and probably the mail address in reportbug(1), should be updated to point at the correct support channel.

FYI, I am not raising this because I expect help; I'm trying to get the log added to the appropriate bug. My first guess would be that the rpi maintainer will say it's not Pi-specific and pass it upstream, but I could be wrong; there may well be a unique Pi defect he's aware of.

(And yes, I agree, it looks a mess. This is not a very good interface for submitting a bug report, more like some kind of "online help & support page". Even "attach a file" would be more usable.)

graemev avatar Feb 10 '25 16:02 graemev

I'm trying to get the log added to the appropriate bug

You've reached the horse's mouth.

the rpi maintainer

That would be me, or one of a very small set of colleagues who will already have seen your report.

even "attach a file" would be more usable

You must have missed the part directly below the initial "Describe the bug" section where there's a paperclip icon and the words "Paste, drop or click to add files".

pelwell avatar Feb 10 '25 16:02 pelwell

Ahh, thanks, and yes, I did "miss the paper-clip". Feel free to delete the above; I'll submit a more readable version (reportbug(1) output) via the paperclip.

FYI: the box labelled "System*" says:

Copy and paste the results of the raspinfo command

The output of raspinfo(1) was too big to cut&paste into that box.

In my defence, I'd spent a while collecting data for this to submit to Debian. When they punted it back, I had trouble finding a better location for the report. Then, when I found somewhere (by asking directions), I found a GUI with the "damn fixed text boxes" one finds on marketing-type sites (and they usually lack "paperclips"). After 20+ failed submits I was just hacking large sections out of the report to try to fit it in.

STOP PRESS: it seems this site allows me to update previous entries, so I've done that :-)

graemev avatar Feb 10 '25 17:02 graemev

Just thought I'd chime in as this error caused me a bit of confusion. In my case, it was caused by a lack of shielding on the cable between my Pi5 and my M.2 NVMe hat. Wrapping a piece of aluminium foil around the cable resolved it (although I've now bought a proper shielded cable).

gordoste avatar Mar 18 '25 02:03 gordoste

What drive? What interface hardware? What power supply?

Jan 28 14:50:33 argon kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

You got here because your PCIe link stopped working and either the nvme driver got all 1s back from a read, or some critical command timed out. There are many reasons why that could happen.
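
To sift a long syslog for these events, a rough sketch along the following lines may help. It pulls the CSTS and PCI_STATUS values out of a "controller is down" line and flags the all-ones CSTS case described above. The regular expression is written against the log line quoted in this thread only, and the function name is hypothetical; other kernel versions may word the message differently.

```python
import re

# Pattern written against the log line quoted in this thread; other kernel
# versions may format the message differently (assumption).
LINE_RE = re.compile(
    r"controller is down; will reset: "
    r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)"
)


def parse_controller_down(line: str):
    """Return (csts, pci_status, csts_all_ones), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    csts = int(match.group(1), 16)
    pci_status = int(match.group(2), 16)
    # A 32-bit register reading back as all 1s is the case described above.
    return csts, pci_status, csts == 0xFFFFFFFF


sample = (
    "Jan 28 14:50:33 argon kernel: nvme nvme0: controller is down; "
    "will reset: CSTS=0xffffffff, PCI_STATUS=0x10"
)
print(parse_controller_down(sample))
```

Running it over every line of a syslog and keeping the non-None results gives a quick timeline of resets to set against reboot and power-cycle dates.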

P33M avatar Apr 03 '25 13:04 P33M

Raspberry Pi 5 with the Argon Neo 5 NVMe case (so the NVMe controller they include) [ https://argon40.com/products/argon-neo-5-m-2-nvme-for-raspberry-pi-5 ]. The NVMe stick was:

fanxiang M.2 SSD 2TB, Up to 3500MB/s, 2TB NVMe SSD PCIe Gen3 x4 2280,QLC 3D NAND, Internal Solid State Drives with Graphene Cooling Sticker for Desktop, Laptop -S501Q

I eventually gave up on this combo and replaced the NVMe with

WD Green SN350 NVMe SSD 2TB M.2

From then on the problem did not reoccur.

The PSU is the "genuine" 27W PSU from PiHut

As I say, I'm not expecting a "fix"; it's just that, with a year of logs and various kernel updates over that period, I think there may be some useful data in the logs.

graemev avatar Apr 07 '25 12:04 graemev

On the Raspberry Pi 5, I never ran the Argon One v5 install script at https://download.argon40.com/argon1v5.sh and have been running NixOS just fine with the Argon One v5 dual NVMe case. Up until today, that is, when I ran the install script as part of installing the Argon v5 OLED accessory. That messed something up so badly that I started getting the error described in this issue and eventually corrupted data on the first NVMe partition, which prompted a total re-partition and re-installation of NixOS.

Unfortunately, I don't know exactly what was in my /boot/firmware/config.txt prior to running that shell script, but apparently it adds stuff in there:

[all]
dtparam=nvme
dtparam=pciex1_gen=3
dtparam=usb_max_current_enable=1
dtoverlay=dwc2,dr_mode=host

I think that caused interference of some kind, as it all went away with this config under the [all] section:

[all]
dtparam=nvme
dtparam=pciex1
dtoverlay=dwc2,dr_mode=host

And the OLED accessory is not installed now.
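
To make the difference between the two [all] sections above explicit, here is a small sketch that lists the lines unique to each. The two fragments are copied from this comment; the helper name and the "failing"/"working" labels are mine, and the comparison does not show which of the differing lines (if any) was responsible.

```python
def config_lines(text: str) -> list[str]:
    """Return setting lines, skipping blanks, comments and [section] headers."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        lines.append(line)
    return lines


# Fragment present after running the install script (errors seen).
failing = """[all]
dtparam=nvme
dtparam=pciex1_gen=3
dtparam=usb_max_current_enable=1
dtoverlay=dwc2,dr_mode=host
"""

# Fragment in use once the errors went away.
working = """[all]
dtparam=nvme
dtparam=pciex1
dtoverlay=dwc2,dr_mode=host
"""

only_failing = [l for l in config_lines(failing) if l not in config_lines(working)]
only_working = [l for l in config_lines(working) if l not in config_lines(failing)]
print("only in failing config:", only_failing)
print("only in working config:", only_working)
```

The two settings that appear only in the failing fragment are the PCIe Gen 3 request and the USB max-current override, so those would be the first candidates to test one at a time.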

fredrikaverpil avatar Aug 24 '25 20:08 fredrikaverpil