Freeze on Thinkpad P14s Gen3 AMD Machine Type 21J5

Open simonsystem opened this issue 2 years ago • 30 comments

Hi, I'm having trouble with my Thinkpad P14s Gen3 AMD (Machine Type 21J5). Every time I start thinkfan, it freezes after a random amount of time. No logs, just an immediate freeze, without the screen turning black.

I already tried:

  • Using options thinkpad_acpi fan_control=1 experimental=1 in modprobe.conf.
  • Using amd_pstate=active as kernel param.
  • Using amdgpu.dcdebugmask=0x10 as kernel param.
  • Switching to zcfan but with same issue. (Freeze after some minutes)
  • Disabling CPU Power Management in BIOS settings.
  • Disabling sleep/hibernate in the power settings via the KDE Plasma settings manager. (The machine now stays turned on.)

This is my thinkfan.conf:

sensors:
  - hwmon: /sys/class/hwmon
    name: thinkpad
    indices: [1, 3, 4, 5, 6, 7]

  - hwmon: /sys/class/hwmon
    name: thinkpad
    indices: [8]
    optional: true

  - hwmon: /sys/class/hwmon
    name: nvme
    indices: [1]

  - hwmon: /sys/class/hwmon
    name: acpitz
    indices: [1]

fans:
  - tpacpi: /proc/acpi/ibm/fan

levels:
 - [0, 0, 55]
 - [1, 50, 60]
 - [2, 55, 65]
 - [3, 60, 70]
 - [4, 65, 75]
 - [5, 70, 80]
 - [7, 75, 85]
 - ["level disengaged", 80, 255]

This is my journal for thinkfan systemd service:

-- Boot 40cba4c2651649b4a54e90663138bc5e --
Mai 30 10:43:38 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 10:43:38 copper thinkfan[898]: Daemon PID: 899
Mai 30 10:43:38 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 10:43:38 copper thinkfan[899]: Temperatures(bias): 77(0) -> Fans: level 7
Mai 30 10:43:45 copper thinkfan[899]: Temperatures(bias): 74(0) -> Fans: level 5
Mai 30 10:43:55 copper thinkfan[899]: Temperatures(bias): 60(0) -> Fans: level 3
Mai 30 10:44:05 copper thinkfan[899]: Temperatures(bias): 56(0) -> Fans: level 2
Mai 30 10:44:10 copper thinkfan[899]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 10:46:05 copper thinkfan[899]: Temperatures(bias): 49(0) -> Fans: level 0
-- Boot fd3f2ec8214340a0922ff31e68d09722 --
Mai 30 11:55:58 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 11:55:58 copper thinkfan[651]: Daemon PID: 654
Mai 30 11:55:58 copper thinkfan[654]: Temperatures(bias): 86(0) -> Fans: level 127
Mai 30 11:55:58 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 11:56:14 copper thinkfan[654]: Temperatures(bias): 73(0) -> Fans: level 5
Mai 30 11:56:26 copper thinkfan[654]: Temperatures(bias): 68(0) -> Fans: level 4
Mai 30 11:56:48 copper thinkfan[654]: Temperatures(bias): 64(0) -> Fans: level 3
Mai 30 11:57:13 copper thinkfan[654]: Temperatures(bias): 58(0) -> Fans: level 2
Mai 30 11:58:50 copper thinkfan[654]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 11:59:27 copper thinkfan[654]: Temperatures(bias): 67(0) -> Fans: level 3
Mai 30 12:00:21 copper thinkfan[654]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 12:00:46 copper thinkfan[654]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 12:02:11 copper thinkfan[654]: Temperatures(bias): 49(0) -> Fans: level 0
-- Boot 400f452fc3fd4221a1821cb9ed5fea3e --
Mai 30 13:11:10 copper systemd[1]: Starting simple and lightweight fan control program...
Mai 30 13:11:10 copper thinkfan[633]: Daemon PID: 635
Mai 30 13:11:10 copper systemd[1]: Started simple and lightweight fan control program.
Mai 30 13:11:10 copper thinkfan[635]: Temperatures(bias): 86(0) -> Fans: level 127
Mai 30 13:11:26 copper thinkfan[635]: Temperatures(bias): 75(0) -> Fans: level 7
Mai 30 13:11:36 copper thinkfan[635]: Temperatures(bias): 69(0) -> Fans: level 4
Mai 30 13:11:51 copper thinkfan[635]: Temperatures(bias): 63(0) -> Fans: level 3
Mai 30 13:12:13 copper thinkfan[635]: Temperatures(bias): 71(0) -> Fans: level 4
Mai 30 13:12:20 copper thinkfan[635]: Temperatures(bias): 63(0) -> Fans: level 3
Mai 30 13:12:50 copper thinkfan[635]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 13:13:57 copper thinkfan[635]: Temperatures(bias): 74(0) -> Fans: level 4
Mai 30 13:13:59 copper thinkfan[635]: Temperatures(bias): 81(0) -> Fans: level 7
Mai 30 13:14:09 copper thinkfan[635]: Temperatures(bias): 66(0) -> Fans: level 4
Mai 30 13:14:21 copper thinkfan[635]: Temperatures(bias): 61(0) -> Fans: level 3
Mai 30 13:14:31 copper thinkfan[635]: Temperatures(bias): 59(0) -> Fans: level 2
Mai 30 13:15:45 copper thinkfan[635]: Temperatures(bias): 54(0) -> Fans: level 1
Mai 30 13:20:45 copper thinkfan[635]: Temperatures(bias): 49(0) -> Fans: level 0

As you can see, it controls the fan level for a few minutes, but then the system freezes. Without thinkfan or zcfan running, everything works properly and nothing freezes, but I'm left with the annoying noise of my fan.
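For readers following the journal: the level changes above are consistent with a simple hysteresis rule over the levels entries, read as [level, lower limit, upper limit]. The Python sketch below only illustrates that reading; it is not thinkfan's actual code. The function name next_index is made up, and the exact boundary comparisons (>= for stepping up, < for stepping down) are an assumption that the log does not exercise.

```python
# Levels copied from the thinkfan.conf above: (level, lower limit, upper limit).
LEVELS = [
    (0, 0, 55),
    (1, 50, 60),
    (2, 55, 65),
    (3, 60, 70),
    (4, 65, 75),
    (5, 70, 80),
    (7, 75, 85),
    ("level disengaged", 80, 255),
]


def next_index(current, temp, levels=LEVELS):
    """Return the index of the level to use after reading `temp`.

    Hypothetical helper: step up while the temperature has reached the
    upper limit of the current entry, then step down while it is below
    the lower limit. The boundary operators are assumed, not confirmed.
    """
    idx = current
    while idx < len(levels) - 1 and temp >= levels[idx][2]:
        idx += 1
    while idx > 0 and temp < levels[idx][1]:
        idx -= 1
    return idx


if __name__ == "__main__":
    # Replay one transition from the journal: level 1, then a reading of 67.
    print(LEVELS[next_index(1, 67)][0])
```

Starting from level 1 with a reading of 67, this picks the entry for level 3, which matches the 11:59:27 line in the journal.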

My system:

Edit: Link to Kernel.org Bugzilla issue: https://bugzilla.kernel.org/show_bug.cgi?id=217548
Link to Lenovo Forums topic: https://forums.lenovo.com/t5/ThinkPad-T400-T500-and-newer-T-series-Laptops/ThinkPad-T14-Gen-3-21CF-kernel-freezes-when-controlling-fans-on-Linux/m-p/5252479

simonsystem avatar May 31 '23 10:05 simonsystem

Hi @simonsystem, you're writing "system freeze", so by that you mean that the entire system freezes? Or is it just thinkfan that freezes (i.e. stops doing anything)?

If the entire system locks up then there's probably not much thinkfan can do about it because that would be an issue with your kernel and/or drivers. You might try disabling individual sensors to find out which sensor (or fan) is triggering the freeze.

If it's just thinkfan that freezes, you could get more information with strace:

sudo strace -p `pgrep thinkfan`

And post the output here.

vmatare avatar Jun 06 '23 18:06 vmatare

@vmatare , I appear to have the same problem as @simonsystem . In my case, the whole system freezes. Are there any useful diagnostics to pull, in this case?

top-on avatar Jun 13 '23 17:06 top-on

No, it's the whole system that freezes, without anything being logged to dmesg or similar. I think it's a thinkpad_acpi-related issue. I will create an issue there and link it to this one.

@top-on Could you post your system specs here as well? Is it also a Thinkpad P14s Gen3 machine?

simonsystem avatar Jun 13 '23 18:06 simonsystem

This is my system, which also freezes after a random time when running thinkfan:

  • Laptop: Thinkpad P14s Gen3 AMD
  • Distro: Pop!_OS 22.04 LTS
  • Kernel: 6.2.6
  • BIOS version: 1.35

Maybe noteworthy: I am observing the same freezing behavior when running fancontrol.service or CoolerControl.

@simonsystem , thank you for creating and linking that issue!

top-on avatar Jun 13 '23 19:06 top-on

Added a link to a freshly created Kernel.org Bugzilla issue at: https://bugzilla.kernel.org/show_bug.cgi?id=217548

@top-on Thanks for your system specs. I hope we can help get this issue fixed.

simonsystem avatar Jun 13 '23 21:06 simonsystem

That sounds very inconvenient. Have any of you tried to find out how badly the system is frozen? Sometimes (though mostly with display-related problems) it's only the graphical UI (X, Wayland etc.) that freezes, while the Linux text consoles continue to work. In that case you can still use Ctrl-Alt-F1 through Ctrl-Alt-F6 to pull up one of the text consoles, log in there and check the kernel log with dmesg.

Another important test is whether the NumLock LED will still switch on & off. If it doesn't, that means your entire kernel is frozen and there's truly nothing left to do except hard reset.

vmatare avatar Jun 19 '23 15:06 vmatare

@vmatare, I can confirm that the system fully freezes in these cases: switching to a text console with Ctrl-Alt-F6 is not possible when frozen. Because my keyboard does not have a NumLock key, I currently cannot check the LED.

i have tested thinkfan also with the new BIOS version for the laptop model: 0.1.28. the other system parameters remained as above. unfortunately, the system also freezes with this new BIOS version.

just for a cross-reference that might be useful, i currently see greater system stability with the coolero flatpak and the latest BIOS, which however also froze at some point with the previous BIOS version. i will run coolero now for a few weeks with the latest BIOS, to see if that is more stable than before.

top-on avatar Jul 23 '23 09:07 top-on

I have to report that coolero also (fully) freezes my system with the above-mentioned parameters. It freezes somewhat later than with thinkfan, though :thinking:

I will re-run the tests whenever a new kernel ships for Pop!_OS or a new BIOS gets released.

top-on avatar Jul 25 '23 19:07 top-on

  • Laptop: Thinkpad T14 Gen3 AMD (21CF)
  • Distro: Debian Bookworm (Stable)
  • Kernel: 6.1.0
  • BIOS: 1.35 (newest)
  • Thinkfan 1.3.1

Thinkfan was causing freezes, so I went looking for another way around the dumb stock fan control (pulsing, delayed reaction to temperature rise). I would like to report that using pwmconfig from lm-sensors also causes freezes. After a freeze, changing the keyboard backlight still works (I don't know if that's helpful).

PiotrTD5 avatar Jul 25 '23 22:07 PiotrTD5

  • Laptop: Thinkpad T14 Gen3 AMD (21CF)

@PiotrTD5 This ticket only concerns P14s Gen3 AMD models. Even though your BIOS has the same version number, I cannot confirm that we are talking about the same issue. I want to avoid this ticket becoming a general ThinkPad freeze issue. Please open another ticket for your laptop model and reference this ticket in it.

Edit: @PiotrTD5 You are right. My fault, I now also think that yours is the same issue.

simonsystem avatar Jul 25 '23 23:07 simonsystem

I just wanted to help. The only differences between the P14s Gen3 AMD and the T14 Gen3 AMD are the model name on the LCD bezel and the stickers.

They share the same BIOS/EC firmware. From the official Lenovo BIOS update readme, supported models:

  • ThinkPad T14 Gen 3 (Machine Types:21CF,21CG)
  • ThinkPad T16 Gen 1 (Machine Types:21CH,21CJ)
  • ThinkPad P16s Gen 1 (Machine Types:21CK,21CL)
  • ThinkPad P14s Gen 3 (Machine Types:21J5,21J6)

Also, if you study pcsupport.lenovo.com, parts category, you'll find out that 21J5 and 21CF share the same FRU numbers for motherboards. I don't know about T16 vs P16s and I don't have time to check.

So IMHO, you should add the T14 Gen3 AMD model to this issue instead of creating another. I don't know why you strictly want it to be a P14s Gen3 issue when technically it's the same hardware and firmware. I have zero experience with GitHub, so I'll do what you ask if I am really wrong about this.

PiotrTD5 avatar Jul 26 '23 13:07 PiotrTD5

the same happens on my ThinkPad P16s Gen 1: total system freeze some time after thinkfan starts

p345123 avatar Aug 02 '23 04:08 p345123

I've got a T14 G3 AMD with the same issue: the kernel freezes after a while of usage.

However, with the experimental=1 and fan_control=1 modprobe params I can still echo levels, timeout, enable, disable, disengage into /proc/acpi/ibm/fan without the kernel freezing on me.

Lillecarl avatar Sep 07 '23 12:09 Lillecarl

I wrote my own shitty Python script as a thinkfan "replacement" and noticed that this happens when we write levels frequently to the fan control file. I built the script so that it checks the current level and compares with what I'd like to set and it seems to be rather "stable" for me now.

https://gist.github.com/Lillecarl/15b683c3cd3bafe74ca3c4dafd427d2e This is the script i used for my testing, keeps my laptop silent for the most part but will ramp the fan all the way up to full-speed (not sure if that's dangerous for the fan or not) if temperatures are high

EDIT: Further testing indicates I was just lucky in the beginning. After realizing i have to write to the fan control file every 110 seconds (after setting watchdog to 120) I started experiencing random lockups again. (Only writes reset the watchdog timeout, which I think is a good idea to keep active if fan control software crashes).
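A minimal sketch of the approach described above: write to the fan control file only when the level differs from the desired one, or when the watchdog is about to expire. This is not the script from the gist. The helper names (parse_level, needs_write, set_level) are made up, and the "level:" line format of /proc/acpi/ibm/fan is assumed from how thinkpad_acpi usually reports it. As the EDIT says, this scheme did not prevent the lockups, so treat it purely as an illustration.

```python
import time

FAN_FILE = "/proc/acpi/ibm/fan"  # path used throughout this thread
REFRESH_SECONDS = 110            # rewrite before a 120 s watchdog expires


def parse_level(fan_text):
    """Return the value of the 'level:' line, or None if it is missing.

    Assumes lines of the form 'level:<tabs>auto'; this format is not
    shown in the thread.
    """
    for line in fan_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "level":
            return value.strip()
    return None


def needs_write(current, desired, last_write, now, refresh=REFRESH_SECONDS):
    """Write if the level changed or the watchdog needs to be reset."""
    return current != str(desired) or (now - last_write) >= refresh


def set_level(desired, last_write, fan_file=FAN_FILE, now=None):
    """Apply `desired` if needed and return the time of the last write."""
    now = time.monotonic() if now is None else now
    with open(fan_file) as f:
        current = parse_level(f.read())
    if needs_write(current, desired, last_write, now):
        with open(fan_file, "w") as f:
            f.write(f"level {desired}\n")
        return now
    return last_write
```

A control loop would call set_level once per measurement and carry the returned timestamp into the next call.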

Lillecarl avatar Sep 07 '23 14:09 Lillecarl

https://forums.lenovo.com/t5/ThinkPad-T400-T500-and-newer-T-series-Laptops/ThinkPad-T14-Gen-3-21CF-kernel-freezes-when-controlling-fans-on-Linux/m-p/5252479 Reported to Lenovo forums too

Lillecarl avatar Sep 12 '23 13:09 Lillecarl

@Lillecarl , i really liked your idea of boiling down fan control to "read temperature" and "reduce fan speed for X seconds". i tested a simplified version of your script, but it also completely freezes my machine after some time. it was worth a shot, though :slightly_smiling_face:

top-on avatar Sep 18 '23 21:09 top-on

BTW: As a workaround, I switched my notebook to "Cool 'n' Quiet" mode in the BIOS and completely disabled thinkfan. I think I lost performance, but it's not as loud as before. It's not a real solution, of course.

@all: Thanks for all your suggestions and assistance in analyzing this issue. @PiotrTD5: Sorry that I didn't realize your issue really is the same thing. @Lillecarl: Special thanks for your scripting tests. Good idea, but unfortunately... nah.

simonsystem avatar Sep 21 '23 19:09 simonsystem

@simonsystem I've been able to control my fans reliably by always stepping through level 1 before level 0.

image

That's 3 hours, controlling the fans with software all the time.

Please ignore the steep stepping up and down, my control software isn't as polished as thinkfan, although I've got some nice ideas involving reading CPU Package Power from the MSR and use that to step the fans based on actual heat dissipation needs like https://github.com/hirschmann/nbfc does for Windows

EDIT: false...... further natural testing by stressing the cpu every 30-60 seconds got another hang. On the bright side, after switching randomly between levels 1-7 I've discovered that it's going to 0 that freezes the system, no other levels https://prints.lillecarl.com/20231012-225047_lldegbbcjk.png

Lillecarl avatar Oct 12 '23 14:10 Lillecarl

I've got exactly the same issue with my P14s Gen3 AMD. For now I have completely disabled thinkfan (otherwise I had a freeze every few minutes; it looks like a kernel panic, because REISUB does not respond).

ishfx avatar Oct 16 '23 10:10 ishfx

@Lillecarl Sir, you're a lifesaver! I've been pulling my hair out over random hard freezes as mentioned above, and it took me some time to pin this issue on fan control. I can also confirm that not using level 0 avoids any freezes on my machine.

a-rasinski avatar Oct 19 '23 10:10 a-rasinski

I haven't used this utility myself, but found out about it today because someone pointed me specifically to this issue. I'll have to take a closer look, but the issue, as others have noticed here, too, seems to be related to the fan speed levels. According to the man page source (thinkfan.conf.5.cmake), they can be numeric values in either the 0-7 or the 0-255 range (https://github.com/vmatare/thinkfan/blob/master/src/thinkfan.conf.5.cmake#L439). The 0-7 range may not be handled properly when adding the fan speed levels here: https://github.com/vmatare/thinkfan/blob/master/src/config.cpp#L106

The config shown here sets the disengaged level as the last level to be added, which at first glance should map to std::numeric_limits<int>::min();.

I'm not going to speculate any further as to how that might contribute to this bug without cloning the repo and going through the code itself, but that would be where I'd look, so thought I'd mention it here.

Some shameless self-promotion: I found out about this because I hacked together a small utility to manage fan speeds on my old thinkpad (GTK+3, old school C). It's nowhere near as feature complete as this tool, but maybe some of you here can use it until this bug gets fixed: https://github.com/EVODelavega/fan_control

EVODelavega avatar Oct 26 '23 13:10 EVODelavega

Guys, this is clearly a kernel bug (or most probably in the thinkpad_acpi kernel module). You need to check the kernel.org bugtracker and potentially report it there.

vmatare avatar Nov 01 '23 19:11 vmatare

/sys/class/power_supply/BAT0/hwmon0/subsystem/hwmon1/pwm1

Setting values there to (255/7)*level doesn't lock up my machine.
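A small sketch of the (255/7)*level mapping mentioned above, assuming the result is rounded to the nearest integer before being written to pwm1. The function names are made up, and whether pwm1_enable has to be switched to manual mode first is not stated in this thread, so the sketch does not touch it.

```python
# Path quoted in the comment above.
PWM_FILE = "/sys/class/power_supply/BAT0/hwmon0/subsystem/hwmon1/pwm1"


def level_to_pwm(level):
    """Map a fan level 0-7 onto the 0-255 PWM range, i.e. (255/7)*level."""
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return round(255 * level / 7)


def write_pwm(level, pwm_file=PWM_FILE):
    """Write the scaled value to the pwm file (needs root on a real system)."""
    with open(pwm_file, "w") as f:
        f.write(f"{level_to_pwm(level)}\n")


if __name__ == "__main__":
    # Show the scaled value for every level without touching the hardware.
    for level in range(8):
        print(level, level_to_pwm(level))
```

The rounding choice is mine; truncating instead would shift some of the intermediate values by one.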

Lillecarl avatar Nov 13 '23 15:11 Lillecarl

https://download.lenovo.com/pccbbs/mobiles/r23uj73wd.html - (New) Change to permit fan rotation after fan error happen.

Lillecarl avatar Jan 10 '24 11:01 Lillecarl

https://download.lenovo.com/pccbbs/mobiles/r23uj73wd.htm - (New) Change to permit fan rotation after fan error happen.

@Lillecarl Did you try it? Does it solve our issue? Sounds promising!

simonsystem avatar Jan 10 '24 13:01 simonsystem

@simonsystem Yep, it's finally working! The EC fancontrol is also quite decent, so I rewrote my fancontrol script to turn fans off if average temp is below 60 for 30 seconds, and turn to auto if average temperature is above 60 for 30 seconds or above 70 for one measurement. https://github.com/Lillecarl/nixos/blob/master/scripts/fancontrol2.py It can be simplified further but it's got legacy from previous attempts at things 😄
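The off/auto rule described above can be sketched roughly as follows. This is not the linked fancontrol2.py, just a minimal illustration under my own assumptions: the class name OffAutoController is made up, "for 30 seconds" is read as "continuously on the same side of 60 for at least 30 seconds", and a reading of exactly 60 is counted as "above". On firmware without the fix, mapping "off" to level 0 is exactly what triggered the freezes in this thread.

```python
class OffAutoController:
    """Hypothetical sketch: switch between fan 'off' and 'auto' with a delay."""

    def __init__(self, threshold=60.0, panic=70.0, hold_seconds=30.0):
        self.threshold = threshold        # average temperature boundary
        self.panic = panic                # a single reading above this forces auto
        self.hold_seconds = hold_seconds  # time a side must persist
        self.mode = "auto"
        self._side = None                 # "below" or "above" the threshold
        self._since = 0.0                 # when the current side started

    def update(self, now, avg_temp):
        """Feed one measurement (time in seconds) and return 'off' or 'auto'."""
        if avg_temp > self.panic:
            # One hot measurement is enough to hand control back to the EC.
            self.mode = "auto"
            self._side = "above"
            self._since = now
            return self.mode
        side = "below" if avg_temp < self.threshold else "above"
        if side != self._side:
            # Crossed the threshold: restart the timer, keep the current mode.
            self._side = side
            self._since = now
        elif now - self._since >= self.hold_seconds:
            self.mode = "off" if side == "below" else "auto"
        return self.mode
```

A loop would call update(time.monotonic(), average_temperature) once per measurement and write the matching level to the fan control file whenever the returned mode changes.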

I reckon we can close this? If the new UEFI and EC is out for your model too 😄

Lillecarl avatar Jan 10 '24 14:01 Lillecarl

At least for P14s Gen3 (21J5), this BIOS version isn't available anymore. https://pcsupport.lenovo.com/us/en/products/laptops-and-netbooks/thinkpad-p-series-laptops/thinkpad-p14s-gen-3-type-21j5-21j6/downloads/ds557681-bios-update-utility-bootable-cd-for-windows-10-64-bit-thinkpad-t14-gen-3-type-21cf-21cg-t16-gen-1-type-21ch-21cj-p16s-gen-1type-21ck-21cl?category=BIOS%2FUEFI

This BIOS version R23UJ73W is reported Lenovo cloud not working issue, hence it has been withdrawn from support site.

I downloaded it while it was still available. The fan issue was gone, and I could set my fan to 0 without freezes.

But I got standby issues. The system now freezes when coming back from deep standby after sleeping for an hour or so. Unfortunately, there is no BIOS option for changing the standby mode, so I cannot try other modes. I think it's fixed to "Modern Standby", which may not be well supported by Linux. I'm not an expert in these hardware things. (https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate)

@Lillecarl So, nah, BIOS version 1.49 (R23UJ73WD) has been withdrawn. So this isn't closed yet, is it? How about your model, is that BIOS version still available?

simonsystem avatar Jan 16 '24 00:01 simonsystem

@simonsystem It's withdrawn for T14 G3 as well. Meme company. I'm using s2idle, on the AMD system it draws just 30% per 2 days or so so it's good enough for me.

@lillecarl:matrix.org if you wanna keep discussing, this is already miles offtopic from thinkfan 😄

Lillecarl avatar Jan 16 '24 01:01 Lillecarl

It's withdrawn for T14 G3 as well. Meme company.

@Lillecarl Sorry for putting my 2 cents to the offtop, but this is mildly infuriating as it's the second bios version withdrawn in a row to which I've updated. Previous withdrawn one could brick the device, I hope this one won't. Meme company indeed.

a-rasinski avatar Jan 16 '24 10:01 a-rasinski

From that Lenovo thread it seems like a proper fix might take another while. In the meantime, another possible workaround is using "level auto" instead of speed 0 for the idle fan speed setting. This does turn off the fan for sufficiently low temperatures, though I have not found the exact boundary yet.
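To illustrate the workaround, here is a small Python sketch that takes the levels list from the config at the top of this issue and swaps the speed-0 entry for "level auto". The helper name replace_idle_level is made up, and keeping the original 0/55 limits on the replaced entry is my assumption, not something tested here.

```python
import json

# Levels as posted in the thinkfan.conf at the top of this issue.
LEVELS = [
    [0, 0, 55],
    [1, 50, 60],
    [2, 55, 65],
    [3, 60, 70],
    [4, 65, 75],
    [5, 70, 80],
    [7, 75, 85],
    ["level disengaged", 80, 255],
]


def replace_idle_level(levels, replacement="level auto"):
    """Return a copy of `levels` where speed 0 is replaced by `replacement`."""
    return [
        [replacement if speed == 0 else speed, lower, upper]
        for speed, lower, upper in levels
    ]


if __name__ == "__main__":
    # Print the entries in the same flow-list style as the config file.
    print("levels:")
    for entry in replace_idle_level(LEVELS):
        print(" - " + json.dumps(entry))
```

Only the first entry changes; every other level keeps its speed and limits.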

KiitoX avatar Jan 22 '24 05:01 KiitoX