htop icon indicating copy to clipboard operation
htop copied to clipboard

Temperature reporting fix for AMD Zen CPUs

Open JohnsonGlenT opened this issue 2 years ago • 13 comments

Reports Package temperature across CPU dies for AMD Zen CPU's. Temperature is now selected by package die Tccd[X] for for temperature reporting and ignores the Tctl reporting.

Tested and Validated with: Ryzen 3600 on Artix Linux Ryzen Threadripper 3970x on Arch Linux

JohnsonGlenT avatar Jan 27 '23 02:01 JohnsonGlenT

Hi, I compiled with these patches and it seems to not work correctly.

This is on a Ryzen 5900X.

Screenshot_2023-02-27_16-35-01

CyberX5 avatar Feb 27 '23 15:02 CyberX5

Hi, I compiled with these patches and it seems to not work correctly.

This is on a Ryzen 5900X.

Screenshot_2023-02-27_16-35-01

can you submit the output of sensors?

JohnsonGlenT avatar Mar 04 '23 03:03 JohnsonGlenT

Oh sorry about that. I was thinking of including that initially.

sensors k10temp-pci-00c3

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +41.4°C
Tccd1:        +41.2°C
Tccd2:        +38.2°C
sensors -u k10temp-pci-00c3

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:
  temp1_input: 44.125
Tccd1:
  temp3_input: 37.750
Tccd2:
  temp4_input: 39.250

CyberX5 avatar Mar 04 '23 13:03 CyberX5

@CyberX5 Can you run and upload the ryzen.log that https://github.com/JohnsonGlenT/htop/tree/ryzen5900X branch creates.

JohnsonGlenT avatar Mar 11 '23 23:03 JohnsonGlenT

Just adding a test result (compiled your fresh logging branch)

7600X works great - top is current stable, bottom is your branch image

C0rn3j avatar Mar 12 '23 00:03 C0rn3j

@CyberX5 Can you run and upload the ryzen.log that https://github.com/JohnsonGlenT/htop/tree/ryzen5900X branch creates.

Sure :slightly_smiling_face: ryzen.log

CyberX5 avatar Mar 12 '23 12:03 CyberX5

@CyberX5 What OS/kernel/Distro are you on?

From the log, your system seems to be reporting coreTempCount as 3. I am assuming it is counting the Tctl, which so far seems to be an issue not with your CPU, but it may be how htop is dealing with polling the k10temp sensors output on older kernels. If this issue will exist on major versions of the kernel, namely LTS, additional changes may need to be made.

JohnsonGlenT avatar Mar 14 '23 02:03 JohnsonGlenT

@JohnsonGlenT Well, I do want to apologize it does seem to be an issue on my end, I'm on Arch Linux, 6.1 LTS Kernel, but I run a custom kernel config and I'm guessing that's the problem, now I do keep that in mind so I did test this on the stock Arch kernel before reporting, but must have messed something up...

ryzen-6.1.19-1-lts.log Here's the log from the stock Arch LTS kernel. CoreTempCount is 2 now.

Also, I have no idea what config option could be causing this... Haven't seen anything like this in any other program :shrug:

Now, I have encountered a different issue.

Screenshot_2023-03-14_04-43-55

To my understanding the CCD layout of this CPU is 0-5,12-17 = CCD1 and 6-11,18-23 = CCD2.

lstopo for reference: Screenshot_2023-03-14_04-58-08

I actually have the columns set to 2 for this reason, so visually 1 row would contain both threads of a core. For example thread 0 and 12 would be core 0.

Here's a command I found that lists sibling threads. Found here.

grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -n -t ':' -k 2 -u
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,12
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:1,13
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2,14
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:3,15
/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4,16
/sys/devices/system/cpu/cpu5/topology/thread_siblings_list:5,17
/sys/devices/system/cpu/cpu6/topology/thread_siblings_list:6,18
/sys/devices/system/cpu/cpu7/topology/thread_siblings_list:7,19
/sys/devices/system/cpu/cpu8/topology/thread_siblings_list:8,20
/sys/devices/system/cpu/cpu9/topology/thread_siblings_list:9,21
/sys/devices/system/cpu/cpu10/topology/thread_siblings_list:10,22
/sys/devices/system/cpu/cpu11/topology/thread_siblings_list:11,23

And again sorry about that :heart:

CyberX5 avatar Mar 14 '23 04:03 CyberX5

My test log above with stable htop that is mostly broken is from I believe Arch Linux with stock 6.2.2 kernel.

C0rn3j avatar Mar 14 '23 07:03 C0rn3j

To my understanding the CCD layout of this CPU is 0-5,12-17 = CCD1 and 6-11,18-23 = CCD2.

lstopo for reference: Screenshot_2023-03-14_04-58-08

I actually have the columns set to 2 for this reason, so visually 1 row would contain both threads of a core. For example thread 0 and 12 would be core 0.

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

JohnsonGlenT avatar Mar 14 '23 15:03 JohnsonGlenT

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. :slightly_smiling_face:

CyberX5 avatar Mar 14 '23 15:03 CyberX5

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. slightly_smiling_face

I have confirmed that htop reads CPU number by physical id. I will update when I have a fix that is ready for testing.

JohnsonGlenT avatar Mar 15 '23 05:03 JohnsonGlenT

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. slightly_smiling_face

I have confirmed that htop reads CPU number by physical id. I will update when I have a fix that is ready for testing.

siblist.txt

So, as of doing looking with other zen architectures (threadripper 3970X, file attached) there seems to be no uniform way to ensure mapping of the CPU thread siblings will receive the same temperature reading, except for polling the thread siblings on every update tick. In this case I think mapping the temp by logical process number is a good enough solution without possibly affecting other systems, or needing to overhaul CPU startup/update loop.

JohnsonGlenT avatar Mar 16 '23 02:03 JohnsonGlenT