[FeatureRequest] Zen5/Zen5c Epyc support
Is your feature request related to a problem? Please describe.
I'm trying to get memory and PCIe bandwidth statistics. The CPU is an AMD EPYC 9275F, which belongs to the Zen5 family and is currently not supported by LIKWID.
Describe the solution you'd like
Add support for Zen5/Zen5c.
Describe alternatives you've considered
None.
Additional context
None.
Thanks for your request. We will add Zen5 as soon as we have access to such a system AND the documentation of it. If Zen5 and Zen5c are different in their hardware support, we need such a system and its documentation as well.
I'd like to second this request:
[root@gpu3-01 ~]# likwid-perfctr -e
ERROR - [/scratch/source/likwid-5.4.1/src/perfmon.c:perfmon_init_maps:1608] Unsupported Processor
ERROR - [/scratch/source/likwid-5.4.1/src/perfmon.c:perfmon_check_counter_map:759] Counter and event maps not initialized.
This architecture has 0 counters.
Counter tags(name, type):
This architecture has 0 events.
Event tags (tag, id, umask, counters):
[root@gpu3-01 ~]# likwid-perfctr -i
--------------------------------------------------------------------------------
CPU name: AMD EPYC 9555 64-Core Processor
CPU type: nil
CPU clock: 3.20 GHz
CPU family: 26
CPU model: 2
CPU vendor: 0
CPU part: 0
CPU short: nil
CPU stepping: 1
CPU features: FP MMX SSE SSE2 HTT MMX RDTSCP MONITOR SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3
CPU arch: x86_64
PERFMON supports Uncore: 0
--------------------------------------------------------------------------------
We have decent new nodes with AMD EPYC 9555 processors that I'd like to monitor. If you want us to run any discovery scripts, development versions or the like on these processors, just let me know.
Please provide the output of /proc/cpuinfo and ls /sys/devices
Thanks for the answer.
Sure, here you go:
For said AMD EPYC 9555, it is
/proc/cpuinfo:
[root@gpu3-01 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 26
model : 2
model name : AMD EPYC 9555 64-Core Processor
stepping : 1
microcode : 0xb00211a
cpu MHz : 3200.000
cache size : 1024 KB
physical id : 0
siblings : 128
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap amd_lbr_pmc_freeze
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 6390.20
TLB size : 192 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
...
(full output attached, it's a dual socket machine with SMT enabled, so essentially 256 times the same content)
/sys/devices:
[root@gpu3-01 ~]# ls /sys/devices/
amd_df amd_iommu_3 amd_iommu_7 amd_umc_10 amd_umc_14 amd_umc_18 amd_umc_21 amd_umc_4 amd_umc_8 ibs_fetch msr pci0000:30 pci0000:70 pci0000:b0 pci0000:f0 software virtual
amd_iommu_0 amd_iommu_4 amd_l3 amd_umc_11 amd_umc_15 amd_umc_19 amd_umc_22 amd_umc_5 amd_umc_9 ibs_op pci0000:00 pci0000:40 pci0000:80 pci0000:c0 platform system
amd_iommu_1 amd_iommu_5 amd_umc_0 amd_umc_12 amd_umc_16 amd_umc_2 amd_umc_23 amd_umc_6 breakpoint kprobe pci0000:10 pci0000:50 pci0000:90 pci0000:d0 pnp0 tracepoint
amd_iommu_2 amd_iommu_6 amd_umc_1 amd_umc_13 amd_umc_17 amd_umc_20 amd_umc_3 amd_umc_7 cpu LNXSYSTM:00 pci0000:20 pci0000:60 pci0000:a0 pci0000:e0 power uprobe
Anything else that might be relevant or useful?
Those should be the relevant docs: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/57238.zip
@wkgcass : Does your system have the same cpu family and model?
cpu family : 26
model : 2
@TomTheBear hi, sorry for the delay.
Yes, exactly 26 and 2, here's my full lscpu output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9275F 24-Core Processor
CPU family: 26
Model: 2
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 79%
CPU max MHz: 4816.6992
CPU min MHz: 1500.0000
BogoMIPS: 8199.83
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 2.3 MiB (48 instances)
L1i: 1.5 MiB (48 instances)
L2: 48 MiB (48 instances)
L3: 512 MiB (16 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
Can someone please provide some information about these /sys/devices/amd_umc_* units?
tail -v -n +1 /sys/devices/amd_umc_0/events
tail -v -n +1 /sys/devices/amd_umc_0/format
tail -v -n +1 /sys/devices/power/events
(filenames and content of the events and format folders).
Please check whether the number of /sys/devices/amd_umc_* folders matches the output of the attached code.
$ gcc cpuid_umc_count.c -o cpuid_umc_count
$ ./cpuid_umc_count
ebx[24:16] = 0x0
It reads the Extended Performance Monitoring and Debug leaf (0x80000022) through CPUID.
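For readers unfamiliar with the `ebx[24:16]` notation: it means bits 24 down to 16 (inclusive) of the EBX register returned by the CPUID instruction. The attached C program is not reproduced in this thread, so the following is only a minimal sketch of the bit extraction step on an already-read register value. The helper name `bit_field` and the sample value are hypothetical and chosen for illustration.

```python
def bit_field(value, high, low):
    """Return bits high..low (inclusive) of a 32-bit register value."""
    if not 0 <= low <= high <= 31:
        raise ValueError("expected 0 <= low <= high <= 31")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


# Hypothetical raw EBX value, chosen only to illustrate the extraction.
ebx = 0x00180000
print(f"ebx[24:16] = {bit_field(ebx, 24, 16):#x}")
```

With an EBX value of zero in that range, the same call yields `0x0`, which is the result reported in this thread.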
Sure.
/sys/devices/amd_umc_0/events does not exist on my system (Epyc 9555 dual socket, Rocky 9.5). This is the structure of these directories:
[root@gpu3-01 ~]# tree /sys/devices/amd_umc_0/
/sys/devices/amd_umc_0/
├── cpumask
├── format
│   ├── event
│   └── rdwrmask
├── perf_event_mux_interval_ms
├── power
│   ├── autosuspend_delay_ms
│   ├── control
│   ├── runtime_active_time
│   ├── runtime_status
│   └── runtime_suspended_time
├── subsystem -> ../../bus/event_source
├── type
└── uevent
But in case it helps, these are the contents of the files with the most similar names:
[root@gpu3-01 ~]# tail -v -n +1 /sys/devices/amd_umc_0/uevent
==> /sys/devices/amd_umc_0/uevent
/sys/devices/amd_umc_0/format/event
/sys/devices/amd_umc_0/format/rdwrmask
There are 24 in total on my system (/sys/devices/amd_umc_0 through /sys/devices/amd_umc_23).
/sys/devices/power/events is
[root@gpu3-01 ~]# tree /sys/devices/power/events
/sys/devices/power/events
├── energy-pkg
├── energy-pkg.scale
└── energy-pkg.unit
with contents
[root@gpu3-01 ~]# tail -v -n +1 /sys/devices/power/events/*
==> /sys/devices/power/events/energy-pkg
/sys/devices/power/events/energy-pkg.scale
/sys/devices/power/events/energy-pkg.unit
This is the output of cpuid_umc_count:
[root@gpu3-01 ~]# gcc cpuid_umc_count.c -o cpuid_umc_count
[root@gpu3-01 ~]# ./cpuid_umc_count
ebx[24:16] = 0x0
(Same with gcc 11.5.0 and clang/llvm 18.1.8.) I can't tell you, though, whether this can be considered a "match", as there are 24 of these folders :-).
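To compare the folder count with the CPUID value, it helps to extract the numeric suffixes of the amd_umc_* entries rather than counting by eye. This is a minimal sketch that works on a list of names; `umc_indices` is a hypothetical helper name, and the sample list is an abbreviated stand-in for a real directory listing.

```python
import re


def umc_indices(names):
    """Extract the numeric suffixes of amd_umc_<n> entries, sorted ascending."""
    pattern = re.compile(r"amd_umc_(\d+)$")
    found = []
    for name in names:
        match = pattern.match(name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


# Abbreviated sample; on a live system one could pass os.listdir("/sys/devices").
sample = ["amd_df", "amd_l3", "amd_umc_0", "amd_umc_1", "amd_umc_10", "cpu"]
indices = umc_indices(sample)
print(len(indices), indices)
```

Sorting numerically avoids the lexicographic order of `ls`, where amd_umc_10 comes before amd_umc_2.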
Anything more?
Hi, on my platform (2x EPYC 9275F, Gigabyte MZ73-LM2 Rev. 3.x, Ubuntu 24.04, kernel 6.8.0-54-generic), /sys/devices/amd_umc_* doesn't exist.
root@exp0:~# ls -lh /sys/devices/amd_umc_*
ls: cannot access '/sys/devices/amd_umc_*': No such file or directory
root@exp0:~# cd /sys/devices/
root@exp0:/sys/devices# ls
amd_iommu_0 amd_iommu_4 breakpoint isa pci0000:00 pci0000:40 pci0000:80 pci0000:c0 platform system
amd_iommu_1 amd_iommu_5 cpu kprobe pci0000:10 pci0000:50 pci0000:90 pci0000:d0 pnp0 tracepoint
amd_iommu_2 amd_iommu_6 ibs_fetch LNXSYSTM:00 pci0000:20 pci0000:60 pci0000:a0 pci0000:e0 power uprobe
amd_iommu_3 amd_iommu_7 ibs_op msr pci0000:30 pci0000:70 pci0000:b0 pci0000:f0 software virtual
Is there anything I should configure?
@wkgcass Whether perf units are available commonly depends on the Linux kernel version, or they get enabled by distribution-specific patch sets.
Thanks @behnle. It would be important to understand how the UMC units are distributed. The docs specify 64 units in total, but your system provides only 24. Can you please check the code and add an output for ecx as well? I need to know whether bits 0-23 are active in the UMC mask or whether there are gaps in between.
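The question about gaps in the mask can be answered mechanically once the ECX value is known. The sketch below assumes, as described above, that each set bit in the 32-bit mask marks one active unit. The function names are hypothetical, and only the all-zero mask is a value actually reported in this thread; the other two are made up to show both outcomes.

```python
def active_units(mask):
    """Return the bit positions that are set in a 32-bit mask, ascending."""
    return [bit for bit in range(32) if (mask >> bit) & 1]


def has_gaps(units):
    """True if the active units are not a contiguous run starting at 0."""
    return units != list(range(len(units)))


# 0x00000000 is the value reported in this thread; the others are hypothetical.
for mask in (0x00000000, 0x00FFFFFF, 0x00FF00FF):
    units = active_units(mask)
    print(f"{mask:#010x}: {len(units)} active, gaps={has_gaps(units)}")
```

The length of the list returned by `active_units` is the population count of the mask, so an all-zero mask yields zero active units.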
Way out of my comfort zone...
I added the line
printf("ecx[31:0] = 0x%08X\n", ecx );
assuming that in the inline assembler call ecx is actually filled with data and not only used as placeholder.
Is this what you had in mind? This is the result:
[root@gpu3-01 ~]# ./cpuid_umc_count
ebx[24:16] = 0x0
ecx[31:0] = 0x00000000
So if I interpret the result correctly, the entire bitmask is zero. Does this make sense and help?
Some maybe dumb beginner's questions:
- The doc says "calculate the number of PMCs as Core.../POPCNT..." -> would this mean that (EBX and ECX being all zero) there are none?
- Inside my amd_umc_X directories, there are no events files. Would that be in line with the other finding?
- Could it be that I have to first enable these? Maybe a kernel command? Or a BIOS setting?
It's unfortunate that the hardware does not report the required data. The Linux kernel gets its data from CPUID leaf 0x80000022 as well.
Some units do not specify any events. The entries in the events folders are only examples; the folders never contain all possible events. The important information for LIKWID is the type file and the contents of the format folder.
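For anyone who wants to collect exactly that information from their own system, here is a minimal sketch that reads the type file and every file in the format folder of one PMU directory. `read_pmu` is a hypothetical helper, not part of LIKWID, and the path used in the example is the one discussed in this thread; it may not exist on other kernels.

```python
from pathlib import Path


def read_pmu(pmu_dir):
    """Read the 'type' value and every file in 'format' of a perf PMU directory."""
    base = Path(pmu_dir)
    pmu_type = int((base / "type").read_text().strip())
    fields = {}
    format_dir = base / "format"
    if format_dir.is_dir():
        for entry in sorted(format_dir.iterdir()):
            fields[entry.name] = entry.read_text().strip()
    return {"type": pmu_type, "format": fields}


if __name__ == "__main__":
    # Path taken from this thread; adjust it to the PMU of interest.
    target = Path("/sys/devices/amd_umc_0")
    if target.is_dir():
        print(read_pmu(target))
    else:
        print(f"{target} not present on this system")
```

Pasting the resulting dictionary into the issue gives the maintainers the type number and the bit layout of each format field in one go.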
I created initial support for AMD Zen5: https://github.com/RRZE-HPC/likwid/pull/688
Please also provide the files and contents in /sys/devices/amd_df/format. Thanks
No problem, here you go. The directory content:
[root@gpu3-01 format]# tree /sys/devices/amd_df/format/
/sys/devices/amd_df/format/
├── event
└── umask
The file contents:
[root@gpu3-01 format]# tail -v -n +1 /sys/devices/amd_df/format/*
==> /sys/devices/amd_df/format/event
/sys/devices/amd_df/format/umask
Anything else?
No, I think I have everything now. I was just wondering because the DataFabric units have event and umask in the docs, but both are split up for writing (as you can see in the format outputs). It might have been the case that perf_event expects them separated: event=config:0-7, event_ext=config:32-37, umask=config:8-15, umask_ext=config:24-27. I have seen that in the past.
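To make the split layout concrete, the sketch below shows how a value can be spread over the bit ranges of a format entry such as `config:0-7` or `config:0-7,32-37`. It assumes the low bits of the value fill the ranges in the order they are listed. `parse_format` and `place` are hypothetical helpers, not LIKWID or perf code, and the event number 0x107 is made up.

```python
def parse_format(spec):
    """Parse a format string such as 'config:0-7,32-37' into (name, ranges)."""
    target, _, bits = spec.strip().partition(":")
    ranges = []
    for part in bits.split(","):
        low, _, high = part.partition("-")
        ranges.append((int(low), int(high or low)))
    return target, ranges


def place(value, ranges):
    """Spread the low bits of value over the (low, high) bit ranges in order."""
    result = 0
    for low, high in ranges:
        width = high - low + 1
        result |= (value & ((1 << width) - 1)) << low
        value >>= width
    if value:
        raise ValueError("value does not fit into the given bit ranges")
    return result


# Separate fields, as in the layout quoted above (event and event_ext).
event_lo = parse_format("config:0-7")[1]
event_hi = parse_format("config:32-37")[1]
config = place(0x07, event_lo) | place(0x01, event_hi)  # hypothetical event 0x107
print(hex(config))
```

If a system exposes a single combined field instead, `place(0x107, parse_format("config:0-7,32-37")[1])` produces the same config word, so the helper covers both variants.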