L3 cache topology information is inaccurate on Intel chips with a NUMA architecture
I ran the topology example on an INTEL(R) XEON(R) GOLD 6542Y, but its output does not match the output of lscpu -e.
The CPU information is as follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 207
Model name: INTEL(R) XEON(R) GOLD 6542Y
BIOS Model name: INTEL(R) XEON(R) GOLD 6542Y
Stepping: 2
CPU MHz: 3564.866
CPU max MHz: 2901.0000
CPU min MHz: 800.0000
BogoMIPS: 5800.00
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 61440K
NUMA node0 CPU(s): 0-11,48-59
NUMA node1 CPU(s): 12-23,60-71
NUMA node2 CPU(s): 24-35,72-83
NUMA node3 CPU(s): 36-47,84-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm uintr md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Topology reported by ghw:
topology NUMA (4 nodes)
node #0 (12 cores)
L1i cache (32 KB) shared with logical processors: 0,48
L1i cache (32 KB) shared with logical processors: 1,49
L1i cache (32 KB) shared with logical processors: 2,50
L1i cache (32 KB) shared with logical processors: 3,51
L1i cache (32 KB) shared with logical processors: 4,52
L1i cache (32 KB) shared with logical processors: 5,53
L1i cache (32 KB) shared with logical processors: 6,54
L1i cache (32 KB) shared with logical processors: 7,55
L1i cache (32 KB) shared with logical processors: 8,56
L1i cache (32 KB) shared with logical processors: 9,57
L1i cache (32 KB) shared with logical processors: 10,58
L1i cache (32 KB) shared with logical processors: 11,59
L1d cache (32 KB) shared with logical processors: 0,48
L1d cache (32 KB) shared with logical processors: 1,49
L1d cache (32 KB) shared with logical processors: 2,50
L1d cache (32 KB) shared with logical processors: 3,51
L1d cache (32 KB) shared with logical processors: 4,52
L1d cache (32 KB) shared with logical processors: 5,53
L1d cache (32 KB) shared with logical processors: 6,54
L1d cache (32 KB) shared with logical processors: 7,55
L1d cache (32 KB) shared with logical processors: 8,56
L1d cache (32 KB) shared with logical processors: 9,57
L1d cache (32 KB) shared with logical processors: 10,58
L1d cache (32 KB) shared with logical processors: 11,59
L2 cache (2048 KB) shared with logical processors: 0,48
L2 cache (2048 KB) shared with logical processors: 1,49
L2 cache (2048 KB) shared with logical processors: 2,50
L2 cache (2048 KB) shared with logical processors: 3,51
L2 cache (2048 KB) shared with logical processors: 4,52
L2 cache (2048 KB) shared with logical processors: 5,53
L2 cache (2048 KB) shared with logical processors: 6,54
L2 cache (2048 KB) shared with logical processors: 7,55
L2 cache (2048 KB) shared with logical processors: 8,56
L2 cache (2048 KB) shared with logical processors: 9,57
L2 cache (2048 KB) shared with logical processors: 10,58
L2 cache (2048 KB) shared with logical processors: 11,59
L3 cache (61440 KB) shared with logical processors: 0,1,2,3,4,5,6,7,8,9,10,11,48,49,50,51,52,53,54,55,56,57,58,59
node #1 (12 cores)
L1i cache (32 KB) shared with logical processors: 12,60
L1i cache (32 KB) shared with logical processors: 13,61
L1i cache (32 KB) shared with logical processors: 14,62
L1i cache (32 KB) shared with logical processors: 15,63
L1i cache (32 KB) shared with logical processors: 16,64
L1i cache (32 KB) shared with logical processors: 17,65
L1i cache (32 KB) shared with logical processors: 18,66
L1i cache (32 KB) shared with logical processors: 19,67
L1i cache (32 KB) shared with logical processors: 20,68
L1i cache (32 KB) shared with logical processors: 21,69
L1i cache (32 KB) shared with logical processors: 22,70
L1i cache (32 KB) shared with logical processors: 23,71
L1d cache (32 KB) shared with logical processors: 12,60
L1d cache (32 KB) shared with logical processors: 13,61
L1d cache (32 KB) shared with logical processors: 14,62
L1d cache (32 KB) shared with logical processors: 15,63
L1d cache (32 KB) shared with logical processors: 16,64
L1d cache (32 KB) shared with logical processors: 17,65
L1d cache (32 KB) shared with logical processors: 18,66
L1d cache (32 KB) shared with logical processors: 19,67
L1d cache (32 KB) shared with logical processors: 20,68
L1d cache (32 KB) shared with logical processors: 21,69
L1d cache (32 KB) shared with logical processors: 22,70
L1d cache (32 KB) shared with logical processors: 23,71
L2 cache (2048 KB) shared with logical processors: 12,60
L2 cache (2048 KB) shared with logical processors: 13,61
L2 cache (2048 KB) shared with logical processors: 14,62
L2 cache (2048 KB) shared with logical processors: 15,63
L2 cache (2048 KB) shared with logical processors: 16,64
L2 cache (2048 KB) shared with logical processors: 17,65
L2 cache (2048 KB) shared with logical processors: 18,66
L2 cache (2048 KB) shared with logical processors: 19,67
L2 cache (2048 KB) shared with logical processors: 20,68
L2 cache (2048 KB) shared with logical processors: 21,69
L2 cache (2048 KB) shared with logical processors: 22,70
L2 cache (2048 KB) shared with logical processors: 23,71
L3 cache (61440 KB) shared with logical processors: 12,13,14,15,16,17,18,19,20,21,22,23,60,61,62,63,64,65,66,67,68,69,70,71
node #2 (12 cores)
L1i cache (32 KB) shared with logical processors: 24,72
L1i cache (32 KB) shared with logical processors: 25,73
L1i cache (32 KB) shared with logical processors: 26,74
L1i cache (32 KB) shared with logical processors: 27,75
L1i cache (32 KB) shared with logical processors: 28,76
L1i cache (32 KB) shared with logical processors: 29,77
L1i cache (32 KB) shared with logical processors: 30,78
L1i cache (32 KB) shared with logical processors: 31,79
L1i cache (32 KB) shared with logical processors: 32,80
L1i cache (32 KB) shared with logical processors: 33,81
L1i cache (32 KB) shared with logical processors: 34,82
L1i cache (32 KB) shared with logical processors: 35,83
L1d cache (32 KB) shared with logical processors: 24,72
L1d cache (32 KB) shared with logical processors: 25,73
L1d cache (32 KB) shared with logical processors: 26,74
L1d cache (32 KB) shared with logical processors: 27,75
L1d cache (32 KB) shared with logical processors: 28,76
L1d cache (32 KB) shared with logical processors: 29,77
L1d cache (32 KB) shared with logical processors: 30,78
L1d cache (32 KB) shared with logical processors: 31,79
L1d cache (32 KB) shared with logical processors: 32,80
L1d cache (32 KB) shared with logical processors: 33,81
L1d cache (32 KB) shared with logical processors: 34,82
L1d cache (32 KB) shared with logical processors: 35,83
L2 cache (2048 KB) shared with logical processors: 24,72
L2 cache (2048 KB) shared with logical processors: 25,73
L2 cache (2048 KB) shared with logical processors: 26,74
L2 cache (2048 KB) shared with logical processors: 27,75
L2 cache (2048 KB) shared with logical processors: 28,76
L2 cache (2048 KB) shared with logical processors: 29,77
L2 cache (2048 KB) shared with logical processors: 30,78
L2 cache (2048 KB) shared with logical processors: 31,79
L2 cache (2048 KB) shared with logical processors: 32,80
L2 cache (2048 KB) shared with logical processors: 33,81
L2 cache (2048 KB) shared with logical processors: 34,82
L2 cache (2048 KB) shared with logical processors: 35,83
L3 cache (61440 KB) shared with logical processors: 24,25,26,27,28,29,30,31,32,33,34,35,72,73,74,75,76,77,78,79,80,81,82,83
node #3 (12 cores)
L1i cache (32 KB) shared with logical processors: 36,84
L1i cache (32 KB) shared with logical processors: 37,85
L1i cache (32 KB) shared with logical processors: 38,86
L1i cache (32 KB) shared with logical processors: 39,87
L1i cache (32 KB) shared with logical processors: 40,88
L1i cache (32 KB) shared with logical processors: 41,89
L1i cache (32 KB) shared with logical processors: 42,90
L1i cache (32 KB) shared with logical processors: 43,91
L1i cache (32 KB) shared with logical processors: 44,92
L1i cache (32 KB) shared with logical processors: 45,93
L1i cache (32 KB) shared with logical processors: 46,94
L1i cache (32 KB) shared with logical processors: 47,95
L1d cache (32 KB) shared with logical processors: 36,84
L1d cache (32 KB) shared with logical processors: 37,85
L1d cache (32 KB) shared with logical processors: 38,86
L1d cache (32 KB) shared with logical processors: 39,87
L1d cache (32 KB) shared with logical processors: 40,88
L1d cache (32 KB) shared with logical processors: 41,89
L1d cache (32 KB) shared with logical processors: 42,90
L1d cache (32 KB) shared with logical processors: 43,91
L1d cache (32 KB) shared with logical processors: 44,92
L1d cache (32 KB) shared with logical processors: 45,93
L1d cache (32 KB) shared with logical processors: 46,94
L1d cache (32 KB) shared with logical processors: 47,95
L2 cache (2048 KB) shared with logical processors: 36,84
L2 cache (2048 KB) shared with logical processors: 37,85
L2 cache (2048 KB) shared with logical processors: 38,86
L2 cache (2048 KB) shared with logical processors: 39,87
L2 cache (2048 KB) shared with logical processors: 40,88
L2 cache (2048 KB) shared with logical processors: 41,89
L2 cache (2048 KB) shared with logical processors: 42,90
L2 cache (2048 KB) shared with logical processors: 43,91
L2 cache (2048 KB) shared with logical processors: 44,92
L2 cache (2048 KB) shared with logical processors: 45,93
L2 cache (2048 KB) shared with logical processors: 46,94
L2 cache (2048 KB) shared with logical processors: 47,95
L3 cache (61440 KB) shared with logical processors: 36,37,38,39,40,41,42,43,44,45,46,47,84,85,86,87,88,89,90,91,92,93,94,95
Output of lscpu -e, whose L3 column matches the values returned by cat /sys/devices/system/cpu/cpu<x>/cache/index3/id:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 2901.0000 800.0000
1 0 0 1 1:1:1:0 yes 2901.0000 800.0000
2 0 0 2 2:2:2:0 yes 2901.0000 800.0000
3 0 0 3 3:3:3:0 yes 2901.0000 800.0000
4 0 0 4 4:4:4:0 yes 2901.0000 800.0000
5 0 0 5 5:5:5:0 yes 2901.0000 800.0000
6 0 0 6 6:6:6:0 yes 2901.0000 800.0000
7 0 0 7 7:7:7:0 yes 2901.0000 800.0000
8 0 0 8 8:8:8:0 yes 2901.0000 800.0000
9 0 0 9 9:9:9:0 yes 2901.0000 800.0000
10 0 0 10 10:10:10:0 yes 2901.0000 800.0000
11 0 0 11 11:11:11:0 yes 2901.0000 800.0000
12 1 0 12 12:12:12:0 yes 2901.0000 800.0000
13 1 0 13 13:13:13:0 yes 2901.0000 800.0000
14 1 0 14 14:14:14:0 yes 2901.0000 800.0000
15 1 0 15 15:15:15:0 yes 2901.0000 800.0000
16 1 0 16 16:16:16:0 yes 2901.0000 800.0000
17 1 0 17 17:17:17:0 yes 2901.0000 800.0000
18 1 0 18 18:18:18:0 yes 2901.0000 800.0000
19 1 0 19 19:19:19:0 yes 2901.0000 800.0000
20 1 0 20 20:20:20:0 yes 2901.0000 800.0000
21 1 0 21 21:21:21:0 yes 2901.0000 800.0000
22 1 0 22 22:22:22:0 yes 2901.0000 800.0000
23 1 0 23 23:23:23:0 yes 2901.0000 800.0000
24 2 1 24 24:24:24:1 yes 2901.0000 800.0000
25 2 1 25 25:25:25:1 yes 2901.0000 800.0000
26 2 1 26 26:26:26:1 yes 2901.0000 800.0000
27 2 1 27 27:27:27:1 yes 2901.0000 800.0000
28 2 1 28 28:28:28:1 yes 2901.0000 800.0000
29 2 1 29 29:29:29:1 yes 2901.0000 800.0000
30 2 1 30 30:30:30:1 yes 2901.0000 800.0000
31 2 1 31 31:31:31:1 yes 2901.0000 800.0000
32 2 1 32 32:32:32:1 yes 2901.0000 800.0000
33 2 1 33 33:33:33:1 yes 2901.0000 800.0000
34 2 1 34 34:34:34:1 yes 2901.0000 800.0000
35 2 1 35 35:35:35:1 yes 2901.0000 800.0000
36 3 1 36 36:36:36:1 yes 2901.0000 800.0000
37 3 1 37 37:37:37:1 yes 2901.0000 800.0000
38 3 1 38 38:38:38:1 yes 2901.0000 800.0000
39 3 1 39 39:39:39:1 yes 2901.0000 800.0000
40 3 1 40 40:40:40:1 yes 2901.0000 800.0000
41 3 1 41 41:41:41:1 yes 2901.0000 800.0000
42 3 1 42 42:42:42:1 yes 2901.0000 800.0000
43 3 1 43 43:43:43:1 yes 2901.0000 800.0000
44 3 1 44 44:44:44:1 yes 2901.0000 800.0000
45 3 1 45 45:45:45:1 yes 2901.0000 800.0000
46 3 1 46 46:46:46:1 yes 2901.0000 800.0000
47 3 1 47 47:47:47:1 yes 2901.0000 800.0000
48 0 0 0 0:0:0:0 yes 2901.0000 800.0000
49 0 0 1 1:1:1:0 yes 2901.0000 800.0000
50 0 0 2 2:2:2:0 yes 2901.0000 800.0000
51 0 0 3 3:3:3:0 yes 2901.0000 800.0000
52 0 0 4 4:4:4:0 yes 2901.0000 800.0000
53 0 0 5 5:5:5:0 yes 2901.0000 800.0000
54 0 0 6 6:6:6:0 yes 2901.0000 800.0000
55 0 0 7 7:7:7:0 yes 2901.0000 800.0000
56 0 0 8 8:8:8:0 yes 2901.0000 800.0000
57 0 0 9 9:9:9:0 yes 2901.0000 800.0000
58 0 0 10 10:10:10:0 yes 2901.0000 800.0000
59 0 0 11 11:11:11:0 yes 2901.0000 800.0000
60 1 0 12 12:12:12:0 yes 2901.0000 800.0000
61 1 0 13 13:13:13:0 yes 2901.0000 800.0000
62 1 0 14 14:14:14:0 yes 2901.0000 800.0000
63 1 0 15 15:15:15:0 yes 2901.0000 800.0000
64 1 0 16 16:16:16:0 yes 2901.0000 800.0000
65 1 0 17 17:17:17:0 yes 2901.0000 800.0000
66 1 0 18 18:18:18:0 yes 2901.0000 800.0000
67 1 0 19 19:19:19:0 yes 2901.0000 800.0000
68 1 0 20 20:20:20:0 yes 2901.0000 800.0000
69 1 0 21 21:21:21:0 yes 2901.0000 800.0000
70 1 0 22 22:22:22:0 yes 2901.0000 800.0000
71 1 0 23 23:23:23:0 yes 2901.0000 800.0000
72 2 1 24 24:24:24:1 yes 2901.0000 800.0000
73 2 1 25 25:25:25:1 yes 2901.0000 800.0000
74 2 1 26 26:26:26:1 yes 2901.0000 800.0000
75 2 1 27 27:27:27:1 yes 2901.0000 800.0000
76 2 1 28 28:28:28:1 yes 2901.0000 800.0000
77 2 1 29 29:29:29:1 yes 2901.0000 800.0000
78 2 1 30 30:30:30:1 yes 2901.0000 800.0000
79 2 1 31 31:31:31:1 yes 2901.0000 800.0000
80 2 1 32 32:32:32:1 yes 2901.0000 800.0000
81 2 1 33 33:33:33:1 yes 2901.0000 800.0000
82 2 1 34 34:34:34:1 yes 2901.0000 800.0000
83 2 1 35 35:35:35:1 yes 2901.0000 800.0000
84 3 1 36 36:36:36:1 yes 2901.0000 800.0000
85 3 1 37 37:37:37:1 yes 2901.0000 800.0000
86 3 1 38 38:38:38:1 yes 2901.0000 800.0000
87 3 1 39 39:39:39:1 yes 2901.0000 800.0000
88 3 1 40 40:40:40:1 yes 2901.0000 800.0000
89 3 1 41 41:41:41:1 yes 2901.0000 800.0000
90 3 1 42 42:42:42:1 yes 2901.0000 800.0000
91 3 1 43 43:43:43:1 yes 2901.0000 800.0000
92 3 1 44 44:44:44:1 yes 2901.0000 800.0000
93 3 1 45 45:45:45:1 yes 2901.0000 800.0000
94 3 1 46 46:46:46:1 yes 2901.0000 800.0000
95 3 1 47 47:47:47:1 yes 2901.0000 800.0000
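The same grouping can be read straight from sysfs. The following is a minimal sketch, not ghw code: it assumes index3 is the L3 cache (as on this machine) and that CPUs 0-95 exist, and the helper names l3IDPath and groupByL3ID are made up for illustration. The file reader is passed in as a parameter so the grouping logic can be exercised without a real /sys tree.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// l3IDPath returns the sysfs file holding the L3 cache id of a logical CPU.
// Assumption: index3 is the L3 cache, as on the machine in this report.
func l3IDPath(cpu int) string {
	return fmt.Sprintf("/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu)
}

// groupByL3ID reads the L3 cache id of every CPU through readFile and
// returns a map from cache id to the CPUs that share it, in input order.
// CPUs whose id file cannot be read or parsed are skipped.
func groupByL3ID(cpus []int, readFile func(string) ([]byte, error)) map[int][]int {
	groups := make(map[int][]int)
	for _, cpu := range cpus {
		data, err := readFile(l3IDPath(cpu))
		if err != nil {
			continue
		}
		id, err := strconv.Atoi(strings.TrimSpace(string(data)))
		if err != nil {
			continue
		}
		groups[id] = append(groups[id], cpu)
	}
	return groups
}

func main() {
	// 96 logical CPUs (0-95), as reported by lscpu in this report.
	cpus := make([]int, 96)
	for i := range cpus {
		cpus[i] = i
	}
	groups := groupByL3ID(cpus, os.ReadFile)

	// Print the groups in ascending cache id order.
	ids := make([]int, 0, len(groups))
	for id := range groups {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	for _, id := range ids {
		fmt.Printf("L3 id %d: logical processors %v\n", id, groups[id])
	}
}
```

On a machine matching the table above, this should print two groups (ids 0 and 1) rather than four.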
This appears to happen because the code only runs its cache-matching logic within each node, so a cache shared across nodes is never recognized as the same cache. Could we solve this by adding an id to the cache type and adding merge logic?
// Inspect the caches for each logical processor. There will be a
// /sys/devices/system/node/nodeX/cpuX/cache directory containing a
// number of directories beginning with the prefix "index" followed by
// a number. The number indicates the level of the cache, which
// indicates the "distance" from the processor. Each of these
// directories contains information about the size of that level of
// cache and the processors mapped to it.
cachePath := filepath.Join(cpuPath, "cache")
if _, err = os.Stat(cachePath); errors.Is(err, os.ErrNotExist) {
	continue
}
cacheDirFiles, err := os.ReadDir(cachePath)
if err != nil {
	return nil, err
}
for _, cacheDirFile := range cacheDirFiles {
	cacheDirFileName := cacheDirFile.Name()
	if !strings.HasPrefix(cacheDirFileName, "index") {
		continue
	}
	cacheIndex, _ := strconv.Atoi(cacheDirFileName[5:])
	// The cache information is repeated for each node, so here, we
	// just ensure that we only have a one Cache object for each
	// unique combination of level, type and processor map
	level := memoryCacheLevel(ctx, paths, nodeID, lpID, cacheIndex)
	cacheType := memoryCacheType(ctx, paths, nodeID, lpID, cacheIndex)
	sharedCpuMap := memoryCacheSharedCPUMap(ctx, paths, nodeID, lpID, cacheIndex)
	cacheKey := fmt.Sprintf("%d-%d-%s", level, cacheType, sharedCpuMap)
	cache, exists := caches[cacheKey]
	if !exists {
		size := memoryCacheSize(ctx, paths, nodeID, lpID, level)
		cache = &Cache{
			Level:             uint8(level),
			Type:              cacheType,
			SizeBytes:         uint64(size) * uint64(unitutil.KB),
			LogicalProcessors: make([]uint32, 0),
		}
		caches[cacheKey] = cache
	}
	cache.LogicalProcessors = append(
		cache.LogicalProcessors,
		uint32(lpID),
	)
}
The current output gives the impression that the CPU has four L3 caches, which is misleading: lscpu -e reports only two L3 cache ids (0 and 1).
I want to add an id field to the memory.Cache structure and populate it from /sys/devices/system/node/node<x>/cpu<x>/cache/index<x>/id. With the cache id available, users can tell that node0 and node1 share the same cache, and I can merge the entries myself in my application.
I think it is reasonable to keep the current logical processor information unchanged and simply add the cache id field. I could submit a PR to implement it.
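To illustrate the user-side merge, here is a rough sketch under stated assumptions. The cacheInfo type is a hypothetical stand-in for memory.Cache with the proposed ID field added (it is not ghw's actual definition, and Type is a plain string here), and mergeByID is a made-up helper. It treats entries with the same level, type and id as one physical cache and concatenates their logical processors.

```go
package main

import (
	"fmt"
	"sort"
)

// cacheInfo is a hypothetical stand-in for ghw's memory.Cache with the
// proposed ID field added. It is not the library's actual type.
type cacheInfo struct {
	Level             uint8
	Type              string
	ID                int
	SizeBytes         uint64
	LogicalProcessors []uint32
}

// cacheKey identifies one physical cache: same level, type and id.
type cacheKey struct {
	level uint8
	typ   string
	id    int
}

// mergeByID combines per-node cache entries that describe the same
// physical cache, keeping the order in which each cache first appears.
func mergeByID(perNode []cacheInfo) []cacheInfo {
	merged := make(map[cacheKey]*cacheInfo)
	var order []cacheKey
	for _, c := range perNode {
		k := cacheKey{c.Level, c.Type, c.ID}
		m, ok := merged[k]
		if !ok {
			m = &cacheInfo{Level: c.Level, Type: c.Type, ID: c.ID, SizeBytes: c.SizeBytes}
			merged[k] = m
			order = append(order, k)
		}
		m.LogicalProcessors = append(m.LogicalProcessors, c.LogicalProcessors...)
	}
	out := make([]cacheInfo, 0, len(order))
	for _, k := range order {
		m := merged[k]
		lps := m.LogicalProcessors
		sort.Slice(lps, func(i, j int) bool { return lps[i] < lps[j] })
		out = append(out, *m)
	}
	return out
}

func main() {
	// Abbreviated per-node L3 entries: one core per node, with the cache
	// ids that lscpu -e reports for those CPUs in this issue.
	const l3Size = 61440 * 1024
	perNode := []cacheInfo{
		{Level: 3, Type: "Unified", ID: 0, SizeBytes: l3Size, LogicalProcessors: []uint32{0, 48}},
		{Level: 3, Type: "Unified", ID: 0, SizeBytes: l3Size, LogicalProcessors: []uint32{12, 60}},
		{Level: 3, Type: "Unified", ID: 1, SizeBytes: l3Size, LogicalProcessors: []uint32{24, 72}},
		{Level: 3, Type: "Unified", ID: 1, SizeBytes: l3Size, LogicalProcessors: []uint32{36, 84}},
	}
	for _, c := range mergeByID(perNode) {
		fmt.Printf("L%d cache id %d: logical processors %v\n", c.Level, c.ID, c.LogicalProcessors)
	}
}
```

The per-node entries stay untouched, so existing consumers of the logical processor lists would not be affected by the extra field.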
Hi @LavenderQAQ thanks so much for submitting this issue and digging into the code to find the root cause! I would welcome any PR you could put together and will review it ASAP.
Thanks @LavenderQAQ for the detailed report. IIUC/IIRC the CPU in question has 2 physical L3 blocks and sub-NUMA clustering enabled, so there are 4 logical NUMA nodes, and each pair of them uses 1 of the aforementioned 2 physical L3 blocks. Is my understanding accurate enough?
@ffromani Exactly. So I want to add the cache id field to make it easier to distinguish which NUMA nodes use the same cache.
@LavenderQAQ makes sense to me, thanks. So it seems ghw does not properly handle the sub-NUMA clustering feature (or AMD's equivalent). If you want to submit a PR, that would be most welcome and we will surely review it.
@ffromani Thank you very much. I plan to submit a PR in the near future that adds the cache id information, based on /sys/devices/system/node/node<x>/cpu<x>/cache/index<x>/id. I need to work on it outside of working hours, so it will take some time.
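As a quick way to see which NUMA nodes share an L3 block, the rows of lscpu -e can be reduced to a node-to-L3 mapping. This is only a sketch: nodesByL3 is a hypothetical helper, and it assumes the column layout shown in this issue (NODE in the second column, cache ids in the fifth, with the L3 id last), which may differ between lscpu versions.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// nodesByL3 parses rows in the lscpu -e layout shown in this issue
// (CPU NODE SOCKET CORE L1d:L1i:L2:L3 ...) and returns, for each L3
// cache id, the sorted NUMA nodes whose CPUs use it. Rows that do not
// parse, such as the header, are skipped.
func nodesByL3(rows []string) map[int][]int {
	seen := make(map[int]map[int]bool)
	for _, row := range rows {
		fields := strings.Fields(row)
		if len(fields) < 5 {
			continue
		}
		node, err := strconv.Atoi(fields[1])
		if err != nil {
			continue
		}
		ids := strings.Split(fields[4], ":")
		l3, err := strconv.Atoi(ids[len(ids)-1])
		if err != nil {
			continue
		}
		if seen[l3] == nil {
			seen[l3] = make(map[int]bool)
		}
		seen[l3][node] = true
	}
	out := make(map[int][]int, len(seen))
	for l3, nodes := range seen {
		for n := range nodes {
			out[l3] = append(out[l3], n)
		}
		sort.Ints(out[l3])
	}
	return out
}

func main() {
	// Header plus one row per NUMA node, copied from the lscpu -e output
	// in this issue.
	rows := []string{
		"CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ",
		"0 0 0 0 0:0:0:0 yes 2901.0000 800.0000",
		"12 1 0 12 12:12:12:0 yes 2901.0000 800.0000",
		"24 2 1 24 24:24:24:1 yes 2901.0000 800.0000",
		"36 3 1 36 36:36:36:1 yes 2901.0000 800.0000",
	}
	byL3 := nodesByL3(rows)
	for _, l3 := range []int{0, 1} {
		fmt.Printf("L3 id %d is used by NUMA nodes %v\n", l3, byL3[l3])
	}
	// Prints:
	// L3 id 0 is used by NUMA nodes [0 1]
	// L3 id 1 is used by NUMA nodes [2 3]
}
```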