
32bit domain support

Open bgoglin opened this issue 6 years ago • 17 comments

Some virtual environments (at least Microsoft Azure) expose PCI domains larger than 16 bits to make sure they don't conflict with ACPI PCI domains (which are 16-bit) when using PCI passthrough, etc. Linux, pciaccess, and pciutils have supported this since 2016/2017. Unfortunately, we could have changed the PCI device attributes in 2.0; now it'll be harder.

In the meantime, we warn whenever we meet a domain number larger than 16 bits.

Hopefully this only matters for virtual devices inside the VM, where locality doesn't matter, so it's not too bad if we don't show these devices.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=28ef241f05c20c2f04b349d43c615438f7e6f811

https://gitlab.freedesktop.org/xorg/lib/libpciaccess/commit/a167bd6474522a709ff3cbb00476c0e4309cb66f

https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/commit/?id=4186391b43bf101fda08e92185359c94812e2c3d https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/commit/?id=ab61451d47514c473953a24aa4f4f816b77ade56
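
To make the truncation concrete, here is a minimal standalone C sketch (the struct and names below are illustrative, not copied from hwloc's headers): a 16-bit domain field silently drops the upper bits of any domain the OS reports above 0xffff, and widening that field changes the public attribute layout, which is the ABI break mentioned above.

#include <stdio.h>

/* Illustrative PCI bus-id holder with a 16-bit domain field, in the spirit of
   hwloc's public PCI attributes but not copied from hwloc.h. */
struct pci_busid {
    unsigned short domain;      /* 16 bits: too narrow for domains >= 0x10000 */
    unsigned char bus, dev, func;
};

int main(void)
{
    unsigned int os_domain = 0x10000;        /* domain as reported by the OS */
    struct pci_busid id = { 0, 0, 0, 0 };
    id.domain = (unsigned short) os_domain;  /* upper bits silently dropped: 0x0000 */
    printf("OS domain 0x%x stored as 0x%x\n", os_domain, id.domain);
    /* 10000:00:00.0 is now indistinguishable from 0000:00:00.0 */
    return 0;
}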

bgoglin avatar Mar 11 '19 21:03 bgoglin

FWIW, we have a bare-metal machine (Dell 7920, Xeon Gold 6148) that for some reason (we're not sure why, maybe some BIOS settings) is reporting PCI domains of 0x10000 and 0x10001 (at least in CentOS 7), which is causing hwloc to crash (in the bus-id compare, on finding identical devices).

dylex avatar Jan 21 '20 17:01 dylex

@dylex Which hwloc version are you using? 2.0.4 and 2.1.0 are supposed to just ignore such PCI domains. But I am not sure things should crash anyway. Can you post the lspci output?

bgoglin avatar Jan 21 '20 17:01 bgoglin

This was hwloc 1.11.8 and 1.11.9. I haven't tried 2. I just wanted to note that these can apparently show up in other cases. The specific crash was here: https://github.com/open-mpi/hwloc/blob/hwloc-1.11.9/src/pci-common.c#L125 Here's a partial lspci:

0000:00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
0000:00:1c.0 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #1 (rev f9)
0000:00:1c.7 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #8 (rev f9)
0000:00:1f.0 ISA bridge: Intel Corporation C621 Series Chipset LPC/eSPI Controller (rev 09)
0000:16:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
0000:72:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
10000:00:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
10000:00:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port C (rev 04)
10000:00:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D (rev 04)
10001:00:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
10001:00:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 04)

From gdb, it's comparing two Root Port A devices at the crash (though since the lower 16 bits are different, I'm not exactly sure why):

#4  0x00007f489f160c3e in hwloc_pci_compare_busids (b=<optimized out>, b=<optimized out>, a=<optimized out>, a=<optimized out>) at pci-common.c:125
#5  0x00007f489f160d40 in hwloc_pci_add_object (root=root@entry=0x7ffe1cc3fe30, new=0x55cb40ce1410) at pci-common.c:197
#6  0x00007f489f160f9c in hwloc_insert_pci_device_list (backend=0x55cb40bc8920, first_obj=0x55cb40ce1680) at pci-common.c:355
(gdb) print *current
$3 = {type = HWLOC_OBJ_BRIDGE, os_index = 4293918720, name = 0x55cb40ce0a10 "Intel Corporation Sky Lake-E PCI Express Root Port A", memory = {total_memory = 0, local_memory = 0, page_types_len = 0,
    page_types = 0x0}, attr = 0x55cb40ce08c0, depth = 0, logical_index = 0, os_level = -1, next_cousin = 0x0, prev_cousin = 0x0, parent = 0x7ffe1cc3fe30, sibling_rank = 0,
  next_sibling = 0x55cb40ce0a50, prev_sibling = 0x55cb40ce0530, arity = 0, children = 0x0, first_child = 0x55cb40ce0f70, last_child = 0x55cb40ce0f70, userdata = 0x0, cpuset = 0x0,
  complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0, nodeset = 0x0, complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x55cb40ce08f0,
  infos_count = 2, symmetric_subtree = 0}
(gdb) print *new
$4 = {type = HWLOC_OBJ_BRIDGE, os_index = 4293918720, name = 0x55cb40ce1640 "Intel Corporation Sky Lake-E PCI Express Root Port A", memory = {total_memory = 0, local_memory = 0, page_types_len = 0,
    page_types = 0x0}, attr = 0x55cb40ce1510, depth = 0, logical_index = 0, os_level = -1, next_cousin = 0x0, prev_cousin = 0x0, parent = 0x0, sibling_rank = 0, next_sibling = 0x55cb40ce1680,
  prev_sibling = 0x0, arity = 0, children = 0x0, first_child = 0x0, last_child = 0x0, userdata = 0x0, cpuset = 0x0, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0, nodeset = 0x0,
  complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x55cb40ce1540, infos_count = 2, symmetric_subtree = 0}
(gdb) print *current->attr
$5 = {cache = {size = 433471464134475775, depth = 540049542, linesize = 0, associativity = 4, type = HWLOC_OBJ_CACHE_UNIFIED}, group = {depth = 65535}, pcidev = {domain = 65535, bus = 0 '\000',
    dev = 0 '\000', func = 0 '\000', class_id = 1540, vendor_id = 32902, device_id = 8240, subvendor_id = 0, subdevice_id = 0, revision = 4 '\004', linkspeed = 0}, bridge = {upstream = {pci = {
        domain = 65535, bus = 0 '\000', dev = 0 '\000', func = 0 '\000', class_id = 1540, vendor_id = 32902, device_id = 8240, subvendor_id = 0, subdevice_id = 0, revision = 4 '\004', linkspeed = 0}},
    upstream_type = HWLOC_OBJ_BRIDGE_PCI, downstream = {pci = {domain = 65535, secondary_bus = 1 '\001', subordinate_bus = 1 '\001'}}, downstream_type = HWLOC_OBJ_BRIDGE_PCI, depth = 0}, osdev = {
    type = 65535}}
(gdb) print *new->attr
$6 = {cache = {size = 433471464134475775, depth = 540049542, linesize = 0, associativity = 4, type = HWLOC_OBJ_CACHE_UNIFIED}, group = {depth = 65535}, pcidev = {domain = 65535, bus = 0 '\000',
    dev = 0 '\000', func = 0 '\000', class_id = 1540, vendor_id = 32902, device_id = 8240, subvendor_id = 0, subdevice_id = 0, revision = 4 '\004', linkspeed = 0}, bridge = {upstream = {pci = {
        domain = 65535, bus = 0 '\000', dev = 0 '\000', func = 0 '\000', class_id = 1540, vendor_id = 32902, device_id = 8240, subvendor_id = 0, subdevice_id = 0, revision = 4 '\004', linkspeed = 0}},
    upstream_type = HWLOC_OBJ_BRIDGE_PCI, downstream = {pci = {domain = 65535, secondary_bus = 1 '\001', subordinate_bus = 1 '\001'}}, downstream_type = HWLOC_OBJ_BRIDGE_PCI, depth = 0}, osdev = {
    type = 65535}}

dylex avatar Jan 21 '20 18:01 dylex

Thanks, it's crashing because the Root Ports' entire PCI bus IDs become identical once the domain is shortened to 16 bits (0x10000 = 0x0), which hwloc considers a buggy PCI report. 2.0.4 and 2.1 should avoid the crash by ignoring the 0x10000 domain entirely. Now I need to ask Dell and/or Intel what this secondary domain is about.
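
To illustrate why that ends up looking like "identical devices" (a sketch only, not the actual hwloc_pci_compare_busids() code in pci-common.c): once two distinct OS domains truncate to the same 16-bit value, a domain/bus/dev/func comparison can no longer tell the devices apart.

#include <stdio.h>

struct busid { unsigned short domain; unsigned char bus, dev, func; };

/* Lexicographic domain/bus/dev/func comparison, in the spirit of the check in
   pci-common.c but not copied from it. */
static int compare_busids(const struct busid *a, const struct busid *b)
{
    if (a->domain != b->domain) return a->domain < b->domain ? -1 : 1;
    if (a->bus    != b->bus)    return a->bus    < b->bus    ? -1 : 1;
    if (a->dev    != b->dev)    return a->dev    < b->dev    ? -1 : 1;
    if (a->func   != b->func)   return a->func   < b->func   ? -1 : 1;
    return 0; /* equal bus-ids look like a broken PCI report */
}

int main(void)
{
    /* 0000:00:00.0 and 10000:00:00.0 after truncating the domain to 16 bits */
    struct busid a = { (unsigned short) 0x00000, 0, 0, 0 };
    struct busid b = { (unsigned short) 0x10000, 0, 0, 0 }; /* 0x10000 becomes 0 */
    printf("compare = %d\n", compare_busids(&a, &b));       /* prints 0 */
    return 0;
}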

Do these commands report something different? I am trying to see if there could be one domain per socket or per SNC or something like that.

$ cat /sys/bus/pci/devices/0*:*/local_cpus
$ cat /sys/bus/pci/devices/1*:*/local_cpus

bgoglin avatar Jan 21 '20 18:01 bgoglin

There are 2 sockets, 2 NUMA domains. All of the 0x0 and 0x10000 domain bridges are NUMA node 0:

00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0fffff00,000fffff

while the 0x10001 bridges are NUMA node 1:

00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000ffff,f00000ff,fff00000

However, there are other (non-bridge) devices (from 0000:a0:04.0 through 0000:d4:17.0) that also claim to be on NUMA node 1. There is one actual device that shows up on the 0x10000 domain, an NVMe controller:

10000:01:00.0 Non-Volatile memory controller: Toshiba America Info Systems Device 0116

I can attach a complete lspci -v and dmidecode if that's helpful. We're also confused by this since other machines we thought were identical only have domain 0. I'm about to upgrade the BIOS on this machine and check the BIOS settings for anything weird as well.

dylex avatar Jan 21 '20 18:01 dylex

Thanks, the full dmidecode and lspci outputs wouldn't help much. I don't know if domain IDs are assigned by the BIOS or not. I am going to query Dell about this. I can't find any relevant information about PCI domains in the revisions of your BIOS.

bgoglin avatar Jan 21 '20 21:01 bgoglin

After some poking around, it turns out these were coming from VMD, Intel's Volume Management Device, which creates a virtual PCI bus hanging off of a storage controller. Completely disabling VMD in the BIOS made them disappear.

dylex avatar Jan 22 '20 20:01 dylex

Thanks for the feedback. We cannot fix 32-bit domain support without breaking the ABI, unfortunately, so I am kind of happy that your case is related to something likely uncommon :/

bgoglin avatar Jan 23 '20 13:01 bgoglin

@dylex If you get a chance, I'd like to get the tarball generated by hwloc-gather-topology with --io on your machine with VMD enabled again. Sorry, I should have asked earlier, but we figured there could be a slightly better workaround for this issue. Unfortunately, we couldn't find a way to get such a PCI domain on a real machine or in a VM. I'm not sure you can post the tarball here; if not, send it to me directly.

bgoglin avatar Mar 05 '20 08:03 bgoglin

@bgoglin I haven't been able to get that exact machine since it's in use and we'd have to change the BIOS, but I found another machine with those same VMD devices enabled. However, this machine doesn't have any conflicts in the truncated IDs (so things don't crash), but otherwise looks similar. If this is useful I can email you the dumps.

dylex avatar Mar 09 '20 20:03 dylex

Yes please send it to me (brice.goglin at inria.fr). Thanks!

bgoglin avatar Mar 09 '20 20:03 bgoglin

hwloc 2.7.0 - Dell Latitude 3420

$ hwloc-info 
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
depth 0:           1 Machine (type #0)
 depth 1:          1 Package (type #1)
  depth 2:         1 L3Cache (type #6)
   depth 3:        4 L2Cache (type #5)
    depth 4:       4 L1dCache (type #4)
     depth 5:      4 L1iCache (type #9)
      depth 6:     4 Core (type #2)
       depth 7:    8 PU (type #3)
Special depth -3:  1 NUMANode (type #13)
Special depth -4:  2 Bridge (type #14)
Special depth -5:  4 PCIDev (type #15)
Special depth -6:  3 OSDev (type #16)

Some devices are reporting PCI domains of 0x10000


$ cat /sys/bus/pci/devices/0*:*/local_cpus                                                                               
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff
ff

$ cat /sys/bus/pci/devices/1*:*/local_cpus
ff
ff
ff
ff

Is the only solution to disable VMD in the BIOS?

noraj avatar Mar 02 '22 17:03 noraj

If you care about those devices, the solution is either to disable VMD in the BIOS, or to rebuild hwloc with 32-bit PCI domains (and rebuild apps that use the hwloc API). If you don't care about I/O locality, just ignore the warning (passing HWLOC_HIDE_ERRORS=2 in the environment should hide it).
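
For completeness, an application (rather than a shell) can do the same thing; a minimal sketch, assuming HWLOC_HIDE_ERRORS is read from the environment when the topology is loaded:

#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    /* 2 = hide everything, 1 = default, 0 = show all messages. */
    setenv("HWLOC_HIDE_ERRORS", "2", 1);

    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    /* ... use the topology; devices in >16-bit PCI domains are simply absent ... */
    hwloc_topology_destroy(topology);
    return 0;
}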

If laptops also start showing this warning, I need to seriously think about releasing a hwloc 3.0 to fix this properly.

bgoglin avatar Mar 02 '22 18:03 bgoglin

My use case is that john seems to use hwloc (to detect CPU/GPU, I guess) and displays this warning on every command. I added the environment variable to my ~/.zshrc to hide those warnings. Thanks.

noraj avatar Mar 03 '22 10:03 noraj

We are starting to get a good number of messages from users who run our "make check" and are worried about these messages; these are users who do not know about hwloc and barely know about MPI (which uses hwloc).

Would it be possible for you to rethink printing a warning message to stdout about this situation from your library, since it is now so common? The printed message is presumably of no use to almost all users.

BarrySmith avatar Jul 21 '22 15:07 BarrySmith

The current strategy is that most messages that are not critical are hidden by default (but displayed in lstopo), while the critical ones are shown by default. That's why we have HWLOC_HIDE_ERRORS=0 (lstopo, show all errors), =1 (default, what you have), and =2 (show nothing).

This was actually modified because of CUDA warnings flooding users like yours. CUDA warnings are now considered non-critical, because missing some GPU doesn't prevent hwloc from working. I think we could do the same for your PCI errors (some PCI devices would be missing from the hwloc topology, not a critical problem). I'll double-check and update the code if I don't find any problem.

FYI, we'll fix this mess properly in hwloc 3.0, but that won't happen for another year or so.

bgoglin avatar Jul 21 '22 15:07 bgoglin

I demoted 32-bit PCI domain errors to non-critical in master (and in the 2.8 branch, in case we ever release a 2.8.1).

bgoglin avatar Jul 25 '22 11:07 bgoglin