hwloc
[RFC] Add support for exporting latency, bandwidth topology through calibration
Currently, hwloc can export hardware and network locality for applications to query and use when setting their affinity. In many scenarios, however, the information provided by the topology is not enough: for example, it cannot reflect the actual memory latency and bandwidth between different scheduling domains. We hope to provide more detailed and precise information about hardware capabilities in hwloc by adding several new calibration tools, so that applications can adopt a more refined design, achieve higher performance, and fully tap the capabilities of the hardware.
We mainly focus on exposing memory/bus bandwidth, cache coherence/bus communication latency, etc. to users. This topology information has neither a standard ACPI nor a devicetree interface to export it, but it can be beneficial to user applications. Some examples:
- the memory bandwidth while we spread tasks between multiple clusters vs. gather them in one cluster
- the memory bandwidth while we spread tasks between multiple NUMA nodes vs. gather them in one NUMA node
- the cache synchronization latency while we spread tasks between multiple clusters vs. gather them in one cluster
- the cache synchronization latency while we spread tasks between multiple NUMA nodes vs. gather them in one NUMA node
- bus bandwidth and congestion in complex topologies. For example, in the topology node1 - node0 - node2 - node3, the bus between node0 and node2 might become a bottleneck because the communication between node1 and node3 also depends on it. NUMA distances can't describe this kind of complex bus topology at all.
- I/O bandwidth and latency while we access I/O devices such as accelerators, networks, and storage from the NUMA node the devices belong to vs. from different NUMA nodes. ...
If possible, we can also export more, such as IPC bandwidth and latency (for example, over a pipe), spinlock/mutex latency, etc. The calibration tools will provide these data about different entities at certain topology levels, so that applications can select a strategy for spreading or gathering threads according to this data.
The design of the calibration tool will be similar to netloc. Three steps are required to use the calibration tool.
The first step is to get data about system bandwidth, latency, etc. by running some benchmarks, since standard operating systems do not provide this information. The raw data will be saved in files. This step may need to be performed by a privileged user.
The second step is to use the calibration tool to convert the raw files generated in the previous step into a readable format. No privileges are required for this step.
In the third step, the application obtains the calibration information of the system through a C API exposed by the calibration tool, and hwloc commands can also be extended to show this new information. The source of the calibration data is the readable file generated in the second step. E.g. hwloc_get_mem_bandwidth(hwloc_topology_t topology, unsigned idx1, unsigned idx2) could be used to get the memory bandwidth between the objects at indexes idx1 and idx2 of some topology type.
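As an interface sketch only (this call does not exist in hwloc today; the name and signature come from the proposal above, and the return unit is an assumption), the third step could look like:

```c
/* Hypothetical third-step API -- not part of hwloc today.
 * Returns the calibrated memory bandwidth (e.g. in MiB/s) between the
 * objects at logical indexes idx1 and idx2 of some topology level,
 * read from the readable file produced in the second step. */
hwloc_uint64_t hwloc_get_mem_bandwidth(hwloc_topology_t topology,
                                       unsigned idx1, unsigned idx2);
```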
You're describing very different use cases. If the current APIs aren't enough, you may propose a new one, but I don't see how to design one that would tackle all of these. I wonder if you'll end up with random key/value pairs... Do you have an API in mind?
Anyway, there are already some ways to expose some of these performance metrics in the hwloc API.
- memory attribute API in hwloc/memattrs.h: It currently reads latency and bandwidth from the ACPI HMAT table when available, but it's user-extendable to expose whatever custom metric from whatever initiator object (usually a set of cores, but it could be an I/O object too) to whatever target object (usually a NUMA node, but we could allow caches too). This should cover all cases where "initiator" and "target" are relevant, like your cases (1), (2) and (6).
- the distance API in hwloc/distances.h for cases without any initiator/target: It's usually an array of bandwidths between NUMA nodes or GPUs, but it can be latency or anything else, between arbitrary objects. It may describe your (5), both latency and bandwidth, but without explicit information about congestion.
For both APIs, users may already specify a new metric with a name describing what it does, for instance "BandwidthWhenAllCoresFromClusterReadTwiceAndStoreOnceInSameBuffer".
Both these APIs can be used in C (and at least partially on the command-line) to add performance info to a hwloc topology in XML. Then you load the XML at runtime instead of discovering the native topology.
Anyway, we'd actually need users. Netloc failed because users said they wanted lots of information, like you do, but they didn't know what to do with that information at runtime in practice. The more information you have, the more difficult making placement decisions becomes.
For instance I don't know what I would do with information from your (5). I already know I need to keep related things together, I already have the distance matrix to tell me that node0 and node2 are close while node1 and node3 are far away. Having info about the congested bandwidth would tell me how bad performance would be if I don't place things properly, but that wouldn't change my placement algorithm.
By the way, netloc never had a calibration tool, and it's not clear we want a calibration tool inside hwloc, but having ways to add performance information to hwloc from external tools is very easy.
Thank you, the two sets of HWLOC APIs you mentioned are indeed enough to describe the information we need.
We currently have no precise idea how this information will be used. Overall, we hope to first let users establish this awareness, and then build something on that basis. Some people may have had no concept of these problems before; this way they can understand the problem and think about how to make better use of our hardware.
Adding HMAT to ACPI seems to be a similar idea: it gives hardware manufacturers a channel to provide the system with latency and bandwidth information about the system topology.
We are now designing some calibration tools. The current plan is to import the data we get into hwloc and display it through lstopo. In the end, however, we still hope this tool can be integrated into hwloc, so that the data can be queried natively in hwloc and more people can easily obtain this system information, which makes it easier to achieve our first goal.