hwloc
hwloc copied to clipboard
hwloc: Add weighted interleave support
Since a new memory policy MPOL_WEIGHTED_INTERLEAVE is added at [1], hwloc needs to support this flag.
This new flag is expected to be released from linux-v6.9.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa3bea4e1f8202d787709b7e3654eb0a99aed758 Signed-off-by: Honggyu Kim [email protected]
Hi, this is my first time contribution to this project so please let me know if there is anything to fix. Thanks.
Hello. Thanks for the reminder. I followed early versions of these patches but forgot about it when it became close to ready for inclusion. Given that this interface is supposed to interleave in a more clever way, should we use it by default for hwloc's interleave policy when supported? As long as the kernel doesn't set buggy weights on nodes, it should work fine, right?
Hi @bgoglin, thanks for the quick response.
should we use it by default for hwloc's interleave policy when supported?
IMHO, MPOL_WEIGHTED_INTERLEAVE can do the same behavior as MPOL_INTERLEAVE when the weight ratio is set to 1 for all the nodes. But this flag is yet another new flag so it's better not to change the default policy.
As long as the kernel doesn't set buggy weights on nodes, it should work fine, right?
The weight values are set to 1 for all the nodes by default unless a system admin changes the values at /sys/kernel/mm/mempolicy/weighted_interleave/node*, which was introduced by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dce41f5ae2539d1c20ae8de4e039630aec3c3f3c.
And yeah, it would work fine when weights have sane values.
When MPOL_PREFERRED_MANY was added, hwloc just used it instead of MPOL_PREFERRED because it was supposedly better in all cases. The situation isn't exactly the same here, but we could still envision using WEIGHTED by default in hwloc since it doesn't break the current specification of hwloc's INTERLEAVE flag (we could also switch back to the old MPOL_INTERLEAVE when the HWLOC_MEMBIND_STRICT flag is given). I need to think more about it. One thing to keep in mind is the future extension of the mbind() syscall for custom interleaving. This won't fit in hwloc's current API at all, we may have to add a Linux-specific hwloc binding call. I'd like to see better this whole picture before take a decision on this new WEIGHTED flag alone (but I still want to make it in hwloc 2.11 in the near future).
Hi @bgoglin,
When MPOL_PREFERRED_MANY was added, hwloc just used it instead of MPOL_PREFERRED because it was supposedly better in all cases.
Do you mean by 6abf03dd20caa2e3d0231ac0fcebf43d6acdd294?
The situation isn't exactly the same here, but we could still envision using WEIGHTED by default in hwloc since it doesn't break the current specification of hwloc's INTERLEAVE flag (we could also switch back to the old MPOL_INTERLEAVE when the HWLOC_MEMBIND_STRICT flag is given). I need to think more about it.
MPOL_WEIGHTED_INTERLEAVE works with a single global policy written at sysfs. If this isn't a problem in hwloc, then I think it can be fine to make it for default interleave policy. But I don't have much experience on hwloc project yet so please take your time for the further consideration before making decision.
One thing to keep in mind is the future extension of the mbind() syscall for custom interleaving. This won't fit in hwloc's current API at all, we may have to add a Linux-specific hwloc binding call. I'd like to see better this whole picture before take a decision on this new WEIGHTED flag alone (but I still want to make it in hwloc 2.11 in the near future).
It looks like you mean vma range weighted interleaving. It was previously suggested at https://lore.kernel.org/linux-mm/[email protected] with a mbind2 syscall but it requires to convince more maintainers why this is needed with more practical use cases. I think it won't be accepted anytime soon so we don't have to worry about it for now.
Hello. I need to merge this and release hwloc 2.11. I think I finally decided to keep your additional WEIGHTED_INTERLEAVE flag instead of merging it into the existing INTERLEAVE. I am going to add "support" flag saying whether WEIGHTED_INTERLEAVE is supported on the current system. In practice it will be set on Linux and unset everywhere else. But it would be nice to detect at runtime whether the kernel really supports it. Do you see a way to do that without calling mbind() on a dummy page? Is there a new sysfs file that exposes node weights for instance?
I pushed additional changes in https://github.com/bgoglin/hwloc/commits/honggyukim-weighted-interleave/ The first one are minor changes that need to be added to your commit. The second and last one can remain separated. Feel free to integrate them in your PR or I'll merge my branch directly.
Hi @bgoglin,
Hello. I need to merge this and release hwloc 2.11. I think I finally decided to keep your additional WEIGHTED_INTERLEAVE flag instead of merging it into the existing INTERLEAVE.
Glad to hear that :)
But it would be nice to detect at runtime whether the kernel really supports it. Do you see a way to do that without calling mbind() on a dummy page? Is there a new sysfs file that exposes node weights for instance?
Yeah, the new sysfs interface is created at /sys/kernel/mm/mempolicy/weighted_interleave so you can check whether the system has this directory to check weighted interleaving feature. Inside the directory, there are multiple nodeN files that contain numerical weight values.
$ ls /sys/kernel/mm/mempolicy/weighted_interleave
node0 node1 node2 node3
I pushed additional changes in https://github.com/bgoglin/hwloc/commits/honggyukim-weighted-interleave/ The first one are minor changes that need to be added to your commit. The second and last one can remain separated. Feel free to integrate them in your PR or I'll merge my branch directly.
Thanks, I will integrate your patches into my PR by tomorrow.
Yeah, the new sysfs interface is created at
/sys/kernel/mm/mempolicy/weighted_interleaveso you can check whether the system has this directory to check weighted interleaving feature. Inside the directory, there are multiplenodeNfiles that contain numerical weight values.
Thanks, I added this to my commits.
Hi @bgoglin, thanks for waiting. I've just slightly modified commit messages and pushed your patches into this PR after rebase onto the current master. Please have a look. Thanks!
Thanks, I am backporting this into the v2.x branch and will release 2.11 with it, likely next week.
Thanks, I am backporting this into the v2.x branch and will release 2.11 with it, likely next week.
Thanks. That would be great!
I am posting 2.11rc1 right now with thes changes.
That's great! Thanks for the update.