hwloc icon indicating copy to clipboard operation
hwloc copied to clipboard

hwloc: Add weighted interleave support

Open honggyukim opened this issue 1 year ago • 5 comments

Since a new memory policy MPOL_WEIGHTED_INTERLEAVE is added at [1], hwloc needs to support this flag.

This new flag is expected to be released from linux-v6.9.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa3bea4e1f8202d787709b7e3654eb0a99aed758 Signed-off-by: Honggyu Kim [email protected]

honggyukim avatar Apr 19 '24 22:04 honggyukim

Hi, this is my first time contribution to this project so please let me know if there is anything to fix. Thanks.

honggyukim avatar Apr 19 '24 22:04 honggyukim

Hello. Thanks for the reminder. I followed early versions of these patches but forgot about it when it became close to ready for inclusion. Given that this interface is supposed to interleave in a more clever way, should we use it by default for hwloc's interleave policy when supported? As long as the kernel doesn't set buggy weights on nodes, it should work fine, right?

bgoglin avatar Apr 20 '24 18:04 bgoglin

Hi @bgoglin, thanks for the quick response.

should we use it by default for hwloc's interleave policy when supported?

IMHO, MPOL_WEIGHTED_INTERLEAVE can do the same behavior as MPOL_INTERLEAVE when the weight ratio is set to 1 for all the nodes. But this flag is yet another new flag so it's better not to change the default policy.

As long as the kernel doesn't set buggy weights on nodes, it should work fine, right?

The weight values are set to 1 for all the nodes by default unless a system admin changes the values at /sys/kernel/mm/mempolicy/weighted_interleave/node*, which was introduced by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dce41f5ae2539d1c20ae8de4e039630aec3c3f3c.

And yeah, it would work fine when weights have sane values.

honggyukim avatar Apr 21 '24 08:04 honggyukim

When MPOL_PREFERRED_MANY was added, hwloc just used it instead of MPOL_PREFERRED because it was supposedly better in all cases. The situation isn't exactly the same here, but we could still envision using WEIGHTED by default in hwloc since it doesn't break the current specification of hwloc's INTERLEAVE flag (we could also switch back to the old MPOL_INTERLEAVE when the HWLOC_MEMBIND_STRICT flag is given). I need to think more about it. One thing to keep in mind is the future extension of the mbind() syscall for custom interleaving. This won't fit in hwloc's current API at all, we may have to add a Linux-specific hwloc binding call. I'd like to see better this whole picture before take a decision on this new WEIGHTED flag alone (but I still want to make it in hwloc 2.11 in the near future).

bgoglin avatar Apr 24 '24 08:04 bgoglin

Hi @bgoglin,

When MPOL_PREFERRED_MANY was added, hwloc just used it instead of MPOL_PREFERRED because it was supposedly better in all cases.

Do you mean by 6abf03dd20caa2e3d0231ac0fcebf43d6acdd294?

The situation isn't exactly the same here, but we could still envision using WEIGHTED by default in hwloc since it doesn't break the current specification of hwloc's INTERLEAVE flag (we could also switch back to the old MPOL_INTERLEAVE when the HWLOC_MEMBIND_STRICT flag is given). I need to think more about it.

MPOL_WEIGHTED_INTERLEAVE works with a single global policy written at sysfs. If this isn't a problem in hwloc, then I think it can be fine to make it for default interleave policy. But I don't have much experience on hwloc project yet so please take your time for the further consideration before making decision.

One thing to keep in mind is the future extension of the mbind() syscall for custom interleaving. This won't fit in hwloc's current API at all, we may have to add a Linux-specific hwloc binding call. I'd like to see better this whole picture before take a decision on this new WEIGHTED flag alone (but I still want to make it in hwloc 2.11 in the near future).

It looks like you mean vma range weighted interleaving. It was previously suggested at https://lore.kernel.org/linux-mm/[email protected] with a mbind2 syscall but it requires to convince more maintainers why this is needed with more practical use cases. I think it won't be accepted anytime soon so we don't have to worry about it for now.

honggyukim avatar Apr 26 '24 05:04 honggyukim

Hello. I need to merge this and release hwloc 2.11. I think I finally decided to keep your additional WEIGHTED_INTERLEAVE flag instead of merging it into the existing INTERLEAVE. I am going to add "support" flag saying whether WEIGHTED_INTERLEAVE is supported on the current system. In practice it will be set on Linux and unset everywhere else. But it would be nice to detect at runtime whether the kernel really supports it. Do you see a way to do that without calling mbind() on a dummy page? Is there a new sysfs file that exposes node weights for instance?

bgoglin avatar Jun 12 '24 12:06 bgoglin

I pushed additional changes in https://github.com/bgoglin/hwloc/commits/honggyukim-weighted-interleave/ The first one are minor changes that need to be added to your commit. The second and last one can remain separated. Feel free to integrate them in your PR or I'll merge my branch directly.

bgoglin avatar Jun 12 '24 15:06 bgoglin

Hi @bgoglin,

Hello. I need to merge this and release hwloc 2.11. I think I finally decided to keep your additional WEIGHTED_INTERLEAVE flag instead of merging it into the existing INTERLEAVE.

Glad to hear that :)

But it would be nice to detect at runtime whether the kernel really supports it. Do you see a way to do that without calling mbind() on a dummy page? Is there a new sysfs file that exposes node weights for instance?

Yeah, the new sysfs interface is created at /sys/kernel/mm/mempolicy/weighted_interleave so you can check whether the system has this directory to check weighted interleaving feature. Inside the directory, there are multiple nodeN files that contain numerical weight values.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave
node0  node1  node2  node3

I pushed additional changes in https://github.com/bgoglin/hwloc/commits/honggyukim-weighted-interleave/ The first one are minor changes that need to be added to your commit. The second and last one can remain separated. Feel free to integrate them in your PR or I'll merge my branch directly.

Thanks, I will integrate your patches into my PR by tomorrow.

honggyukim avatar Jun 13 '24 01:06 honggyukim

Yeah, the new sysfs interface is created at /sys/kernel/mm/mempolicy/weighted_interleave so you can check whether the system has this directory to check weighted interleaving feature. Inside the directory, there are multiple nodeN files that contain numerical weight values.

Thanks, I added this to my commits.

bgoglin avatar Jun 13 '24 07:06 bgoglin

Hi @bgoglin, thanks for waiting. I've just slightly modified commit messages and pushed your patches into this PR after rebase onto the current master. Please have a look. Thanks!

honggyukim avatar Jun 13 '24 23:06 honggyukim

Thanks, I am backporting this into the v2.x branch and will release 2.11 with it, likely next week.

bgoglin avatar Jun 14 '24 07:06 bgoglin

Thanks, I am backporting this into the v2.x branch and will release 2.11 with it, likely next week.

Thanks. That would be great!

honggyukim avatar Jun 14 '24 07:06 honggyukim

I am posting 2.11rc1 right now with thes changes.

bgoglin avatar Jun 17 '24 10:06 bgoglin

That's great! Thanks for the update.

honggyukim avatar Jun 18 '24 04:06 honggyukim