Alex Brooks

51 comments by Alex Brooks

@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but need to be able...

It seems we missed the idea that users may choose to skip using certain devices. In this case, the assertion is incorrect. I'm checking to see if it is sufficient...

My previous comment is incorrect. The assertion is still valid when certain devices are skipped. `local_ze_device_count` includes root devices and subdevices. So in the case of `ZE_AFFINITY_MASK=1`, the correct value is...

It turns out the issue with handling whole devices in `ZE_AFFINITY_MASK` is relatively new and stems from #6929. The change causes an unsigned int to be compared against an int with value...
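For readers unfamiliar with this class of bug, here is a minimal, generic sketch of how a mixed signed/unsigned comparison can go wrong in C. It is not the MPICH code from #6929: the function names are hypothetical, and the `-1` used in the usage note is only an illustrative value, since the comment above is truncated before it states the actual one.

```c
#include <stdbool.h>

/* Hypothetical helper, not MPICH code.
 * In C, comparing an unsigned int with an int converts the int to
 * unsigned int first. The explicit cast below performs the same
 * conversion the compiler would do implicitly for `index < limit`,
 * so a negative limit turns into a very large positive value. */
static bool below_limit_naive(unsigned int index, int limit)
{
    return index < (unsigned int) limit;
}

/* Hypothetical safer variant: reject negative limits before the
 * conversion, so the comparison only ever sees non-negative values. */
static bool below_limit_safe(unsigned int index, int limit)
{
    if (limit < 0) {
        return false;
    }
    return index < (unsigned int) limit;
}
```

With these definitions, `below_limit_naive(3u, -1)` evaluates to true because `-1` wraps around to `UINT_MAX`, while `below_limit_safe(3u, -1)` evaluates to false. Compilers can flag the implicit form of this conversion with `-Wsign-compare`.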

Thanks for pointing this out. I will try to find access to a Flex series GPU and continue investigating this issue.

@hzhou can you rebase this on `main` and resolve conflicts? Now that some GPU fixes for memory issues have been merged, we are hoping this will fix #7118.

@colleeneb are you able to re-test with 7202 (newly rebased on main) and with HMEM on? Now that some other GPU-related fixes are in main, we want to see the...

> Hello, I was trying to build this based off the instructions for Aurora here: https://github.com/pmodels/mpich/wiki/Using-MPICH-on-Aurora@ALCF since we were thinking it fixed a memory leak we saw and we wanted...

> @zippylab tested this out (using the build above) and unfortunately saw the same memory growth as he described in #6959. Is there anything missing in the build above,...

> @abrooks98 As discussed with you and Renzo Bustamante, there's no small reproducer for this, and it's far from trivial to construct one because it's an alltoallv() pattern and only...