mpich icon indicating copy to clipboard operation
mpich copied to clipboard

ch4/ofi: fix case for forcing packing of gpu buffers in netmod pt2pt path

Open abrooks98 opened this issue 1 year ago • 3 comments

Pull Request Description

This PR fixes the case for when GPU packing should be done in the netmod pt2pt path. It was meant to be captured in #6433, but was missed.

In this code path, we must enable packing for any GPU buffer type. For ZE, this includes device, USM host, and USM shared allocations. The last 2 cases were missing. Originally, we implemented this check as follows

        if (data_sz &&
            (attr.type == MPL_GPU_POINTER_DEV || attr.type == MPL_GPU_POINTER_MANAGED ||
             attr.type == MPL_GPU_POINTER_REGISTERED_HOST)) {
            ...

However, since we have the utility MPL_gpu_query_pointer_is_dev which performs this same check, I opted for that solution instead. This is more flexible; if a backend does not treat USM host or USM managed as device memory, then they can still pass this check correctly if the corresponding MPL_gpu_query_pointer_is_dev is correctly implemented.

Fixes #7030

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.

abrooks98 avatar Jun 26 '24 22:06 abrooks98

test:mpich/ch4/gpu/ofi test:mpich/ch4/ofi

abrooks98 avatar Jun 26 '24 22:06 abrooks98

After rebuilding MPICH with and without this patch, I was unable to reproduce the crash using either the 4.2.x branch or main unless I also enabled MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 in the environment. With that env setting, MPICH will still crash even with this patch.

Only drop20231026 crashes in a default environment, so perhaps #6433 was sufficient to fix the issue. Still, this patch is probably OK to merge.

raffenet avatar Jun 27 '24 17:06 raffenet

After rebuilding MPICH with and without this patch, I was unable to reproduce the crash using either the 4.2.x branch or main unless I also enabled MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 in the environment. With that env setting, MPICH will still crash even with this patch.

Only drop20231026 crashes in a default environment, so perhaps #6433 was sufficient to fix the issue. Still, this patch is probably OK to merge.

Thanks for confirming. Let me discuss with the team and look into the issue regarding MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1. To clarify, the crash happens with the latest reproducer you posted in #7030 while setting the CVAR?

It's also a little strange that drop20231026 crashes in the default environment and the others do not.. The drop should have #6433 included as well, but I'll take a look at other differences in this path between the drop and main

abrooks98 avatar Jun 27 '24 17:06 abrooks98

test:mpich/ch4/gpu

raffenet avatar Aug 16 '24 17:08 raffenet

test:mpich/ch4/gpu

raffenet avatar Aug 16 '24 19:08 raffenet

@abrooks98 [52b9083] can probably be squashed with your first commit if you agree. Please review [2ccda83] in addition. I think it helps clean up the logic of the device buffer inject path.

raffenet avatar Aug 16 '24 19:08 raffenet

Thanks @raffenet for the additions; they look good. Let's squash the two you suggested

abrooks98 avatar Aug 16 '24 19:08 abrooks98

Thanks @raffenet for the additions; they look good. Let's squash the two you suggested

Will squash and merge once tests are complete. Thanks!

raffenet avatar Aug 16 '24 20:08 raffenet