ch4/ofi: fix case for forcing packing of gpu buffers in netmod pt2pt path
Pull Request Description
This PR fixes the case for when GPU packing should be done in the netmod pt2pt path. It was meant to be captured in #6433, but was missed.
In this code path, we must enable packing for any GPU buffer type. For ZE, this includes device, USM host, and USM shared allocations. The last 2 cases were missing. Originally, we implemented this check as follows
if (data_sz &&
(attr.type == MPL_GPU_POINTER_DEV || attr.type == MPL_GPU_POINTER_MANAGED ||
attr.type == MPL_GPU_POINTER_REGISTERED_HOST)) {
...
However, since we have the utility MPL_gpu_query_pointer_is_dev which performs this same check, I opted for that solution instead. This is more flexible; if a backend does not treat USM host or USM managed as device memory, then they can still pass this check correctly if the corresponding MPL_gpu_query_pointer_is_dev is correctly implemented.
Fixes #7030
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit. - [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.
test:mpich/ch4/gpu/ofi test:mpich/ch4/ofi
After rebuilding MPICH with and without this patch, I was unable to reproduce the crash using either the 4.2.x branch or main unless I also enabled MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 in the environment. With that env setting, MPICH will still crash even with this patch.
Only drop20231026 crashes in a default environment, so perhaps #6433 was sufficient to fix the issue. Still, this patch is probably OK to merge.
After rebuilding MPICH with and without this patch, I was unable to reproduce the crash using either the
4.2.xbranch ormainunless I also enabledMPIR_CVAR_CH4_OFI_ENABLE_HMEM=1in the environment. With that env setting, MPICH will still crash even with this patch.Only drop20231026 crashes in a default environment, so perhaps #6433 was sufficient to fix the issue. Still, this patch is probably OK to merge.
Thanks for confirming. Let me discuss with the team and look into the issue regarding MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1. To clarify, the crash happens with the latest reproducer you posted in #7030 while setting the CVAR?
It's also a little strange that drop20231026 crashes in the default environment and the others do not.. The drop should have #6433 included as well, but I'll take a look at other differences in this path between the drop and main
test:mpich/ch4/gpu
test:mpich/ch4/gpu
@abrooks98 [52b9083] can probably be squashed with your first commit if you agree. Please review [2ccda83] in addition. I think it helps clean up the logic of the device buffer inject path.
Thanks @raffenet for the additions; they look good. Let's squash the two you suggested
Thanks @raffenet for the additions; they look good. Let's squash the two you suggested
Will squash and merge once tests are complete. Thanks!