UCT/API: Introduce migratable_mem_types

Open Akshay-Venkatesh opened this issue 1 year ago • 2 comments

What/Why ?

Introduce a migratable_mem_types field that is populated by MDs. It lists the memory types whose pages can migrate between host memory and PCIe device memory. When MDs identify specific allocation types as migratable, UCP can add those memory types to a list of candidates for nonblocking registration (for example, through ODP under the IB transport). Today, nonblocking registration is applied to the memory types listed in the reg_nb_mem_types field of the UCP context's external config. With MDs identifying which memory types migrate, the decision is instead made dynamically by the MDs.

How ?

Instead of listing the memory types that need on-demand registration for pages that migrate, users decide whether migratable memory should be registered, using the UCX_REG_MIGRATABLE_MEM env var. The default is to let the runtime decide based on the platform. UCX_REG_MIGRATABLE_MEM=off implies that no memory type undergoes nonblocking registration; UCX_REG_MIGRATABLE_MEM=on implies that all memory types that can be registered without explicit pinning are considered.
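To make the three settings concrete, here is a minimal, self-contained sketch of how the candidate set for nonblocking registration could be derived from per-MD attributes. Every identifier in it (the `sketch_` types, the `reg_nonblock_mem_types` field, the `AUTO` mode name) is hypothetical and used only for illustration; none of it is the UCX API or the code in this PR. In particular, the "runtime decides" case is assumed here to mean "types that some MD reports as migratable and that some MD can register without pinning", which the description above does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-type indices, for illustration only (not UCX's enum). */
typedef enum {
    SKETCH_MEM_TYPE_HOST = 0,
    SKETCH_MEM_TYPE_CUDA,
    SKETCH_MEM_TYPE_CUDA_MANAGED,
    SKETCH_MEM_TYPE_LAST
} sketch_mem_type_t;

#define SKETCH_BIT(i) ((uint64_t)1 << (i))

/* Hypothetical tri-state mirroring the UCX_REG_MIGRATABLE_MEM setting. */
typedef enum {
    SKETCH_REG_MIGRATABLE_OFF,
    SKETCH_REG_MIGRATABLE_ON,
    SKETCH_REG_MIGRATABLE_AUTO /* assumed name for "let the runtime decide" */
} sketch_reg_migratable_t;

/* Hypothetical per-MD attributes, each a bitmap over sketch_mem_type_t. */
typedef struct {
    uint64_t migratable_mem_types;   /* types whose pages may migrate */
    uint64_t reg_nonblock_mem_types; /* types registrable without pinning */
} sketch_md_attr_t;

/* Return the bitmap of memory types that are candidates for nonblocking
 * registration, given the user setting and the attributes of all MDs. */
static uint64_t
sketch_select_reg_nb_mem_types(sketch_reg_migratable_t mode,
                               const sketch_md_attr_t *mds, size_t num_mds)
{
    uint64_t migratable = 0;
    uint64_t nonblock   = 0;
    size_t i;

    assert((mds != NULL) || (num_mds == 0));

    /* Union of what the MDs report. */
    for (i = 0; i < num_mds; ++i) {
        migratable |= mds[i].migratable_mem_types;
        nonblock   |= mds[i].reg_nonblock_mem_types;
    }

    switch (mode) {
    case SKETCH_REG_MIGRATABLE_OFF:
        return 0;                     /* no type is registered nonblocking */
    case SKETCH_REG_MIGRATABLE_ON:
        return nonblock;              /* every type that needs no pinning */
    case SKETCH_REG_MIGRATABLE_AUTO:
        return migratable & nonblock; /* assumption: only migratable types */
    }

    return 0;
}
```

With one MD reporting managed memory as migratable and another able to register host and managed memory without pinning, `off` yields an empty set, `on` yields host plus managed, and the assumed default yields managed only.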

Next steps

This PR doesn't add the capability for individual MDs (like cuda_copy) to detect the specific system and populate migratable_mem_types. This could be done using the capabilities added in https://github.com/openucx/ucx/pull/9314, or by making query calls directly in the specific MD, in a follow-up PR.

Akshay-Venkatesh avatar Sep 13 '23 17:09 Akshay-Venkatesh

@yosefe / @brminich

This is a related error where gtest is trying to force the use of ODP for host memory. What would be the equivalent with "REG_MIGRATABLE_MEM"? Should we introduce config env vars in IB, CUDA, and other MDs to specify migratable memory types, so that each of these MDs can populate the migratable_mem_types field of its MD attributes with a list of memory types? That way, with UCX_REG_MIGRATABLE_MEM=on and, say, UCX_IB_MIGRATABLE_MEM_TYPES=host, ODP can be forced for host memory.

2023-09-13T18:00:54.9106591Z [----------] 1 test from shm_ib/test_ucp_rma_reg_nb
2023-09-13T18:00:54.9107209Z [ RUN      ] shm_ib/test_ucp_rma_reg_nb.put_blocking/0 <shm,ib>
2023-09-13T18:00:54.9108045Z /scrap/azure/agent-08/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test.cc:121: Failure
2023-09-13T18:00:54.9109097Z Invalid UCS configuration for REG_NONBLOCK_MEM_TYPES : host, error message: No such element(-12)
2023-09-13T18:00:54.9109718Z [  FAILED  ] shm_ib/test_ucp_rma_reg_nb.put_blocking/0, where GetParam() = shm,ib (0 ms)
2023-09-13T18:00:54.9110910Z [----------] 1 test from shm_ib/test_ucp_rma_reg_nb (0 ms total)
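
For illustration, a per-MD option like the one floated above would have to turn a comma-separated value such as `host` into a memory-type bitmap that the MD reports in migratable_mem_types. The sketch below shows one way to do that in plain C. `UCX_IB_MIGRATABLE_MEM_TYPES` is only the name proposed in this comment, and the function, the name table, and the bit assignment are all hypothetical; UCX's own config parser is not used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name table; bit i of the mask corresponds to entry i. */
static const char *const sketch_mem_type_names[] = {
    "host", "cuda", "cuda-managed"
};

#define SKETCH_NUM_MEM_TYPES \
    (sizeof(sketch_mem_type_names) / sizeof(sketch_mem_type_names[0]))

/* Parse a comma-separated list of memory-type names into a bitmap.
 * Returns 0 on success, or -1 if any element is unknown, in which case
 * *mask_p is left untouched. */
static int sketch_parse_mem_types(const char *list, uint64_t *mask_p)
{
    uint64_t mask = 0;
    const char *p = list;

    assert((list != NULL) && (mask_p != NULL));

    while (*p != '\0') {
        size_t len = strcspn(p, ","); /* length of the current element */
        int found  = 0;
        size_t i;

        for (i = 0; i < SKETCH_NUM_MEM_TYPES; ++i) {
            if ((strlen(sketch_mem_type_names[i]) == len) &&
                (strncmp(p, sketch_mem_type_names[i], len) == 0)) {
                mask |= (uint64_t)1 << i;
                found = 1;
                break;
            }
        }

        if (!found) {
            return -1; /* unknown or empty element */
        }

        p += len;
        if (*p == ',') {
            ++p; /* skip the separator */
        }
    }

    *mask_p = mask;
    return 0;
}
```

Rejecting unknown elements outright, rather than ignoring them, keeps a mistyped memory type from silently disabling ODP for the type the user meant.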

Akshay-Venkatesh avatar Sep 13 '23 18:09 Akshay-Venkatesh

Going forward with this change for now to see if tests pass. Will revert if there is a better approach to address the issue.

Akshay-Venkatesh avatar Sep 13 '23 18:09 Akshay-Venkatesh