ucx
UCT/API/V2: Introduce md_query_v2
What
In preparation for https://github.com/openucx/ucx/pull/7847 being broken into separate PRs, introduce md_query_v2 in this PR.
@yosefe the failing test complains about not finding cuda transport on what seems like a machine without a GPU. Should such a test be running?
@yosefe would you mind restarting the two failing tests?
@yosefe I made edits to the PR to separate md_query_v2 and md_query API. Can you review when possible?
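For anyone following along, here is a minimal sketch of what keeping the two query entry points separate could look like. It assumes the transport implements only a v2-style callback and that the legacy entry point copies over the fields both layouts share. Every `demo_*` name and every field below is a hypothetical stand-in, not the actual UCT definition, and the real code in this PR may be organized differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the UCT types. */
typedef struct {
    size_t   max_reg; /* largest region that can be registered */
    uint64_t flags;   /* capability bits */
} demo_md_attr_t;     /* legacy attribute layout */

typedef struct {
    size_t   max_reg;
    uint64_t flags;
    uint64_t reg_mem_types; /* extra field carried only by the v2 layout */
} demo_md_attr_v2_t;

/* In this sketch a transport implements only the v2-style query. */
typedef int (*demo_md_query_v2_func_t)(demo_md_attr_v2_t *attr);

/* Example transport callback: fills every v2 field, returns 0 on success. */
int demo_self_md_query(demo_md_attr_v2_t *attr)
{
    attr->max_reg       = SIZE_MAX;
    attr->flags         = 0x1;
    attr->reg_mem_types = 0x1;
    return 0;
}

/* v2 entry point: forwards directly to the transport callback. */
int demo_md_query_v2(demo_md_query_v2_func_t query, demo_md_attr_v2_t *attr)
{
    return query(attr);
}

/* Legacy entry point: runs the v2 query, then copies the shared fields. */
int demo_md_query(demo_md_query_v2_func_t query, demo_md_attr_t *attr)
{
    demo_md_attr_v2_t attr_v2 = {0};
    int status                = demo_md_query_v2(query, &attr_v2);

    if (status != 0) {
        return status;
    }

    attr->max_reg = attr_v2.max_reg;
    attr->flags   = attr_v2.flags;
    return 0;
}
```

With this shape, a caller of the legacy entry point keeps receiving the old layout, while each transport only has to maintain one query function.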
@dmitrygx @brminich can you pls review?
@dmitrygx @brminich the following doesn't seem related:
2022-08-23T16:39:07.3474380Z [ RUN ] rcx/test_ucp_am_nbx_seg_size.multi/0 <rc_x>
2022-08-23T16:39:07.6539640Z [ INFO ] seg size 1024 data size 2048
2022-08-23T16:39:07.6609112Z [1661272747.660634] [swx-rain04:334083:0] wireup.c:404 UCX ERROR ep 0x7f8069f7b000: no remote ep address for lane[2]->remote_lane[2]
2022-08-23T16:39:17.6795627Z [1661272757.679247] [swx-rain04:334083:0] ucp_test.cc:301 UCX ERROR request 0x4655050 completed with error Destination is unreachable
2022-08-23T16:39:17.6797646Z /scrap/azure/agent-02/AZP_WORKSPACE/1/s/contrib/../test/gtest/ucp/test_ucp_am.cc:489: Failure
2022-08-23T16:39:17.6798651Z Value of: m_am_received
2022-08-23T16:39:17.6799213Z Actual: false
2022-08-23T16:39:17.6799511Z Expected: true
2022-08-23T16:39:17.6800637Z [1661272757.679334] [swx-rain04:334083:0] flush.c:28 UCX ERROR req 0x4654f40: error during flush: Endpoint timeout, flush comp 0x4654fd8 count reduced to 2
2022-08-23T16:39:17.6801830Z [1661272757.679340] [swx-rain04:334083:0] flush.c:28 UCX ERROR req 0x4654f40: error during flush: Endpoint timeout, flush comp 0x4654fd8 count reduced to 1
2022-08-23T16:39:17.6803004Z [1661272757.679345] [swx-rain04:334083:0] flush.c:28 UCX ERROR req 0x4654f40: error during flush: Endpoint timeout, flush comp 0x4654fd8 count reduced to 0
2022-08-23T16:40:17.8255445Z [1661272817.825240] [swx-rain04:334083:0] ucp_test.cc:296 UCX ERROR request 0x4644010 did not complete on time
2022-08-23T16:40:17.8462005Z [1661272817.845928] [swx-rain04:334083:0] mpool.c:55 UCX WARN object 0x4643f00 {flags:0x42 <no debug info>} was not returned to mpool ucp_requests
2022-08-23T16:40:17.8590677Z [1661272817.858871] [swx-rain04:334083:0] callbackq.c:466 UCX WARN 0 fast-path and 1 slow-path callbacks remain in the queue
2022-08-23T16:40:17.8853843Z /scrap/azure/agent-02/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test.cc:366: Failure
Would it be possible to restart the test?
yes, this is not related to your changes, we see this failure with other PRs. I'll restart the test after the whole CI completion.
yes, this issue should be fixed by #8472
@brminich @yosefe could you review pls?
@Akshay-Venkatesh it looks ok now, but there are still many clang-format issues. Most of them, especially long lines, are relevant. Can you pls go over https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=49816&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=aedfd754-44a6-53e6-6843-8659d800fee2 and check?
@yosefe I don't have write access to resolve merge conflicts. Can you help?
@Akshay-Venkatesh there is no need for upstream repo write access to push merge commits to your PR. You should be able to push it as any commit to the topic branch. Please do not squash or rebase; just push a merge commit with upstream/master.
@Akshay-Venkatesh could you pls fix code style warnings https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=50138&view=logs&j=cc064a77-22b5-56bf-ecc0-70b5fe764261&t=aedfd754-44a6-53e6-6843-8659d800fee2?
@Akshay-Venkatesh pls squash
I squashed changes and pushed directly to this PR. @yosefe could you review pls?
@dmitrygx seems some changes were added by force-push, is this expected?
comparing 1547163 and 45fac04 when merged to b9000ba
- merge of 1547163 to b9000ba is 5aae73b
- merge of 45fac04 to b9000ba is 45fac04
diff --git a/src/uct/cuda/cuda_copy/cuda_copy_md.c b/src/uct/cuda/cuda_copy/cuda_copy_md.c
index 2440cab5..ad9042fb 100644
--- a/src/uct/cuda/cuda_copy/cuda_copy_md.c
+++ b/src/uct/cuda/cuda_copy/cuda_copy_md.c
@@ -16,8 +16,8 @@
#include <ucs/debug/memtrack_int.h>
#include <ucs/type/class.h>
#include <ucs/profile/profile.h>
-#include <uct/cuda/base/cuda_iface.h>
#include <uct/api/v2/uct_v2.h>
+#include <uct/cuda/base/cuda_iface.h>
#include <cuda_runtime.h>
#include <cuda.h>
@@ -44,6 +44,7 @@ static ucs_config_field_t uct_cuda_copy_md_config_table[] = {
{NULL}
};
+
static ucs_status_t
uct_cuda_copy_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr)
{
diff --git a/src/uct/cuda/cuda_ipc/cuda_ipc_md.c b/src/uct/cuda/cuda_ipc/cuda_ipc_md.c
index acbfaa04..88424d61 100644
--- a/src/uct/cuda/cuda_ipc/cuda_ipc_md.c
+++ b/src/uct/cuda/cuda_ipc/cuda_ipc_md.c
@@ -28,6 +28,7 @@ static ucs_config_field_t uct_cuda_ipc_md_config_table[] = {
{NULL}
};
+
static ucs_status_t
uct_cuda_ipc_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr)
{
diff --git a/src/uct/cuda/gdr_copy/gdr_copy_md.c b/src/uct/cuda/gdr_copy/gdr_copy_md.c
index 7c83001e..a7c4d7b5 100644
--- a/src/uct/cuda/gdr_copy/gdr_copy_md.c
+++ b/src/uct/cuda/gdr_copy/gdr_copy_md.c
@@ -43,6 +43,7 @@ static ucs_config_field_t uct_gdr_copy_md_config_table[] = {
{NULL}
};
+
static ucs_status_t
uct_gdr_copy_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr)
{
diff --git a/src/uct/rocm/copy/rocm_copy_md.c b/src/uct/rocm/copy/rocm_copy_md.c
index 261d71d0..38b65520 100644
--- a/src/uct/rocm/copy/rocm_copy_md.c
+++ b/src/uct/rocm/copy/rocm_copy_md.c
@@ -18,8 +18,9 @@
#include <ucs/sys/math.h>
#include <ucs/debug/memtrack_int.h>
#include <ucm/api/ucm.h>
-#include <ucs/type/class.h>
#include <uct/api/v2/uct_v2.h>
+#include <ucs/type/class.h>
+
#include <hsa_ext_amd.h>
static ucs_config_field_t uct_rocm_copy_md_config_table[] = {
@@ -34,6 +35,7 @@ static ucs_config_field_t uct_rocm_copy_md_config_table[] = {
{NULL}
};
+
static ucs_status_t
uct_rocm_copy_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr_v2)
{
diff --git a/src/uct/rocm/ipc/rocm_ipc_md.c b/src/uct/rocm/ipc/rocm_ipc_md.c
index 1eed51bc..e621962b 100644
--- a/src/uct/rocm/ipc/rocm_ipc_md.c
+++ b/src/uct/rocm/ipc/rocm_ipc_md.c
@@ -22,6 +22,7 @@ static ucs_config_field_t uct_rocm_ipc_md_config_table[] = {
{NULL}
};
+
static ucs_status_t uct_rocm_ipc_md_query(uct_md_h md, uct_md_attr_t *md_attr,
uct_md_attr_v2_t *md_attr_v2)
{
@@ -38,7 +39,6 @@ static ucs_status_t uct_rocm_ipc_md_query(uct_md_h md, uct_md_attr_t *md_attr,
/* TODO: get accurate number */
md_attr->reg_cost = ucs_linear_func_make(9e-9, 0);
-
memset(&md_attr->local_cpus, 0xff, sizeof(md_attr->local_cpus));
return UCS_OK;
}
diff --git a/src/uct/sm/mm/posix/mm_posix.c b/src/uct/sm/mm/posix/mm_posix.c
index eefeb66d..7c5743a8 100644
--- a/src/uct/sm/mm/posix/mm_posix.c
+++ b/src/uct/sm/mm/posix/mm_posix.c
@@ -18,7 +18,6 @@
#include <ucs/sys/sys.h>
#include <sys/mman.h>
#include <sys/statvfs.h>
-#include <uct/api/v2/uct_v2.h>
/* File open flags */
diff --git a/src/uct/sm/mm/sysv/mm_sysv.c b/src/uct/sm/mm/sysv/mm_sysv.c
index 4731f97e..d9861868 100644
--- a/src/uct/sm/mm/sysv/mm_sysv.c
+++ b/src/uct/sm/mm/sysv/mm_sysv.c
@@ -14,7 +14,6 @@
#include <ucs/debug/log.h>
#include <ucs/sys/sys.h>
#include <ucs/profile/profile.h>
-#include <uct/api/v2/uct_v2.h>
#define UCT_MM_SYSV_PERM (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP)
@@ -43,6 +42,7 @@ static ucs_config_field_t uct_sysv_iface_config_table[] = {
{NULL}
};
+
static ucs_status_t uct_sysv_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr)
{
uct_mm_md_query(md, md_attr, ULONG_MAX);
diff --git a/src/uct/sm/scopy/knem/knem_md.c b/src/uct/sm/scopy/knem/knem_md.c
index efb4c189..ee664f30 100644
--- a/src/uct/sm/scopy/knem/knem_md.c
+++ b/src/uct/sm/scopy/knem/knem_md.c
@@ -36,6 +36,7 @@ static ucs_config_field_t uct_knem_md_config_table[] = {
{NULL}
};
+
ucs_status_t uct_knem_md_query(uct_md_h uct_md, uct_md_attr_v2_t *md_attr)
{
uct_knem_md_t *md = ucs_derived_of(uct_md, uct_knem_md_t);
diff --git a/src/uct/sm/scopy/knem/knem_md.h b/src/uct/sm/scopy/knem/knem_md.h
index d98a540b..25abc8d5 100644
--- a/src/uct/sm/scopy/knem/knem_md.h
+++ b/src/uct/sm/scopy/knem/knem_md.h
@@ -15,9 +15,11 @@
#include <uct/base/uct_md.h>
#include <uct/api/v2/uct_v2.h>
+
extern uct_component_t uct_knem_component;
ucs_status_t uct_knem_md_query(uct_md_h md, uct_md_attr_v2_t *md_attr);
+
/**
* @brief KNEM MD descriptor
*/
diff --git a/src/uct/sm/self/self.c b/src/uct/sm/self/self.c
index 390d6217..fae19f49 100644
--- a/src/uct/sm/self/self.c
+++ b/src/uct/sm/self/self.c
@@ -15,7 +15,6 @@
#include <ucs/type/class.h>
#include <ucs/sys/string.h>
#include <ucs/arch/cpu.h>
-#include <uct/api/v2/uct_v2.h>
#include "self.h"
@@ -389,6 +388,7 @@ static uct_iface_ops_t uct_self_iface_ops = {
.iface_is_reachable = uct_self_iface_is_reachable
};
+
static ucs_status_t uct_self_md_query(uct_md_h md, uct_md_attr_v2_t *attr)
{
/* Dummy memory registration provided. No real memory handling exists */
diff --git a/src/uct/tcp/tcp_md.c b/src/uct/tcp/tcp_md.c
index b2d90838..6e3d2c33 100644
--- a/src/uct/tcp/tcp_md.c
+++ b/src/uct/tcp/tcp_md.c
@@ -24,6 +24,7 @@ static ucs_config_field_t uct_tcp_md_config_table[] = {
{NULL}
};
+
static ucs_status_t uct_tcp_md_query(uct_md_h md, uct_md_attr_v2_t *attr)
{
/* Dummy memory registration provided. No real memory handling exists */
@yosefe some of them are only expected when manually rebased. I reverted these changes.
@yosefe squashed, could you reapprove pls?