MPI_T_pvar_get_num() returns extraneous number of variables and causes MPI_T_pvar_get_info() to return with error later
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
- 4.1.4
- 4.1.2 on a separate machine (laptop)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
- 4.1.4 was built from source tarball with "--prefix" option
- 4.1.2 was installed using Linux Mint's package manager (Synaptic)
Please describe the system on which you are running
- O/S:
- 4.1.4 was checked with "Rocky Linux 8.6 (Green Obsidian)"
- 4.1.2 was checked with "Linux Mint 21.1"
Details of the problem
- Running the following code reports more pvars than appear to be available, and querying them with MPI_T_pvar_get_info() returns an error.
#include"mpi.h"
#include<stdio.h>
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int provided;
MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
int num;
MPI_T_pvar_get_num(&num);
printf("%d performance variables\n", num);
char name[256], desc[256];
int len= (int) (sizeof(name)/ sizeof(name[0]));
int namelen, desclen, verbosity, varclass, bind, rdonly, cont, atomic;
MPI_Datatype dt;
MPI_T_enum et;
for(int i= 0; i< num; ++i) {
int err= MPI_T_pvar_get_info(i, name, &namelen, &verbosity, &varclass, &dt, &et, desc, &desclen, &bind, &rdonly, &cont, &atomic);
name[len- 1]= '\0'; desc[len- 1]= '\0';
printf("%2d: ", i);
switch(err) { /* do not call MPI_Error_string(), check what is happening */
case MPI_SUCCESS:
printf("%s (%d/ %d) - desc: \"%s\" (%d/%d)\n", name, namelen, len- 1, desc, desclen, len- 1);
break;
case MPI_T_ERR_NOT_INITIALIZED:
printf("The MPI tool information interface is not initialized(%d).\n", err);
break;
case MPI_T_ERR_INVALID_INDEX:
printf("Index is invalid or has been deleted(%d).\n", err);
break;
case MPI_T_ERR_INVALID:
/* function specification does not say this error will be returned. */
/* <https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_T_pvar_get_info.3.html> */
printf("Strange error: Invalid use of the interface or bad parameter values(s)(%d).\n", err);
break;
default:
printf("Strange error: Really unknown error!\n");
break;
}
}
MPI_T_finalize();
int ret= MPI_Finalize();
return ret;
}
Output for OpenMPI-4.1.4:
$ mpirun -np 1 ./a.out
20 performance variables
0: ��D� (31/ 255) - desc: "" (68/255)
1: Strange error: Invalid use of the interface or bad parameter values(s)(72).
2: Strange error: Invalid use of the interface or bad parameter values(s)(72).
3: Strange error: Invalid use of the interface or bad parameter values(s)(72).
4: Strange error: Invalid use of the interface or bad parameter values(s)(72).
5: Strange error: Invalid use of the interface or bad parameter values(s)(72).
6: Strange error: Invalid use of the interface or bad parameter values(s)(72).
7: Strange error: Invalid use of the interface or bad parameter values(s)(72).
8: Strange error: Invalid use of the interface or bad parameter values(s)(72).
9: Strange error: Invalid use of the interface or bad parameter values(s)(72).
10: Strange error: Invalid use of the interface or bad parameter values(s)(72).
11: Strange error: Invalid use of the interface or bad parameter values(s)(72).
12: Strange error: Invalid use of the interface or bad parameter values(s)(72).
13: Strange error: Invalid use of the interface or bad parameter values(s)(72).
14: Strange error: Invalid use of the interface or bad parameter values(s)(72).
15: Strange error: Invalid use of the interface or bad parameter values(s)(72).
16: Strange error: Invalid use of the interface or bad parameter values(s)(72).
17: Strange error: Invalid use of the interface or bad parameter values(s)(72).
18: osc_rdma_put_retry_count (25/ 255) - desc: "Number of times put transaction were retried due to resource limita" (68/255)
19: osc_rdma_get_retry_count (25/ 255) - desc: "Number of times get transaction were retried due to resource limita" (68/255)
Output for OpenMPI-4.1.2 is similar:
$ mpirun -np 1 ./a.out
46 performance variables
0:  (31/ 255) - desc: "" (68/255)
1: pml_ob1_unexpected_msgq_length (31/ 255) - desc: "Number of unexpected messages received by each peer in a communicat" (68/255)
2: pml_ob1_posted_recvq_length (28/ 255) - desc: "Number of unmatched receives posted for each peer in a communicator" (68/255)
3: Strange error: Invalid use of the interface or bad parameter values(s)(72).
4: Strange error: Invalid use of the interface or bad parameter values(s)(72).
5: Strange error: Invalid use of the interface or bad parameter values(s)(72).
6: Strange error: Invalid use of the interface or bad parameter values(s)(72).
7: Strange error: Invalid use of the interface or bad parameter values(s)(72).
8: Strange error: Invalid use of the interface or bad parameter values(s)(72).
9: Strange error: Invalid use of the interface or bad parameter values(s)(72).
10: Strange error: Invalid use of the interface or bad parameter values(s)(72).
11: Strange error: Invalid use of the interface or bad parameter values(s)(72).
12: Strange error: Invalid use of the interface or bad parameter values(s)(72).
13: Strange error: Invalid use of the interface or bad parameter values(s)(72).
14: Strange error: Invalid use of the interface or bad parameter values(s)(72).
15: Strange error: Invalid use of the interface or bad parameter values(s)(72).
16: Strange error: Invalid use of the interface or bad parameter values(s)(72).
17: Strange error: Invalid use of the interface or bad parameter values(s)(72).
18: Strange error: Invalid use of the interface or bad parameter values(s)(72).
19: Strange error: Invalid use of the interface or bad parameter values(s)(72).
20: Strange error: Invalid use of the interface or bad parameter values(s)(72).
21: Strange error: Invalid use of the interface or bad parameter values(s)(72).
22: Strange error: Invalid use of the interface or bad parameter values(s)(72).
23: Strange error: Invalid use of the interface or bad parameter values(s)(72).
24: Strange error: Invalid use of the interface or bad parameter values(s)(72).
25: Strange error: Invalid use of the interface or bad parameter values(s)(72).
26: Strange error: Invalid use of the interface or bad parameter values(s)(72).
27: Strange error: Invalid use of the interface or bad parameter values(s)(72).
28: Strange error: Invalid use of the interface or bad parameter values(s)(72).
29: Strange error: Invalid use of the interface or bad parameter values(s)(72).
30: Strange error: Invalid use of the interface or bad parameter values(s)(72).
31: osc_rdma_put_retry_count (25/ 255) - desc: "Number of times put transaction were retried due to resource limita" (68/255)
32: osc_rdma_get_retry_count (25/ 255) - desc: "Number of times get transaction were retried due to resource limita" (68/255)
33: mtl_psm2_rx_user_bytes (23/ 255) - desc: "Bytes received into a matched user buffer" (42/255)
34: mtl_psm2_rx_user_num (21/ 255) - desc: "Messages received into a matched user buf" (42/255)
35: mtl_psm2_rx_sys_byte (21/ 255) - desc: "Bytes received into an unmatched system b" (42/255)
36: mtl_psm2_rx_sys_num (20/ 255) - desc: "Messages received into an unmatched syste" (42/255)
37: mtl_psm2_tx_num (16/ 255) - desc: "Total Messages transmitted (shm and hfi)" (41/255)
38: mtl_psm2_tx_eag (16/ 255) - desc: "Messages transmitted eagerly" (29/255)
39: mtl_psm2_tx_eag (16/ 255) - desc: "Bytes transmitted eagerl" (25/255)
40: mtl_psm2_tx_rnd (16/ 255) - desc: "Messages transmitted usi" (25/255)
41: mtl_psm2_tx_rnd (16/ 255) - desc: "Bytes transmitted using " (25/255)
42: mtl_psm2_tx_shm (16/ 255) - desc: "Messages transmitted (sh" (25/255)
43: mtl_psm2_rx_shm (16/ 255) - desc: "Messages received throug" (25/255)
44: mtl_psm2_rx_sys (16/ 255) - desc: "Number of system buffers" (25/255)
45: mtl_psm2_rx_sys (16/ 255) - desc: "Bytes allocated for syst" (25/255)
Just had a quick look at this:
I checked with Open MPI 4.1.5, and it seems that some components are removed in MPI_Init(), which invalidates their performance variables via mca_base_pvar_mark_invalid and causes the later errors.
With MPI_Init and MPI_Finalize removed, the program runs without errors.
Note: The current main branch (a75f933a923c0f3cb53a887f4dd4158aeffc9695) works fine for me.
This seems to have the same root cause as https://github.com/open-mpi/ompi/pull/11475: the underlying MCA param does not deal gracefully with the re-registration of events, i.e. multiple calls to opal_utils_init/_fini.
Seeing this now marked for 5.0.x, I tried the current v5.0.x (f4f52032ea8daa067f91b78fb651a9aa23c752d8) and cannot reproduce the problem from this issue with it.
@cniethammer That's very good news. Can you try aggressively with MPI_T and/or sessions to ensure that we don't have the same issue? E.g., check with a debugger / valgrind / whatever.
Hey @jsquyres, I also encountered this issue with v5.1.0a1 (5560bde). I found that it is caused by the MCA component that registers these performance variables not being selected.
As mentioned by @cniethammer, the "Strange error" only occurs when MPI_Init has been called beforehand. Without MPI_Init, the output looks like this:
mpirun -np 1 ./show_mpi_t_info
20 performance variables
0: ���� (31/ 255) - desc: "" (1/255)
1: osc_rdma_put_retry_count (25/ 255) - desc: "" (1/255)
2: osc_rdma_get_retry_count (25/ 255) - desc: "" (1/255)
3: pml_monitoring_flush (21/ 255) - desc: "" (1/255)
4: pml_monitoring_messa (21/ 255) - desc: "" (1/255)
5: pml_monitoring_messa (21/ 255) - desc: "" (1/255)
6: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
7: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
8: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
9: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
10: coll_monitoring_mess (21/ 255) - desc: "" (1/255)
11: coll_monitoring_mess (21/ 255) - desc: "" (1/255)
12: coll_monitoring_o2a_ (21/ 255) - desc: "" (1/255)
13: coll_monitoring_o2a_ (21/ 255) - desc: "" (1/255)
14: coll_monitoring_a2o_ (21/ 255) - desc: "" (1/255)
15: coll_monitoring_a2o_ (21/ 255) - desc: "" (1/255)
16: coll_monitoring_a2a_ (21/ 255) - desc: "" (1/255)
17: coll_monitoring_a2a_ (21/ 255) - desc: "" (1/255)
18: pml_ob1_unexpected_m (21/ 255) - desc: "" (1/255)
19: pml_ob1_posted_recvq (21/ 255) - desc: "" (1/255)
So there are performance variables registered by coll/monitoring, pml/monitoring, osc/monitoring, pml/ob1, and osc/rdma, which caused the errors in @kingshuk00's output. With MPI_Init() called and pml_monitoring_enable set to 2, the output is:
mpirun -np 1 --mca pml_monitoring_enable 2 ./show_mpi_t_info
20 performance variables
0: ���� (31/ 255) - desc: "" (1/255)
1: pml_monitoring_flush (21/ 255) - desc: "" (1/255)
2: pml_monitoring_messa (21/ 255) - desc: "" (1/255)
3: pml_monitoring_messa (21/ 255) - desc: "" (1/255)
4: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
5: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
6: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
7: osc_monitoring_messa (21/ 255) - desc: "" (1/255)
8: coll_monitoring_mess (21/ 255) - desc: "" (1/255)
9: coll_monitoring_mess (21/ 255) - desc: "" (1/255)
10: coll_monitoring_o2a_ (21/ 255) - desc: "" (1/255)
11: coll_monitoring_o2a_ (21/ 255) - desc: "" (1/255)
12: coll_monitoring_a2o_ (21/ 255) - desc: "" (1/255)
13: coll_monitoring_a2o_ (21/ 255) - desc: "" (1/255)
14: coll_monitoring_a2a_ (21/ 255) - desc: "" (1/255)
15: coll_monitoring_a2a_ (21/ 255) - desc: "" (1/255)
16: Strange error: Invalid use of the interface or bad parameter value(s)(72).
17: Strange error: Invalid use of the interface or bad parameter value(s)(72).
18: osc_rdma_put_retry_c (21/ 255) - desc: "" (1/255)
19: osc_rdma_get_retry_c (21/ 255) - desc: "" (1/255)
Comparing the two results, we can see that pml_ob1_unexpected_m and pml_ob1_posted_recvq are missing, and "Strange errors" occur.
It seems that if a component is not selected, the mca_base_pvar_flag_t of its registered performance variables will be set to MCA_BASE_PVAR_FLAG_INVALID, causing this issue.
Call stack:
[admin1:30550] [ 5] /public/home/rdma22/wjy/install/openmpi/lib/libopen-pal.so.0(mca_base_var_group_deregister+0x108)[0x2ac6ddc1f3a8]
[admin1:30550] [ 6] /public/home/rdma22/wjy/install/openmpi/lib/libopen-pal.so.0(mca_base_component_unload+0x36)[0x2ac6ddc13a46]
[admin1:30550] [ 7] /public/home/rdma22/wjy/install/openmpi/lib/libopen-pal.so.0(mca_base_components_close+0x46)[0x2ac6ddc13b46]
[admin1:30550] [ 8] /public/home/rdma22/wjy/install/openmpi/lib/libmpi.so.0(mca_pml_base_select+0x43a)[0x2ac6daf1318a]
[admin1:30550] [ 9] /public/home/rdma22/wjy/install/openmpi/lib/libmpi.so.0(+0x9873e)[0x2ac6dada373e]
[admin1:30550] [10] /public/home/rdma22/wjy/install/openmpi/lib/libmpi.so.0(ompi_mpi_instance_init+0x5b)[0x2ac6dada3f7b]
[admin1:30550] [11] /public/home/rdma22/wjy/install/openmpi/lib/libmpi.so.0(ompi_mpi_init+0xe1)[0x2ac6dad973c1]
[admin1:30550] [12] /public/home/rdma22/wjy/install/openmpi/lib/libmpi.so.0(MPI_Init+0x9b)[0x2ac6dadc71eb]
mca_base_pvar_mark_invalid(params[i]); is then called, which sets the flag to invalid:
int mca_base_pvar_mark_invalid(int index)
{
mca_base_pvar_t *pvar;
int ret;
ret = mca_base_pvar_get_internal(index, &pvar, false);
if (OPAL_SUCCESS != ret) {
return ret;
}
pvar->flags |= MCA_BASE_PVAR_FLAG_INVALID;
return OPAL_SUCCESS;
}
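To make the effect of that flag concrete, here is a minimal, MPI-free model of a registry whose indices never move. It is only a sketch: every `model_*` name, the flag value, and the error code are hypothetical and are not Open MPI's data structures. It shows why the reported count stays the same after a component is unloaded while queries for its variables start failing.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FLAG_INVALID 0x1u  /* hypothetical flag value */
#define MODEL_ERR_INVALID  (-1)  /* hypothetical error code */
#define MODEL_MAX_VARS     8

struct model_pvar {
    const char *name;
    int component;      /* id of the component that registered it */
    unsigned flags;
};

struct model_registry {
    struct model_pvar vars[MODEL_MAX_VARS];
    int count;          /* only ever grows, so indices stay stable */
};

/* Append a variable and return its permanent index, or an error if full. */
int model_register(struct model_registry *r, const char *name, int component)
{
    assert(r != NULL);
    if (r->count >= MODEL_MAX_VARS) {
        return MODEL_ERR_INVALID;
    }
    r->vars[r->count] = (struct model_pvar){ name, component, 0u };
    return r->count++;
}

/* Unloading a component retires its variables in place; nothing is removed. */
void model_unload(struct model_registry *r, int component)
{
    for (int i = 0; i < r->count; ++i) {
        if (r->vars[i].component == component) {
            r->vars[i].flags |= MODEL_FLAG_INVALID;
        }
    }
}

/* The count includes retired variables. */
int model_get_num(const struct model_registry *r)
{
    return r->count;
}

/* Returns 0 and the name for live variables, an error for retired ones. */
int model_get_info(const struct model_registry *r, int index, const char **name)
{
    if (index < 0 || index >= r->count) {
        return MODEL_ERR_INVALID;
    }
    if (r->vars[index].flags & MODEL_FLAG_INVALID) {
        return MODEL_ERR_INVALID;
    }
    *name = r->vars[index].name;
    return 0;
}
```

In this model, registering two variables from different components and then unloading one of them leaves `model_get_num()` at 2, while `model_get_info()` fails for the retired index. That has the same shape as the output above.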
@jywangx Interesting. Any thoughts on a fix? @cniethammer? @bosilca?
First, the weird output of the test provided here is due to a bug in the test. MPI_T_pvar_get_info has the name_len and desc_len parameters as INOUT. This means before the call they must be set to the length of the name and desc arrays, and upon return they will contain the updated length of the pertinent information in these arrays. Add
namelen = (int) (sizeof(name)/ sizeof(name[0]));
desclen = (int) (sizeof(desc)/ sizeof(desc[0]));
before the call to MPI_T_pvar_get_info in the loop to fix this issue.
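The effect of a stale length can be reproduced without MPI. The sketch below models the IN/OUT length convention with a hypothetical helper, `model_get_name`; it is not Open MPI code. The value written back on truncation is an assumption chosen to match the outputs quoted earlier in this thread, and the full name `pml_monitoring_messages_count` is used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for how an MPI_T routine returns a string.
 * *len is IN/OUT: on entry the buffer capacity, on exit the number of
 * characters stored plus one for the terminator (assumed behavior). */
void model_get_name(const char *full, char *buf, int *len)
{
    int n = *len;                     /* caller-supplied capacity */
    assert(buf != NULL && n > 0);     /* the model needs a real buffer */
    int copy = (int) strlen(full);
    if (copy > n - 1) {
        copy = n - 1;                 /* truncate to fit */
    }
    memcpy(buf, full, (size_t) copy);
    buf[copy] = '\0';
    *len = copy + 1;
}

/* Query two names in a row. With reset_len == 0, the length left over
 * from the first call is reused as the capacity for the second call,
 * as in the original reproducer. */
void query_twice(int reset_len, char *second, int second_size)
{
    char first[256];
    int len = (int) sizeof(first);
    model_get_name("pml_monitoring_flush", first, &len);  /* len becomes 21 */
    if (reset_len) {
        len = second_size;            /* the fix: restore the capacity */
    }
    model_get_name("pml_monitoring_messages_count", second, &len);
}
```

Calling `query_twice(0, ...)` yields the truncated `pml_monitoring_messa`, the same 20-character stub seen in the listings above. Calling `query_twice(1, ...)` yields the full name.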
Second, I don't see any issue with this behavior. As OMPI dynamically loads and unloads components, it is normal that some of the performance variables go out of scope (basically all those registered by unloaded components). We could have removed them from the list of performance variables, but the MPI standard prevents this by requiring that the index of a performance variable never change. Thus, reporting the index as invalid is a sensible way to prevent the user from using retired performance variables.
Last, but not least, what really bothers me is the following sentence in the MPI standard (Section 15.3.7):
After a successful call to MPI_T_PVAR_GET_INFO for a particular variable, subsequent calls to this routine that query information about the same variable must return the same information. An MPI implementation is not allowed to alter any of the returned values.
Because of this, listing the performance variables before and after MPI_Init will break this requirement in OMPI.
#include"mpi.h"
#include<stdio.h>
void list_pvars(const char* msg)
{
int num;
MPI_T_pvar_get_num(&num);
printf("%s %d performance variables\n", msg, num);
char name[256], desc[256];
const int len = (int) (sizeof(name)/ sizeof(name[0]));
int namelen, desclen, verbosity, varclass, bind, rdonly, cont, atomic;
MPI_Datatype dt;
MPI_T_enum et;
for(int i= 0; i< num; ++i) {
namelen = (int) (sizeof(name)/ sizeof(name[0]));
desclen = (int) (sizeof(desc)/ sizeof(desc[0]));
int err= MPI_T_pvar_get_info(i, name, &namelen, &verbosity, &varclass, &dt, &et, desc, &desclen, &bind, &rdonly, &cont, &atomic);
printf("%2d: ", i);
switch(err) { /* do not call MPI_Error_string(), check what is happening */
case MPI_SUCCESS:
name[namelen-1]= '\0'; desc[desclen-1]= '\0';
printf("%s (%d/ %d) - desc: \"%s\" (%d/%d)\n", name, namelen, len-1, desc, desclen, len-1);
break;
case MPI_T_ERR_NOT_INITIALIZED:
printf("The MPI tool information interface is not initialized(%d).\n", err);
break;
case MPI_T_ERR_INVALID_INDEX:
printf("Index is invalid or has been deleted(%d).\n", err);
break;
case MPI_T_ERR_INVALID:
/* function specification does not say this error will be returned. */
/* <https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_T_pvar_get_info.3.html> */
printf("Strange error: Invalid use of the interface or bad parameter values(s)(%d).\n", err);
break;
default:
printf("Strange error: Really unknown error!\n");
break;
}
}
}
int main(int argc, char *argv[])
{
int provided;
MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
list_pvars("\n\nbefore MPI_Init - ");
MPI_T_finalize();
MPI_Init(&argc, &argv);
MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
list_pvars("\n\nafter MPI_Init - ");
MPI_T_finalize();
int ret= MPI_Finalize();
return ret;
}
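The two listings printed by a program like this could also be compared mechanically instead of by eye. The following is a small sketch of one way to do that. It assumes the results of each pass have been copied into a hypothetical `pvar_snapshot` array; neither that struct nor `count_changed` exists in MPI or Open MPI.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record of one query result. */
struct pvar_snapshot {
    int err;          /* return code of the query, 0 meaning success */
    char name[64];    /* name reported for this index, if any */
};

/* Count indices that were reported successfully in `before` but whose
 * return code or name differs in `after`. Indices that never succeeded
 * in `before` are skipped, since the quoted sentence only constrains
 * calls made after a successful one. */
int count_changed(const struct pvar_snapshot *before,
                  const struct pvar_snapshot *after, int n)
{
    assert(before != NULL && after != NULL);
    int changed = 0;
    for (int i = 0; i < n; ++i) {
        if (before[i].err != 0) {
            continue;
        }
        if (after[i].err != 0 ||
            strcmp(before[i].name, after[i].name) != 0) {
            ++changed;
        }
    }
    return changed;
}
```

A non-zero result means that at least one index returned different information on the second pass.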