Trilinos
Random mass test failures showing "Error initializing RM connection. Exiting" on 'vortex'
CC: @jjellio
As shown in this query, we are getting random mass test failures on 'vortex'. As shown in that query, when this occurs in a given build it impacts over a thousand tests, and it is random which builds and which days are impacted. When it occurs, the failing tests show output like:
Error: Remote JSM server is not responding on host vortex59
02-19-2020 03:31:02:827 114114 main: Error initializing RM connection. Exiting.
@jjellio, I am not sure how much detail we can put in this issue, but at least this gets the problem on the board.
I am setting up the CDash summary emails to filter out tests that show this failure, using the following filters:
- Promoted ATDM Trilinos builds: https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2020-02-19&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=ATDM&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%3A%20Remote%20JSM%20server%20is%20not%20responding%20on%20host%20vortex
- Specialized ATDM Trilinos Builds Cleanup: https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2020-02-18&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Specialized&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%3A%20Remote%20JSM%20server%20is%20not%20responding%20on%20host%20vortex
That will allow me to start cleaning up the failing tests in these builds without letting these failures flood the emails and make them worthless.
I just pushed the following TrilinosATDMStatus repo commit:
*** Base Git Repo: TrilinosATDMStatus
commit b3c45994e778b3d784498f20c8116c419f5a08ed
Author: Roscoe A. Bartlett <[email protected]>
Date: Wed Feb 19 07:43:36 2020 -0700
Filter out random mass 'JSM server not responding' errors (trilinos/Trilinos#6861)
It seems that when this occurs in a build, the test failures showing:
Error: Remote JSM server is not responding on host vortexXXX
are massive, taking down hundreds to thousands of tests once they start.
Adding these filters up front just filters them out.
Note that filtering these tests out beforehand will result in tracked tests
being listed as missing (twim) if they have this error.
But this way, we can start to triage all of the builds on vortex.
I also updated the builds filter to allow the build:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
now that we can filter out mass failures due to this "Remote JSM server" issue.
M trilinos_atdm_builds_status.sh
M trilinos_atdm_specialized_cleanup_builds_status.sh
I also updated the Jenkins job:
- https://jenkins-srn.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt/
so that it will submit to the 'Specialized' CDash group, which means I can now clean up those tests as well.
Hopefully this will allow us to clean up the failing tests even while these random mass test failures showing Error: Remote JSM server is not responding on host are occurring. Even if we only get results every other day, that should be enough to maintain these builds and get useful test results.
@jjellio, I don't know if this is related, but as shown in this query, we are also seeing failing tests showing errors like:
[csmapi][error] recvmsg timed out. rc=-1
[csmapi][error] RECEIVE ERROR. rlen=-1
[csmapi][error] /home/ppsbld/workspace/PUBLIC_CAST_V1.6.x_ppc64LE_RH7.5_ProdBuild/csmnet/src/C/csm_network_local.c-673: Client-Daemon connection error. errno=11
csm_net_unix_Connect: Resource temporarily unavailable
[csmapi][error] csm_net_unix_Connect() failed: /run/csmd.sock
Error. Failed to initialize CSM library.
Error: It is only possible to use js commands within a job allocation unless CSM is running
02-19-2020 04:07:02:845 50896 main: Error initializing RM connection. Exiting.
And as shown in this query we are seeing random failures showing:
Warning: PAMI CUDA HOOK disabled
What is that?
I will filter all of these out of the CDash summary email filter.
Just a warning: because this is a single process, it is disabling the 'cuda hooks' (my 2019 issue).
We can see that from the full output thanks to the patch we pushed through:
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun '-M -disable_gpu_hooks'
WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
BEFORE: jsrun '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/rol/example/PDE-OPT/helmholtz/ROL_example_PDE-OPT_helmholtz_example_02.exe' 'PrintItAll'
out_file=4dd1a321b4bcc5c1c294a7bea8279523.out
Warning: PAMI CUDA HOOK disabled
If we don't disable the cuda hooks, and the process doesn't call MPI_Init first, then the entire test will fail. It is just a benign warning.
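To make the mechanism concrete, here is a minimal sketch of what a wrapper along these lines could look like. This is hypothetical, not the actual Trilinos ATDM wrapper; the TPETRA_ASSUME_CUDA_AWARE_MPI variable and the '-M -disable_gpu_hooks' argument are just taken from the test output above.

```bash
#!/bin/bash
# Hypothetical jsrun wrapper sketch: when CUDA-aware MPI is not requested,
# disable the PAMI CUDA hooks so a single-process test that never calls
# MPI_Init can still run (it will only print the benign HOOK warning).
echo "BEFORE: jsrun $*"

extra_args=()
if [[ "${TPETRA_ASSUME_CUDA_AWARE_MPI:-0}" != "1" ]]; then
  # Pass '-M -disable_gpu_hooks' through to the underlying launcher, as seen
  # in the AFTER line of the test output above.
  extra_args+=("-M -disable_gpu_hooks")
fi

echo "AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=${TPETRA_ASSUME_CUDA_AWARE_MPI:-}; jsrun ${extra_args[*]} $*"
exec jsrun "${extra_args[@]}" "$@"
```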
@jjellio, as shown here, what does the error:
Error: error in ptssup_mkcltsock_afunix()
02-14-2020 04:59:12:982 24104 main: Error initializing RM connection. Exiting.
mean?
I'll be optimistic and assume it is an extraterrestrial offering of good will and happiness (I have no idea!)
As users, I don't think we can drill into the JSM/RM connection stuff. The admins are taking a careful look at the software stack to see if perhaps some component is missing. All of this CI / automated stuff is going to tease out all kinds of errors (and the fact that we test the NP=1 case without MPI_Init is going to exercise functionality that I do not think many have used; but we need to do it, since Trilinos' integrity is verified by both sequential and parallel unit tests).
I have a very good reproducer for one of the RM connection issues, and I passed on some example scripts that can reproduce it and demonstrate proper behavior. So hopefully we can get this hammered out. My build tools on Vortex utilize jsrun heavily to do on-node configures and compiles, so this has hindered me as well.
NOTE: Issue #6875 is really a duplicate of this issue.
FYI: With the Trilinos PR testing system down with no ETA to fix, I manually merged the branch in PR #6876 to the 'atdm-nightly-manual-updates' branch just now in commit 47b673b so it will be in the 'atdm-nightly' branch tonight and we will see this running tomorrow.
Putting this Issue in review to see if this fixes the problem.
This is not a duplicate of #6875. There should still be JSM RM connection issues (they are being patched by Spectrum), but the issue I raised should help with stability. It is also possible that our collaboration with LLNL actually fixed another issue, so I am curious how impactful the results of #6875 are. It would be nice if we made a huge dent in this broader RM connection problem.
@jjellio, even after the update from PR #6876, we are still getting some errors shown here like:
WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
BEFORE: jsrun '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe' '--use-tpetra' '--use-twod' '--cell=Quad' '--x-elements=16' '--y-elements=16' '--z-elements=4' '--basis-order=2'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=; jsrun '-p' '4' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_opt/SRC_AND_BUILD/BUILD/packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe' '--use-tpetra' '--use-twod' '--cell=Quad' '--x-elements=16' '--y-elements=16' '--z-elements=4' '--basis-order=2'
out_file=f5ab0fc965e1c8b955384b56a322d090.out
[vortex3:45451] OPAL ERROR: Unreachable in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[vortex3:45451] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
What does the error OPAL ERROR: Unreachable in file ext3x_client.c mean? From:
- https://github.com/open-mpi/ompi/issues/7025
It looks like MPI_Init() is trying to be run twice on the same MPI rank (even if done by two processes). How can anything we are doing cause this? Is this an LSF bug? Have you seen this before?
FYI: I have changed the severity of this from ATDM Sev: Critical to ATDM Sev: Nonblocker. This is because I have updated the driver scripts that monitor the ATDM Trilinos builds on CDash to filter out tests that show these errors, as described above. Also, as one can see from looking at the ats2 builds over the last 2 weeks, this only occurs in about 1/3 of the builds or less, so we are still getting most test results in the other 2/3 of the builds. That is good enough to work on the cleanup of the rest of the tests that are due to other issues.
Ross, this is good! It looks like a different type of error, so maybe we actually made some real progress on that elusive RM connection problem!
I am having a face to face w/LLNL folks, and I will ask them about this.
@jjellio, right. We don't seem to be seeing any more of the failures like:
Error: error in ptssup_mkcltsock_afunix()
02-14-2020 04:59:12:982 24104 main: Error initializing RM connection. Exiting.
However, just to be clear, we are still seeing mass random test failures showing Error: Remote JSM server is not responding on host on 'vortex'. For example, just from today you can see:
- https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&date=2020-02-25&filtercount=1&showfilters=1&field1=buildname&compare1=61&value1=Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
showing a ton of these failing tests.
We are told that the March upgrade of 'vortex' may fix these.
This may be related to using LD_PRELOAD in the wrapper, but when someone undoes that, it is going to cause the tests that allocate before MPI_Init to fail (so we decided to delay it).
It would be worthwhile to delete the LD_PRELOAD stuff and see if that resolves the MPI_Init failures (by using LD_PRELOAD, I can promise we are doing things outside the way it was intended).
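A minimal way to try that experiment by hand, assuming the wrapper simply exports LD_PRELOAD before invoking jsrun, would be to rerun one of the failing tests with LD_PRELOAD cleared (the command below is just the Panzer example from the output above, run from the build directory):

```bash
# One-off manual experiment: clear LD_PRELOAD and see whether the
# 'OPAL ERROR: Unreachable' failure during MPI_Init still occurs.
env -u LD_PRELOAD \
  jsrun -p 4 --rs_per_socket 4 \
  ./packages/panzer/adapters-stk/example/MixedCurlLaplacianExample/PanzerAdaptersSTK_MixedCurlLaplacianExample.exe \
  --use-tpetra --use-twod --cell=Quad --x-elements=16 --y-elements=16 \
  --z-elements=4 --basis-order=2
```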
FYI: It was suggested in the email chain captured in CDOFA-94 that one workaround would be to kill the allocation and rerun the failing tests over and over again once we see the first test failure showing Error: Remote JSM server is not responding on host. That would make for a very complex implementation of the ctest -S driver. I am not sure that would even be possible without extending ctest, and I don't even want to think about something like that. An alternative approach would be to run ctest on the login node and get a new interactive bsub allocation for each individual test; that would resolve the problem, but boy would it be slow, since it can take several seconds (or longer) to get an allocation and some tests in Trilinos finish in a fraction of a second. Each build has around 2200 tests, so that would be about 2200 interactive bsub calls per build!
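For concreteness, that per-test approach would amount to something like the rough sketch below (the bsub options are placeholders and would differ on 'vortex'), which is exactly why the allocation overhead makes it a non-starter:

```bash
# Hypothetical per-test driver: request a fresh interactive allocation for
# every single ctest test. With ~2200 tests per build and several seconds of
# allocation latency each, the overhead dominates the actual test time.
ctest -N | grep 'Test  *#' | sed 's/.*: //' | while read -r test_name; do
  bsub -nnodes 1 -W 10 -Is ctest -R "^${test_name}\$" --timeout 600
done
```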
Here are some options for what to do:
- Keep doing what we are doing: We are getting full results about 1/2 of the time on average and our cdash_analyze_and_report.py drivers are filtering out these test failures, so we are still able to monitor things okay.
- Turn off the tests on 'vortex' until they can get a working and robust batch system capable of running our test suite: That would save a huge amount of time, and we can leave it to the APP teams to fight with this mess.
Ross, would it make sense to lower the cadence of testing on vortex? That is, run the test suite weekly and just keep rerunning it till it works? It would let the tests run, but it would give a coarser granularity for figuring out the culprit if a test failed. I tend to think running (and rerunning over a week's time) would at least get the library tested, which may be a better alternative than waiting for the machine to work.
@jjellio, all things considered, given the current state of things, in my opinion, what we have now is fine and is the best we can do until they can make the system more stable. And for the most part, running the test suite "until it works" would never terminate, because there are always at least some failing tests (just look at the rest of the ATDM Trilinos builds).
It is likely better to discuss this offline than to try to do this in detail over github issue comments.
From the updated email thread documented in CDOFA-94, it seems that the upgrade of 'vortex' that would fix the problems with jsrun will not occur until April (or later?). The proposed solution is to run fewer than 800 jsrun jobs in a single bsub allocation. They claim that should be robust. Therefore, I think we should trim down the Trilinos test suite we run on 'vortex' to just a few of the critical and actively developed packages like Kokkos, KokkosKernels, Tpetra, Zoltan2, Amesos2, SEACAS, MueLu, and Panzer. Adding up the number of tests for these packages for the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi on 2020-02-28 shown here gives 727 tests.
So I guess the plan needs to be to trim down the tests that we run on 'vortex'. I think we should still build all of the tests in all of the packages; we will just run a subset of them. To do that, I will need to make a small extension to TriBITS. I will do that when I get some time.
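Until that TriBITS extension exists, a rough approximation at test time would be to restrict ctest to those packages by label (a sketch only; it assumes the package-name LABELS that the TriBITS-generated builds attach to each test):

```bash
# Run only the tests labeled with one of the critical packages, while the
# build itself still compiles everything. The label names are assumed to
# match the package names listed above.
ctest -j8 -L '^(Kokkos|KokkosKernels|Tpetra|Zoltan2|Amesos2|SEACAS|MueLu|Panzer)$'
```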
The update from the admins documented in CDOFA-94 is that there is a 0.5% chance that any jsrun invocation will fail, and once one does fail, all future jsrun invocations in that bsub node allocation will fail as well. The ETA for a fix is not until an upgrade of the system currently scheduled for April (which means May or later).
At this point, I am not sure what to do about this GitHub Issue. We can't close this issue because it is not really resolved. But there is not really anything we can do about it.
I think I should just leave this "In Review" and then put the "Stalled" label on it. I think the system we have, with the cdash_analyze_and_report.py usage filtering out these failures, is okay for now, but it might be nice to also filter these failures out of the test history, since they inflate the number of failures shown there. That will complicate that Python code some, but it might be worth it.
This is the most extreme case of system issues that we have had to deal with in the last 2 years.
CC: @jjellio
FYI: As shown in this query, today in the build Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg we saw 1267 tests that failed with the error message:
Error: error in ptssup_mkcltsock_afunix()
03-19-2020 04:05:50:783 40194 main: Error initializing RM connection. Exiting.
That is a different error message from the one we have been seeing before, which looked like:
Error: Remote JSM server is not responding on host vortex59
02-19-2020 03:31:02:827 114114 main: Error initializing RM connection. Exiting.
What is missing is the string Error: Remote JSM server is not responding on host, which I was using to filter out these mass jsrun failures.
I will update the CDash analysis queries to filter based on Error initializing RM connection. Exiting instead of Error: Remote JSM server is not responding on host.
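As a quick local sanity check on a 'vortex' build tree (assuming the standard ctest log location), one can compare how many lines of test output match the new, more general string versus the old one:

```bash
# Count occurrences of each error string in the ctest log for the last run.
grep -c 'Error initializing RM connection. Exiting' Testing/Temporary/LastTest.log
grep -c 'Error: Remote JSM server is not responding on host' Testing/Temporary/LastTest.log
```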
There was an interesting manifestation of the problem. As shown in:
- https://jenkins-srn.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg/51/consoleFull
the update, configure, build, and test results were missing for this build due to the lrun command failing with:
05:09:58 + env CTEST_DO_TEST=FALSE lrun -n 1 /vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg/Trilinos/cmake/ctest/drivers/atdm/ctest-s-driver.sh
05:10:00 Error: Remote JSM server is not responding on host vortex5903-22-2020 03:10:00:000 68525 main: Error initializing RM connection. Exiting.
Now, there were also mass jsrun failures when the tests ran, as shown here, but what is interesting is that 545 of the tests actually passed! And several of those tests were np=4 MPI tests.
What this suggests is that it is not true that once the first jsrun command fails, all of the following jsrun commands will fail. If that were the case, then after the first jsrun command (for the update, configure, build, and test results) failed, all of the following jsrun commands should have failed as well.
What this also shows is that we need to update the cdash_analyze_and_report.py tool to look on CDash for any missing results, including update, configure, build, or test results. If any of those are missing for a given build on CDash, then the build should be listed as "missing" and the missing items should be listed in the "Missing Status" field. For this build, that would mean listing "Update", "Configure", and "Build" in the "Missing Status" field.
FYI: They closed:
- https://servicedesk.sandia.gov/servicedesk/customer/portal/4/ONESTOP-11603
assuming this was fixed because of an upgrade of 'vortex' last month. But this was not resolved, so I created the new issue:
- https://servicedesk.sandia.gov/servicedesk/customer/portal/4/ONESTOP-16877
I think we are going to be living with this problem for the foreseeable future (so we might as well further refine our processes to deal with this better).
Sorry, this is Bing. I see the same error on our testbed with the LSF resource management system.
I came across this discussion and it seems the issue is still alive. Any suggestions/conclusions on this?
Thanks.
I see the same error on our testbed with the LSF resource management system.
@lalalaxla, as far as I know, these errors are unique to the ATS-2 system and the jsrun driver.
Any suggestion/conclusion on this?
No, this is still ongoing, as you can see from the mass random test failures shown here. But we just filter these out, and they don't do too much damage to our ability to do testing on this system and stay on top of things.
There is supposed to be a system upgrade in the near future that should resolve these issues. (Fingers crossed.)
I see, thanks. Then my question is: how do we filter out the error?
I am reporting this error from an Oak Ridge testbed system Tundra. It is a single rack with 18 nodes similar to Summit, but with reduced hardware (POWER8 CPUs, ½ the bandwidth and memory per node but similar NVMe SSDs).
We actually see the same error on Summit (https://docs.olcf.ornl.gov/systems/summit_user_guide.html). The error is reported on this page, you can search "Remote JSM server" to locate it.
Then, my question is how to filter out the error?
@lalalaxla, are you reporting results to CDash? If so, and if you have a very recent version of CDash, then you can use the new "Test Output" filter on the cdash/queryTests.php page to filter them out. See above. You can see an example of these filters in action in:
- https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2020-06-13&filtercount=10&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=94&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=94&value5=OPAL%20ERROR%3A%20Unreachable&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field8=testoutput&compare8=96&value8=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field9=testoutput&compare9=94&value9=jsrun%20return%20value%3A%20255&field10=testoutput&compare10=96&value10=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect
But, again, you will need a very recent version of CDash. (I can provide the info on a safe recent version.)
Roscoe,
Thanks! I will check internally to see if we can take the same steps and have the most up-to-date CDash. I will get back to you soon.
Bing
FYI: The big upgrade of the software env and LSF on 'vortex' that occurred over the last few days did NOT fix these mass random 'jsrun' failures. Today, as shown here, you can see that the build Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg had 1286 mass failures, as shown in this query.
So we live on with these mass random failures. But we are filtering them out okay so they are not terribly damaging to the testing process for Trilinos.
After the sysadmins changed 'vortex' to use a private launch node by default for 'bsub', it seems that all of the random jsrun failures are gone. See the evidence below. For more context and info, see ATDV-402.
As shown in the CDash queries:
- All Trilinos-atdm-ats2 builds between 2020-10-01 and 2020-11-04
- All Trilinos-atdm-ats2 non-xl builds between 2020-10-22 and 2020-11-04
- All Trilinos-atdm-ats2 builds showing more than 50 test failures between 2020-10-01 and 2020-11-04
it looks like whatever they did to update 'vortex' has made all of the mass random test failures go away (or at least we have not seen any mass failures for over 2 weeks, with the last one on 2020-10-21). Those queries show that in the 2 weeks from 2020-10-22 through 2020-11-04, there were 78 + 26 = 104 Trilinos-atdm-ats2 builds that ran tests, and not one of them had any mass random test failures!
I think this is pretty good evidence that this issue is resolved.
Shoot, it looks like we had another case of mass random jsrun failures on 2020-11-06 shown at:
- https://testing.sandia.gov/cdash/index.php?project=Trilinos&date=2020-11-06&filtercount=1&showfilters=1&field1=buildname&compare1=61&value1=Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi
showing:
Site | Build Name | Update | Update Time | Conf Err | Conf Warn | Conf Time | Build Err | Build Warn | Build Time | Test Not Run | Test Fail | Test Pass | Test Time | Test Proc Time | Start Test Time | Labels |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | | | 0 | 4 | 4m 9s | 0 | 0 | 1m 31s | 0 | 2295 | 27 | 47m 23s | 3h 6m 1s | Nov 06, 2020 - 17:31 MST | (31 labels) |
with 2295 failing tests matching this criterion, shown at:
- https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-11-01&end=2020-11-09&filtercount=4&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi&field2=groupname&compare2=62&value2=Experimental&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=95&value4=Error%20initializing%20RM%20connection.%20Exiting
I need to reopen this issue. And I will bring back the filters for this :-(