Roscoe A. Bartlett

Results 1115 comments of Roscoe A. Bartlett

@jjellio, all things considered, given the current state of things, in my opinion, what we have now is fine and is the best we can do until they can make...

From the updated email thread documented in [CDOFA-94](https://sems-atlassian-srn.sandia.gov/browse/CDOFA-94?focusedCommentId=52797&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-52797), it seems that the upgrade of 'vortex' that would fix the problems with `jsrun` will not occur until April (or later?). The...

The update from the admins documented in [CDOFA-94](https://sems-atlassian-srn.sandia.gov/browse/CDOFA-94?focusedCommentId=52864&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-52864) is that there is a 0.5% chance that any `jsrun` invocation will fail and once one does fail, then all future `jsrun`...
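To put the 0.5% figure in perspective, here is a minimal sketch of how quickly that per-invocation rate compounds over a test run. It assumes each `jsrun` invocation fails independently with the same probability, which the admins' update does not state. The function name and the invocation counts are hypothetical and chosen only for illustration.

```python
def prob_at_least_one_failure(p_fail: float, n_invocations: int) -> float:
    """Probability that at least one of n_invocations fails.

    Assumes (hypothetically) that invocations fail independently,
    each with probability p_fail.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if n_invocations < 0:
        raise ValueError("n_invocations must be non-negative")
    # Complement of "every invocation succeeds".
    return 1.0 - (1.0 - p_fail) ** n_invocations


if __name__ == "__main__":
    p = 0.005  # the 0.5% per-invocation failure chance quoted above
    # Illustrative invocation counts only, not measured values.
    for n in (1, 100, 500, 1000, 2000):
        print(f"{n:5d} invocations: {prob_at_least_one_failure(p, n):.4f}")
```

Under that independence assumption, the chance of hitting at least one failure grows rapidly with the number of invocations in a build, which matters here because a single failure is reported to affect the `jsrun` invocations that follow it.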

CC: @jjellio FYI: As shown in [this query](https://testing-dev.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2020-03-19&filtercount=6&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Specialized&field2=buildname&compare2=65&value2=Trilinos-atdm-&field3=status&compare3=62&value3=passed&field4=testoutput&compare4=95&value4=Error%20initializing%20RM%20connection.%20Exiting&field5=testoutput&compare5=94&value5=Error%3A%20Remote%20JSM%20server%20is%20not%20responding%20on%20host&field6=testoutput&compare6=94&value6=OPAL%20ERROR%3A%20Unreachable), today in the build `Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg` we saw 1267 tests that failed with the error message: ``` Error: error in ptssup_mkcltsock_afunix() 03-19-2020...

There was an interesting manifestation of the problem. As shown in:

* https://jenkins-srn.sandia.gov/view/Trilinos%20ATDM/job/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-2019.06.24_serial_static_dbg/51/consoleFull

the update, configure, build, and test results were missing for this build due to the `lrun` command...

FYI: They closed:

* https://servicedesk.sandia.gov/servicedesk/customer/portal/4/ONESTOP-11603

assuming this was fixed because of an upgrade of 'vortex' last month. But this was not resolved, so I created the new issue:

* https://servicedesk.sandia.gov/servicedesk/customer/portal/4/ONESTOP-16877...

> I see the same error on our testbed with the LSF resource management system.

@lalalaxla, as far as I know, these errors are unique to the ATS-2 system and...

> Then, my question is how to filter out the error?

@lalalaxla, are you reporting results to CDash? If so, and if you have a very recent version of CDash,...

FYI: The big upgrade of the software env and LSF on 'vortex' that occurred over the last few days did **NOT** fix these mass random 'jsrun' failures. Today as shown...

After the sysadmins changed 'vortex' to use a private launch node by default for 'bsub', it seems that all random `jsrun` failures are gone. See the evidence below. For more...