op-test
op-test copied to clipboard
RunHostTest: op-test gets killed frequently when running tests
The following trace is observed and op-test process gets killed.
2019-06-19 06:35:24,328:op-test.common.OpTestUtil:dump_versions:INFO:Log Location: /home/harish/upstream/op-test-framework/test-reports/test-run-20190619040239/*debug*
2019-06-19 06:35:24,328:op-test.common.OpTestUtil:dump_versions:INFO:
----------------------------------------------------------
OpTestSystem Firmware Versions Tested
(if flashed things like skiboot.lid, may not be accurate)
----------------------------------------------------------
Firmware Versions Unavailable
----------------------------------------------------------
----------------------------------------------------------
File "./op-test", line 830, in <module>
res = run_tests(t, failfast=OpTestConfiguration.conf.args.failfast)
File "./op-test", line 805, in run_tests
return runner(**kwargs).run(t)
File "/usr/lib64/python2.7/unittest/runner.py", line 151, in run
test(result)
File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
test(result)
File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
test(result)
File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
test(result)
File "/usr/lib64/python2.7/unittest/suite.py", line 70, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python2.7/unittest/suite.py", line 108, in run
test(result)
File "/usr/lib64/python2.7/unittest/case.py", line 433, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "/home/harish/upstream/op-test-framework/testcases/RunHostTest.py", line 84, in runTest
con.run_command(line, timeout=self.host_cmd_timeout)
File "/home/harish/upstream/op-test-framework/common/OpTestSSH.py", line 214, in run_command
output = self.try_command(term_obj, command, timeout)
File "/home/harish/upstream/op-test-framework/common/OpTestUtil.py", line 1454, in try_command
rc = pty.expect([expect_prompt, r"[Pp]assword for", pexpect.TIMEOUT, pexpect.EOF], timeout=timeout)
File "/home/harish/upstream/op-test-framework/common/OPexpect.py", line 87, in expect
searchwindowsize=searchwindowsize)
File "/usr/lib/python2.7/site-packages/pexpect/spawnbase.py", line 341, in expect
timeout, searchwindowsize, async_)
File "/usr/lib/python2.7/site-packages/pexpect/spawnbase.py", line 369, in expect_list
return exp.expect_loop(timeout)
File "/usr/lib/python2.7/site-packages/pexpect/expect.py", line 111, in expect_loop
incoming = spawn.read_nonblocking(spawn.maxread, timeout)
File "/usr/lib/python2.7/site-packages/pexpect/pty_spawn.py", line 499, in read_nonblocking
if (timeout != 0) and select(timeout):
File "/usr/lib/python2.7/site-packages/pexpect/pty_spawn.py", line 449, in select
return select_ignore_interrupts([self.child_fd], [], [], timeout)[0]
File "/usr/lib/python2.7/site-packages/pexpect/utils.py", line 143, in select_ignore_interrupts
return select.select(iwtd, owtd, ewtd, timeout)
File "./op-test", line 54, in optest_handler
traceback.print_stack()
2019-06-19 06:35:24,331:op-test:optest_handler:ERROR:OpTestSystem exiting from signal 'SIGHUP' signum=1
Config used
[op-test]
bmc_type=OpenBMC
bmc_username=root
bmc_password=...
bmc_ip=...
bmc_usernameipmi=root
bmc_passwordipmi=...
host_ip=...
host_user=root
host_password=...
host_cmd_file=command.txt
host_cmd_timeout=720000
host_cmd_resultpath='/home/tests/results'
Following commands are run command.txt
cd tests
python avocado-setup.py --run-suite host_test
# ./op-test --run testcases.RunHostTest -c wsp12.conf --host-cmd-file command.txt
Issue is observed very frequently and irrespective of bmc_type
File "/home/harish/upstream/op-test-framework/testcases/RunHostTest.py", line 84, in runTest
con.run_command(line, timeout=self.host_cmd_timeout)
File "/home/harish/upstream/op-test-framework/common/OpTestSSH.py", line 214, in run_command
output = self.try_command(term_obj, command, timeout)
File "/home/harish/upstream/op-test-framework/common/OpTestUtil.py", line 1454, in try_command
rc = pty.expect([expect_prompt, r"[Pp]assword for", pexpect.TIMEOUT, pexpect.EOF], timeout=timeout)
File "/home/harish/upstream/op-test-framework/common/OPexpect.py", line 87, in expect
searchwindowsize=searchwindowsize)
Looks like it's hitting a timeout waiting for "[Pp]assword for" to appear. Please debug further.
The issue seems to occur when the ssh connection of the triggered server is disconnected.
Server running the op-test instance in background thought nohup + &
[root@host op-test-2]# date
Fri Jun 26 03:22:08 EDT 2020
[root@host op-test-2]# Connection to lep8a.aus.stglabs.ibm.com closed.
After logging in again and checking, the background op-test process gets killed at the exact point of disconnection
2020-06-26 03:22:09,144:op-test.common.OpTestUtil:cleanup:INFO:OpTestSystem Starting to Gather ESEL's
...
...
leading to the above trace.
@harish-24 this could help solve that issue you are facing, seems to be framework catches and sighup
bydefault and terminates..
diff --git a/common/OPexpect.py b/common/OPexpect.py
index d2d1832..820bcc5 100644
--- a/common/OPexpect.py
+++ b/common/OPexpect.py
@@ -42,7 +42,7 @@ class spawn(pexpect.spawn):
def __init__(self, command, args=[], maxread=8000,
searchwindowsize=None, logfile=None, cwd=None, env=None,
- ignore_sighup=False, echo=True, preexec_fn=None,
+ ignore_sighup=True, echo=True, preexec_fn=None,
encoding='utf-8', codec_errors='ignore', dimensions=None,
failure_callback=None, failure_callback_data=None):
self.command = command
diff --git a/op-test b/op-test
index 45f5a8c..4fa42a4 100755
--- a/op-test
+++ b/op-test
@@ -138,7 +138,7 @@ def optest_handler(signum, frame):
# try/except block, control will be given back to that block
-signal.signal(signal.SIGHUP, optest_handler)
+#signal.signal(signal.SIGHUP, optest_handler)
signal.signal(signal.SIGINT, optest_handler)
signal.signal(signal.SIGQUIT, optest_handler)
signal.signal(signal.SIGILL, optest_handler)
will have to see for any other issues and will work on a cleaner patch..
@sathnaga It does help, now I dont see the issue with the above hack.
Care to expand a bit more?
If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, is only natural that a SIGHUP (Hangup) signal is sent to op-test, who needs to be killed imho..
If we're running with nohup ./op-test.py
we should not have been seeing hangups in the case of disconnection.. So I'm confused to what the problem is (after all getting traces if the connection is dropped is expected) and what's the proposed solution here..
Care to expand a bit more?
If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, is only natural that a SIGHUP (Hangup) signal is sent to op-test, who needs to be killed imho..
If we're running with
nohup ./op-test.py
we should not have been seeing hangups in the case of disconnection.. So I'm confused to what the problem is (after all getting traces if the connection is dropped is expected) and what's the proposed solution here..
Yes, we are running with nohup ./op-test.py ... &
and the problem is that we are still seeing hangups in case of disconnection. The hack provided by @sathnaga in https://github.com/open-power/op-test/issues/508#issuecomment-650077030 avoids that problem.
Care to expand a bit more?
If someone is connected (ssh or whatever) to a machine where the op-test.py instance is running (i.e., driving the test, not a target test system), and that connection is dropped, is only natural that a SIGHUP (Hangup) signal is sent to op-test, who needs to be killed imho..
If we're running with
nohup ./op-test.py
we should not have been seeing hangups in the case of disconnection.. So I'm confused to what the problem is (after all getting traces if the connection is dropped is expected) and what's the proposed solution here..
@klauskiwi More details, the issue does not show up if you run nohup ./op-test.py ... &
cleanly exits (1) from the terminal like typing exit
but it happens (2) while abruptly closing the terminal, in some cases consider if you are running/driving a test from remote terminal and ssh connection drops due to network can also induce the same behaviour, this is because of terminal/bash behaviour.
Demo: (1) Clean exit:
Terminal 1:
$ echo $$
150465
--------------------> do start strace here
$ exit
Terminal 2:
$ strace -e signal -p 150465
strace: Process 150465 attached
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=0x55cb0434b6d0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGHUP, {sa_handler=0x55cb0434bfb0, sa_mask=[HUP INT ILL TRAP ABRT BUS FPE USR1 SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x55cb0434bfb0, sa_mask=[HUP INT ILL TRAP ABRT BUS FPE USR1 SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a4b70, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGWINCH, {sa_handler=0x55cb0434b6c0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb043a37b0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, {sa_handler=0x55cb0434bbe0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f05ecbfcab0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
+++ exited with 0 +++
==========No sighup
(2) Abrupt close:
Terminal 1:
$ echo $$
150512
just close this terminal
Terminal 2
$ strace -e signal -p 150512
strace: Process 150512 attached
...
...
SEGV USR2 PIPE ALRM TERM XCPU XFSZ VTALRM SYS], sa_flags=SA_RESTORER, sa_restorer=0x7f1fe3f9cab0}, 8) = 0
kill(150512, SIGHUP) = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=150512, si_uid=1000} ---
+++ killed by SIGHUP +++
my quick hack just disables sighup
signal handler, I doubt that's a cleaner solution, am still thinking of ways to distinguish above behaviour vs all other sighup
possibilities, so that can only handle that better.
@harish-24 Any updates here ?
@harihare The issue remains the same. Not sure if @sathnaga has any updates after the hack that has been posted.