build
Stalled processes not cleared on IBM i
IBM i builds have been failing on test-iinthecloud-ibmi73-ppc64_be-1
since https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/ due to a dangling node process.
See https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/consoleFull
10:22:35 ps awwx | grep Release/node | grep -v grep | cat
10:22:35 38123848 - A 0:25 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
10:22:36 gmake[1]: *** [Makefile:532: test-ci] Error 1
This process is leftover from https://ci.nodejs.org/job/node-test-commit-ibmi/742/nodes=ibmi73-ppc64/ where parallel/test-child-process-exec-abortcontroller-promisified
timed out -- the test spawns the process in https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/test/parallel/test-child-process-exec-abortcontroller-promisified.js#L15
The Node.js Makefile is supposed to be able to clear stalled/dangling out/Release/node processes in clear-stalled: https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/Makefile#L460-L466
clear-stalled:
$(info Clean up any leftover processes but don't error if found.)
ps awwx | grep Release/node | grep -v grep | cat
@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
if [ "$${PS_OUT}" ]; then \
echo $${PS_OUT} | xargs kill -9; \
fi
but it looks like on IBM i this isn't killing the process:
-bash-5.1$ ps -ef | grep out/Release/node
iojs 38123848 1 0 Apr 26 - 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
38123848 - A 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ ps -ef | grep out/Release/node
iojs 38123848 1 0 Apr 26 - 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$
If I add some debug output to the Makefile, I can see that xargs gets the process ID, but it looks like kill -9 isn't terminating the process?
-bash-5.1$ git diff
diff --git a/Makefile b/Makefile
index a6549a8474..5bf612a70d 100644
--- a/Makefile
+++ b/Makefile
@@ -463,6 +463,7 @@ clear-stalled:
@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
if [ "$${PS_OUT}" ]; then \
echo $${PS_OUT} | xargs kill -9; \
+ echo $${PS_OUT} | xargs echo =; \
fi
.PHONY: test-build
-bash-5.1$ ps -ef | grep out/Release/node
iojs 38123848 1 0 Apr 26 - 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
38123848 - A 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
= 38123848
-bash-5.1$ ps -ef | grep out/Release/node
iojs 38123848 1 0 Apr 26 - 1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$
@ThePrez Any ideas?
(I'm assuming we can manually clear the stalled process to get the CI passing but it would be good if the automation in the build scripts just worked.)
This is very strange, indeed!
The phenomenon is easily repeatable by simply running node -e "setInterval(()=>{}, 99)" in a background job.
Strangely:
- kill -9 from a bash shell works
- kill -9 from sh works
- kill -9 from xargs inside a Makefile does NOT work 👎
- kill -KILL from a bash shell works
- kill -KILL from sh works
- kill -KILL from xargs inside a Makefile works
So an easy fix would be to simply change the Makefile to use -KILL instead of -9. I can't imagine that would cause any issue on other platforms.
Regardless, I'm still trying to figure out root cause. IBM i has two different types of signals: ILE and PASE (Node.js runs in PASE), and the numerical representations differ:
- PASE SIGKILL = 9
- ILE SIGIO = 9
- ILE SIGKILL = 12
But a kill -12 from xargs in the Makefile also fails, so I think that's a "red herring."
~~Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?~~
Oops, no you don't. Disregard!
Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?
👍
this works
clear-stalled:
$(info Clean up any leftover processes but don't error if found.)
ps awwx | grep Release/node | grep -v grep | cat
@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
if [ "$${PS_OUT}" ]; then \
kill -9 $${PS_OUT}; \
fi
as does (as mentioned)
clear-stalled:
$(info Clean up any leftover processes but don't error if found.)
ps awwx | grep Release/node | grep -v grep | cat
@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
if [ "$${PS_OUT}" ]; then \
echo $${PS_OUT} | xargs -t kill -KILL; \
fi
In my experimentation, it seems that xargs and -9 together are needed to recreate the problem. This makes no sense.
We debugged this today and discovered that the root cause is a bug in GNU kill, i.e. /QOpenSys/pkgs/bin/kill: https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=900b5621e685df7ffd001fc64bc9d44b06b13900
This affects GNU kill with pretty much any numeric signal value, not just kill -9. As a "workaround", you could use the correct bit pattern for signal 9 on AIX, i.e. /QOpenSys/pkgs/bin/kill -589825 pid :joy::joy::joy: Otherwise, you can use the system version of kill at /QOpenSys/usr/bin/kill or use kill -KILL.
I'm working on an update with the fix, but due to some infrastructure issues this won't be available for a while.
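For reference, the kill -KILL workaround can be tried outside of Make with a small sketch like the one below. The helper name kill_by_name is hypothetical, and the sleep process only stands in for a stalled node. The point it illustrates is that xargs runs whichever external kill binary is first on PATH rather than a shell builtin, which is presumably why only the xargs variant hit the GNU kill bug.

```shell
# Hypothetical helper: send SIGKILL by name to each PID given as an argument.
# xargs executes the external kill binary found on PATH, not the shell builtin.
kill_by_name() {
  [ "$#" -gt 0 ] || return 0
  echo "$@" | xargs kill -KILL
}

# Demonstration with a throwaway background process standing in for a stalled node.
sleep 300 &
pid=$!
kill_by_name "$pid"

# Reap the job, then check whether the PID still exists.
wait "$pid" 2>/dev/null || true
if kill -0 "$pid" 2>/dev/null; then
  echo "still running"
else
  echo "terminated"
fi
```

Using the signal name avoids depending on how the external kill interprets numeric values.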
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
@abmusse is this something you could take a look at?
Yes, I'll take a look at this one
@mhdawson
Looks like we pushed the fix in coreutils-gnu 8.25-9. We have an outdated version on the build system. Likely we just need to run a yum upgrade coreutils-gnu on the build system.
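A rough way to check whether an installed version-release string is at least 8.25-9 is sketched below. The helper name version_at_least is hypothetical, the comparison assumes a sort that supports -V (GNU sort does), and the two version strings are simply the ones quoted in this thread.

```shell
# Hypothetical helper: succeed if version $1 is >= version $2.
# Assumes GNU sort, whose -V option orders version strings.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

fixed="8.25-9"      # release said above to contain the kill fix
installed="8.25-6"  # version-release reported by yum info on the build machine

if version_at_least "$installed" "$fixed"; then
  echo "coreutils-gnu $installed includes the fix"
else
  echo "coreutils-gnu $installed predates the fix in $fixed"
fi
```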
@abmusse On test-iinthecloud-ibmi73-ppc64_be-1:
-bash-5.1$ yum info coreutils-gnu
Installed Packages
Name : coreutils-gnu
Arch : ppc64
Version : 8.25
Release : 6
Size : 118 M
Repo : installed
From repo : ibm
Summary : GNU coreutils
URL : https://www.gnu.org/software/coreutils
License : GPL-3.0-or-later
Description : The GNU Core Utilities are the basic file, shell and text manipulation utilities
: of the GNU operating system. These are the core utilities which are expected to
: exist on every operating system.
-bash-5.1$ yum upgrade coreutils-gnu
Setting up Upgrade Process
No Packages marked for Update
-bash-5.1$
What repos does this box have?
yum repolist all
We migrated base repos last year. This box may need the ibmi-repos upgrade.
https://ibmi-oss-docs.readthedocs.io/en/latest/yum/IBM_REPOS.html#transition
-bash-5.1$ yum repolist all
repo id repo name status
ibm ibm enabled: 1002
ibm-7.3 ibm-7.3 disabled
ibmi-base IBM i base enabled: 1002
ibmi-release IBM i 7.3 enabled: 67
repolist: 2071
-bash-5.1$
What url does ibmi-base point to?
cat /QOpenSys/etc/yum/repos.d/ibmi-base.repo
I suspect it's outdated and the baseurl does not point to https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/
We need to upgrade the ibmi-repos package.
yum upgrade ibmi-repos
Then we should also disable the old ibm repo
yum-config-manager --disable ibm
After that the latest coreutils-gnu should be installable!
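The baseurl check described above could be scripted roughly as follows. This is only a sketch: repo_baseurl is a hypothetical helper, it assumes the .repo file uses plain baseurl=... lines, and the expected URL and file path are the ones quoted in this comment.

```shell
# Hypothetical helper: print the baseurl value(s) found in a yum .repo file.
repo_baseurl() {
  sed -n 's/^[[:space:]]*baseurl[[:space:]]*=[[:space:]]*//p' "$1"
}

# Expected URL and repo file path as quoted in the comment above.
expected="https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/"
repo_file="/QOpenSys/etc/yum/repos.d/ibmi-base.repo"

if [ -r "$repo_file" ]; then
  if [ "$(repo_baseurl "$repo_file")" = "$expected" ]; then
    echo "ibmi-base points at the expected repo"
  else
    echo "ibmi-base looks outdated; consider: yum upgrade ibmi-repos"
  fi
else
  echo "no readable repo file at $repo_file"
fi
```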
@abmusse thanks for taking a look, and great to see you and @richardlau moving it forward.
@richardlau
I upgraded ibmi-repos and coreutils-gnu on iOSSBld1.iInTheCloud.com
Ansible changes, including using the correct yum repositories: https://github.com/nodejs/build/pull/3358
We are now using the correct IBM i yum repositories and coreutils-gnu package.