Stalled processes not cleared on IBM i

Open richardlau opened this issue 2 years ago • 6 comments

IBM i builds have been failing on test-iinthecloud-ibmi73-ppc64_be-1 since https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/ due to a dangling node process. i.e. https://ci.nodejs.org/job/node-test-commit-ibmi/743/nodes=ibmi73-ppc64/consoleFull

10:22:35 ps awwx | grep Release/node | grep -v grep | cat
10:22:35  38123848      - A     0:25 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99) 
10:22:36 gmake[1]: *** [Makefile:532: test-ci] Error 1

This process is leftover from https://ci.nodejs.org/job/node-test-commit-ibmi/742/nodes=ibmi73-ppc64/ where parallel/test-child-process-exec-abortcontroller-promisified timed out -- the test spawns the process in https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/test/parallel/test-child-process-exec-abortcontroller-promisified.js#L15

The Node.js Makefile is supposed to be able to clear stalled/dangling out/Release/node processes in clear-stalled: https://github.com/nodejs/node/blob/68fb0bf553e2af3e0b61733d29e1e9ba7f73d9b2/Makefile#L460-L466

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		echo $${PS_OUT} | xargs kill -9; \
	fi

but it looks like on IBM i this isn't killing the process:

-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
 38123848      - A     1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$

If I add some debug output to the Makefile, I can see that xargs gets the process ID, but it looks like kill -9 isn't terminating the process:

-bash-5.1$ git diff
diff --git a/Makefile b/Makefile
index a6549a8474..5bf612a70d 100644
--- a/Makefile
+++ b/Makefile
@@ -463,6 +463,7 @@ clear-stalled:
        @PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
        if [ "$${PS_OUT}" ]; then \
                echo $${PS_OUT} | xargs kill -9; \
+               echo $${PS_OUT} | xargs echo =; \
        fi

 .PHONY: test-build
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$ gmake clear-stalled
Clean up any leftover processes but don't error if found.
ps awwx | grep Release/node | grep -v grep | cat
 38123848      - A     1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
= 38123848
-bash-5.1$ ps -ef | grep out/Release/node
    iojs 38123848        1   0   Apr 26      -  1:18 /home/IOJS/build/workspace/node-test-commit-ibmi/nodes/ibmi73-ppc64/out/Release/node -e setInterval(()=>{}, 99)
-bash-5.1$

@ThePrez Any ideas?

richardlau avatar Apr 29 '22 12:04 richardlau

(I'm assuming we can manually clear the stalled process to get the CI passing but it would be good if the automation in the build scripts just worked.)

richardlau avatar Apr 29 '22 13:04 richardlau

This is very strange, indeed! The phenomenon is easily repeatable by simply running node -e "setInterval(()=>{}, 99)" in a background job.

Strangely:

  • kill -9 from a bash shell works
  • kill -9 from sh works
  • kill -9 from xargs inside a Makefile does NOT work 👎
  • kill -KILL from a bash shell works
  • kill -KILL from sh works
  • kill -KILL from xargs inside a Makefile works

So an easy fix would be to simply change the Makefile to use -KILL instead of -9. I can't imagine that would cause any issue on other platforms.

Regardless, I'm still trying to figure out root cause. IBM i has two different types of signals: ILE and PASE (Node.js runs in PASE), and the numerical representations differ:

  • PASE SIGKILL = 9
  • ILE SIGIO = 9
  • ILE SIGKILL = 12

But a kill -12 from xargs in the Makefile also fails, so I think that's a "red herring."
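
The working combination from the list above (a signal name passed through xargs) can be checked outside the Makefile with a short script. This is only a minimal sketch: `sleep 300` is an assumed stand-in for the dangling node process, and the variable names are illustrative.

```shell
#!/bin/sh
# Stand-in for the dangling process (assumption: any long-running child
# is enough to exercise the kill path).
sleep 300 &
pid=$!

# Same shape as the Makefile recipe, but with the signal given by name.
echo "$pid" | xargs kill -KILL

# Reap the child and record how it ended.
status=0
wait "$pid" 2>/dev/null || status=$?

# kill -0 delivers no signal; it only reports whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then
    result="still running"
else
    result="terminated"
fi
echo "$result (wait status: $status)"
```

Replacing `-KILL` with `-9` in the xargs line is the variant reported above as failing on IBM i.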

ThePrez avatar Apr 29 '22 19:04 ThePrez

~~Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?~~ oops, no you don't. Disregard!

ThePrez avatar Apr 29 '22 19:04 ThePrez

Regardless, that xargs invocation should have -n 1. Would you like me to open a separate issue for that?

👍

richardlau avatar Apr 29 '22 20:04 richardlau

this works

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		kill -9 $${PS_OUT}; \
	fi

as does (as mentioned)

clear-stalled:
	$(info Clean up any leftover processes but don't error if found.)
	ps awwx | grep Release/node | grep -v grep | cat
	@PS_OUT=`ps awwx | grep Release/node | grep -v grep | awk '{print $$1}'`; \
	if [ "$${PS_OUT}" ]; then \
		echo $${PS_OUT} | xargs -t kill -KILL; \
	fi

In my experimentation, it seems that xargs and -9 together are needed to recreate the problem. This makes no sense.

ThePrez avatar Apr 29 '22 21:04 ThePrez

We debugged this today and discovered that the root cause is a bug in GNU kill, i.e. /QOpenSys/pkgs/bin/kill. https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=900b5621e685df7ffd001fc64bc9d44b06b13900

This affects using GNU kill with pretty much any numeric value, not just kill -9. As a "workaround", you could use the correct bit pattern for signal 9 on AIX, i.e. /QOpenSys/pkgs/bin/kill -589825 pid 😂😂😂 Otherwise, you can use the system version of kill at /QOpenSys/usr/bin/kill or use kill -KILL.

I'm working on an update with the fix, but due to some infrastructure issues this won't be available for a while.
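
Since xargs always executes an external program, while bash and most sh implementations provide `kill` as a builtin, it can help to see which `kill` each path resolves to. That difference would be consistent with the earlier observation that only the xargs path failed, assuming the interactive shells used their builtin. The sketch below is illustrative only; `find_on_path` is a hypothetical helper, not part of the Node.js Makefile.

```shell
#!/bin/sh
# Hypothetical helper: print the first executable called $1 found on PATH,
# which is the program a command such as `xargs kill -9` would run.
find_on_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$1" ] && [ -x "$dir/$1" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$1"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# How the current shell resolves a bare `kill`.
type kill

# What xargs would execute instead.
find_on_path kill || echo "no external kill found on PATH"
```

On the affected machine one would expect the second line to print /QOpenSys/pkgs/bin/kill if that directory comes first on PATH, but that depends on the local PATH setup.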

kadler avatar Jul 12 '22 22:07 kadler

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions[bot] avatar May 09 '23 00:05 github-actions[bot]

@abmusse is this something you could take a look at?

mhdawson avatar May 16 '23 18:05 mhdawson

Yes, I'll take a look at this one

abmusse avatar May 16 '23 19:05 abmusse

@mhdawson

Looks like we pushed the fix in coreutils-gnu 8.25-9. We have an outdated version on the build system. Likely we just need to run a yum upgrade coreutils-gnu on the build system.

abmusse avatar May 16 '23 19:05 abmusse

@abmusse On test-iinthecloud-ibmi73-ppc64_be-1:

-bash-5.1$ yum info coreutils-gnu
Installed Packages
Name        : coreutils-gnu
Arch        : ppc64
Version     : 8.25
Release     : 6
Size        : 118 M
Repo        : installed
From repo   : ibm
Summary     : GNU coreutils
URL         : https://www.gnu.org/software/coreutils
License     : GPL-3.0-or-later
Description : The GNU Core Utilities are the basic file, shell and text manipulation utilities
            : of the GNU operating system. These are the core utilities which are expected to
            : exist on every operating system.

-bash-5.1$ yum upgrade coreutils-gnu
Setting up Upgrade Process
No Packages marked for Update
-bash-5.1$

richardlau avatar May 17 '23 12:05 richardlau

What repos does this box have?

yum repolist all

We migrated base repos last year. This box may need the ibmi-repos upgrade.

https://ibmi-oss-docs.readthedocs.io/en/latest/yum/IBM_REPOS.html#transition

abmusse avatar May 17 '23 12:05 abmusse

-bash-5.1$ yum repolist all
repo id                                                                                            repo name                                                                                        status
ibm                                                                                                ibm                                                                                              enabled: 1002
ibm-7.3                                                                                            ibm-7.3                                                                                          disabled
ibmi-base                                                                                          IBM i base                                                                                       enabled: 1002
ibmi-release                                                                                       IBM i 7.3                                                                                        enabled:   67
repolist: 2071
-bash-5.1$

richardlau avatar May 17 '23 13:05 richardlau

What url does ibmi-base point to?

cat /QOpenSys/etc/yum/repos.d/ibmi-base.repo

I suspect it's outdated and the baseurl does not point to https://public.dhe.ibm.com/software/ibmi/products/pase/rpms/repo-base-7.3/

abmusse avatar May 17 '23 13:05 abmusse

We need to upgrade ibmi-repos package.

yum upgrade ibmi-repos

Then we should also disable the old ibm repo

yum-config-manager --disable ibm

After that the latest coreutils-gnu should be installable!
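
One way to confirm afterwards that the machine has the fixed package is to compare the installed version-release string against 8.25-9, the release mentioned earlier in the thread as containing the fix. This is a minimal sketch, assuming GNU `sort -V` is available and that the release field is a plain number; `has_kill_fix` is a hypothetical helper name, and the `rpm` query shown in the comment is an untested suggestion for this platform.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given "VERSION-RELEASE" string is at
# least 8.25-9. Relies on version sort (-V), assumed to be GNU sort.
has_kill_fix() {
    lowest=$(printf '%s\n%s\n' "8.25-9" "$1" | sort -V | head -n 1)
    [ "$lowest" = "8.25-9" ]
}

# On a real system the string could come from something like:
#   rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' coreutils-gnu
# Here the two values seen in this thread are used as sample input.
for nvr in 8.25-6 8.25-9; do
    if has_kill_fix "$nvr"; then
        echo "$nvr: contains the kill fix"
    else
        echo "$nvr: needs upgrade"
    fi
done
```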

abmusse avatar May 17 '23 13:05 abmusse

@abmusse thanks for taking a look, and great to see you and @richardlau moving it forward.

mhdawson avatar May 17 '23 13:05 mhdawson

@richardlau I upgraded ibmi-repos and coreutils-gnu on iOSSBld1.iInTheCloud.com

abmusse avatar May 17 '23 15:05 abmusse

Ansible changes, including using the correct yum repositories: https://github.com/nodejs/build/pull/3358

richardlau avatar May 17 '23 21:05 richardlau

We are now using the correct IBM i yum repositories and coreutils-gnu package.

richardlau avatar May 23 '23 16:05 richardlau