Shifter processes not responsive to terminate, interrupt signals on interactive systems

Open alanm-cray opened this issue 8 years ago • 8 comments

This was found during final testing of the CL_SHFTR_term_longer_ok test. The test starts a shifter session and terminates it before it completes. It then attempts the same session again, this time allowing it to complete. The test is intended to confirm that a terminated shifter session does not leave the system in a state where a following session will fail.

When I run this test on systems with a WLM (PBS or MT, confirmed), the test works as expected; the session terminates immediately, and the next session works as expected, too. However, if I attempt this test on an interactive system, shifter does not appear to acknowledge the TERM signal at all; it ends up completing the session.

Steps to Recreate:

Pre-step 1. Log into an interactive system.
Pre-step 2. Make sure 'ostesthub.us.cray.com/ostest/ubuntu-delayed:latest' is pulled down to the image gateway.

Method 1: Run: ubrun -t -P -T CL_SHFTR_term_longer_ok -M shifter -e OSTEST _CL OSTEST_CL=/tests/shifter/test_shifter -T CL_SHFTR_term_longer_ok interrupt_cleanup_allow_run_ok

Method 2:

  1. Run: aprun -n 1 -b shifter --image=docker:ostesthub.us.cray.com/ostest/ubuntu-delayed:latest /delayed.sh
  2. Wait for 'start sleep with delay 90' to appear.
  3. Wait a few seconds (<30) and press Ctrl-C.
     Expected: The command prompt returns immediately after aprun reports "Caught signal Interrupt, sending to application".
     Actual: The command prompt does not return until the full 90 seconds have passed, and the 3 lines from /etc/lsb-release from the image are printed.

Method 3:

  1. Start a second login to the same system. 'su' to root.
  2. Run: aprun -n 1 -b shifter --image=docker:ostesthub.us.cray.com/ostest/ubuntu-delayed:latest /delayed.sh
  3. Wait for 'start sleep with delay 90' to appear.
  4. Run 'apstat' on the second login. Wait a few seconds (<30), then, using the Apid reported by 'apstat', run: apkill -15 <apid>
     Expected: The command prompt returns immediately after aprun reports "Caught signal Terminate, sending to application".
     Actual: The command prompt does not return until the full 90 seconds have passed, and the 3 lines from /etc/lsb-release from the image are printed.

As stated above, this behavior does not occur when a WLM is involved. Also, if I use 'aprun -n 1 -b cpubound -a 90' as the session command instead, every interrupt or terminate signal takes effect immediately.
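
The timing check in Methods 2 and 3 can be sketched as a small script. This is only an illustration under stated assumptions: the helper name `seconds_to_exit` and its parameters are hypothetical, and it signals the launched process directly rather than going through Ctrl-C or apkill, so it approximates the manual steps rather than replacing them.

```python
import signal
import subprocess
import time


def seconds_to_exit(cmd, sig=signal.SIGTERM, settle=5.0, timeout=120.0):
    """Start cmd, wait `settle` seconds, send `sig`, and time the exit.

    Returns the number of seconds between sending the signal and the
    process exiting, or None if it was still running after `timeout`
    seconds (in which case it is killed with SIGKILL).
    """
    proc = subprocess.Popen(cmd)
    time.sleep(settle)              # give the session time to start
    sent_at = time.monotonic()
    proc.send_signal(sig)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                 # last resort, like 'apkill -9'
        proc.wait()
        return None
    return time.monotonic() - sent_at
```

Passing the aprun command line from Method 2 as a list of arguments and `sig=signal.SIGINT` would show the difference: a value near zero means the signal was honored, while a value close to the remaining sleep time matches the behavior reported here.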

If I use 'apkill -9' instead, this does immediately kill the shifter session. Still, this seems pretty drastic.

Considering that most, if not all, customers run with some kind of WLM, I don't feel this is feature critical at this time. Also, a user can still send a KILL signal if they really need to stop the session. If that weren't possible, I would consider pushing this up to 'urgent', as it would then require an administrator or operator to run 'apkill'.

alanm-cray Sep 15 '16 14:09

Hi Alan,

Could you please confirm which commit-level you're seeing this on?

I suspect this may be fixed in 504742514460f97755d0b805be86a6e2302019dc from August 17.

Thanks, Doug

dmjacobsen Sep 15 '16 16:09

We will attempt to reproduce this with the most recent Shifter code, specifically 5047425. The code refresh occurred on Sept 13th so it seems likely that we did not have commit 5047425.

alanm-cray Sep 22 '16 18:09

Please consider looking at c7b22c4fcf8fa63b4741b346e0bfb61d2e38b4e5 (yesterday); it isn't specifically relevant to this matter but corrects a number of other items.

dmjacobsen Sep 22 '16 18:09

I still see the problem testing against the 16.08.3 release. Apologies for taking so long to check this.

cdm-work Dec 21 '16 17:12

Did this get fixed?

scanon Mar 25 '18 18:03

Is there a new release in which it is believed to be fixed?

cdm-work Mar 26 '18 15:03

We are working on a release now, but I'm not sure if this was addressed or not. @dmjacobsen?

scanon Mar 26 '18 16:03

shifter explicitly ignores SIGTERM/SIGINT/SIGHUP/SIGSTOP during container setup; however, it restores the original handlers (whatever they might have been) immediately before execution of the targeted process.
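
As a rough illustration of that ignore-then-restore pattern (not shifter's actual code, which is C), here is a minimal Python sketch. The name `signals_ignored` is hypothetical, and SIGSTOP is left out because POSIX does not allow it to be ignored.

```python
import contextlib
import signal

# Signals ignored while setup is in progress. SIGSTOP is omitted from this
# sketch: it cannot be ignored, and Python raises OSError if you try.
SETUP_SIGNALS = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)


@contextlib.contextmanager
def signals_ignored(signums=SETUP_SIGNALS):
    """Ignore `signums` inside the block, then restore the saved handlers."""
    saved = {}
    try:
        for signum in signums:
            # signal.signal() returns the handler that was installed before.
            saved[signum] = signal.signal(signum, signal.SIG_IGN)
        yield
    finally:
        for signum, handler in saved.items():
            # A handler of None means it was not installed from Python
            # and cannot be restored through this API, so skip it.
            if handler is not None:
                signal.signal(signum, handler)
```

Setup work would run inside `with signals_ignored():`, and the exec of the target process would happen after the block exits, once the original handlers are back in place.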

during interactive setup (i.e., without WLM integration), the only difference is that, because it needs to fork/exec mount at one point, shifter is forced to change its real uid to 0, which is later fully changed to the user. I suspect that the user's aprun is not permitted to send signals to shifter once the real uid has been changed to 0; see, e.g., ftp://ftp.gnu.org/old-gnu/Manuals/glibc-2.2.3/html_chapter/libc_24.html, which says:

"The only common exception is when you run a setuid program in a child process; if the program changes its real UID as well as its effective UID, you may not have permission to send a signal. The su program does this."
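
The permission rule behind that passage can be sketched as a small function. This is a simplified model of the POSIX kill() check, not a description of what ALPS or shifter actually does: `may_signal` and the uid 1000 are hypothetical, and treating effective uid 0 as "privileged" is an assumption of the sketch.

```python
def may_signal(sender_ruid, sender_euid, target_ruid, target_suid):
    """Simplified POSIX kill() permission check for an ordinary signal.

    The sender's real or effective uid must match the target's real uid
    or saved set-user-ID, unless the sender is privileged.
    """
    if sender_euid == 0:
        # Assumption: effective uid 0 stands in for "appropriate privileges".
        return True
    target_ids = (target_ruid, target_suid)
    return sender_ruid in target_ids or sender_euid in target_ids


USER = 1000  # hypothetical unprivileged uid

# setuid-root program just started by the user: real uid is still USER.
print(may_signal(USER, USER, target_ruid=USER, target_suid=0))  # True

# After the program changes its real uid to 0 as well.
print(may_signal(USER, USER, target_ruid=0, target_suid=0))     # False
```

Under this model, an unprivileged sender loses the ability to signal the process while its real uid is 0, which would be consistent with the suspicion above; the thread does not confirm that this is the cause.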

Some of our future plans may fix this behavior. But no, the new release does not address this.

dmjacobsen Apr 07 '18 06:04