
HPCC-29093 Stop Thor killing slaves of other Thors

Open jakesmith opened this issue 1 year ago • 5 comments

Use of killall in the init script at start/stop could match slaves from other Thors on the same box, causing those other Thors to die.

killall --exact does not prevent this matching when the process names share a long enough common prefix (e.g. thorslave_mythor and thorslave_mythor2). Switch to using pidof and kill.
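
A minimal sketch of the pidof-and-kill approach, not the actual init_thorslave change. The helper names `comm_name` and `stop_named_slaves` and the `PIDOF_CMD` override are hypothetical, added only for illustration. The first helper shows why the two example names are easy to confuse: the kernel's short command name (comm) holds at most 15 characters, so both truncate to the same string. Whether a given killall version falls back to that truncated name is an assumption here.

```shell
#!/bin/bash

# The kernel's short command name (comm) holds at most 15 characters,
# so this prints what a name looks like after that truncation.
comm_name() {
    printf '%.15s\n' "$1"
}

# Hypothetical helper: signal only the PIDs reported for one process name.
# PIDOF_CMD is an assumption for illustration; it defaults to pidof and
# exists only so that the lookup can be stubbed out.
stop_named_slaves() {
    local slaveproc="$1"
    local slavepids
    # pidof exits non-zero when nothing matches; treat that as "nothing to do".
    slavepids="$("${PIDOF_CMD:-pidof}" "${slaveproc}")" || return 0
    [ -n "${slavepids}" ] || return 0
    echo "stopping ${slaveproc}: ${slavepids}"
    # Intentionally unquoted: the lookup prints space-separated PIDs.
    kill -TERM ${slavepids}
}
```

Calling `stop_named_slaves thorslave_mythor` would signal only the PIDs that the lookup returns for that name, rather than everything killall considers a match.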

Type of change:

  • [x] This change is a bug fix (non-breaking change which fixes an issue).
  • [ ] This change is a new feature (non-breaking change which adds functionality).
  • [ ] This change improves the code (refactor or other change that does not change the functionality)
  • [ ] This change fixes warnings (the fix does not alter the functionality or the generated code)
  • [ ] This change is a breaking change (fix or feature that will cause existing behavior to change).
  • [ ] This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • [x] My code follows the code style of this project.
    • [ ] My code does not create any new warnings from compiler, build system, or lint.
  • [x] The commit message is properly formatted and free of typos.
    • [ ] The commit message title makes sense in a changelog, by itself.
    • [ ] The commit is signed.
  • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly, or...
    • [ ] I have created a JIRA ticket to update the documentation.
    • [ ] Any new interfaces or exported functions are appropriately commented.
  • [x] I have read the CONTRIBUTORS document.
  • [x] The change has been fully tested:
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    • [ ] I have checked that this change does not introduce memory leaks.
    • [ ] I have used Valgrind or similar tools to check for potential issues.
  • [x] I have given due consideration to all of the following potential concerns:
    • [ ] Scalability
    • [ ] Performance
    • [ ] Security
    • [ ] Thread-safety
    • [ ] Cloud-compatibility
    • [ ] Premature optimization
    • [ ] Existing deployed queries will not be broken
    • [ ] This change fixes the problem, not just the symptom
    • [ ] The target branch of this pull request is appropriate for such a change.
  • [x] There are no similar instances of the same problem that should be addressed
    • [ ] I have addressed them here
    • [ ] I have raised JIRA issues to address them separately
  • [ ] This is a user interface / front-end modification
    • [ ] I have tested my changes in multiple modern browsers
    • [ ] The component(s) render as expected

Smoketest:

  • [ ] Send notifications about my Pull Request position in Smoketest queue.
  • [ ] Test my draft Pull Request.

Testing:

jakesmith, Mar 02 '23 19:03

https://track.hpccsystems.com/browse/HPCC-29093 Jira not updated (user does not match)

github-actions[bot], Mar 02 '23 19:03

@Michael-Gardner - closing to run tests in this context (as failing atm)

jakesmith, Mar 03 '23 12:03

https://track.hpccsystems.com/browse/HPCC-29093 Jira not updated (user does not match)

github-actions[bot], Mar 03 '23 14:03

It looks like the change is causing the slave process not to be killed.

The regression query test fails to start because Thor restarted (not new in itself here) and the new slave ran into problems, because the old slave(s) had not terminated, i.e. they had not been terminated properly by init_thorslave (stop_slaves).

This is the evidence that the old and new slaves were co-existing. It is from the thorslave log in the captured artifact ('regression-run-logs-thor-_i_l___ecl', from https://github.com/hpcc-systems/HPCC-Platform/actions/runs/4323622200):

00018565 USR 2023-03-03 13:27:55.292  8229  8229 "QueryDone, removed W20230303-132723graph1 from jobs"
00018566 USR 2023-03-03 13:28:25.290  8229  8280 "ERROR: 4: slavmain.cpp(1495) : Unexpected process termination (ep:10.1.0.83:20000)"
00000000 PRG 2023-03-03 13:28:25.474 83896 83896 "Opened log file /home/runner/work/HPCC-Platform/HPCC-Platform/install/var/log/HPCCSystems/mythor/thorslave.1.2023_03_03.log"
00000001 PRG 2023-03-03 13:28:25.474 83896 83896 "Build dummytag"
00018567 PRG 2023-03-03 13:28:25.522  8229  8229 "Stopped watchdog"
00018568 PRG 2023-03-03 13:28:26.454  8229  8275 "SYS: LPT=27 APT=0 PU=  1% MU= 17% MAL=2476457984 MMP=2427019264 SBK=49438720 TOT=2428368K RAM=2022672K SWP=516K"
00018569 PRG 2023-03-03 13:28:26.454  8229  8275 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.0 kw/s=0.0 bsy=0 [sdb] r/s=0.8 kr/s=16.2 w/s=7.3 kw/s=375.6 bsy=0 NIC: [eth0] rxp/s=3.3 rxk/s=0.8 txp/s=4.0 txk/s=1.3 rxerrs=0 rxdrps=0 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=98"
0001856A USR 2023-03-03 13:28:27.526  8229  8229 "ERROR: 4: thslavemain.cpp(573) : ThorSlave : Unexpected process termination (ep:10.1.0.83:20000)"
0001856B PRG 2023-03-03 13:28:27.526  8229  8229 "RoxieMemMgr: releasing heap"
0001856C PRG 2023-03-03 13:28:27.526  8229  8229 "Unregistering slave : 10.1.0.83:20100"
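
The overlap can be read from the lines above: entries from process 8229 continue until 13:28:27.526, after process 83896 has already opened its log file at 13:28:25.474. A small sketch for listing the distinct process IDs in such a log follows. It assumes the fifth whitespace-separated field is the PID, which is inferred from the lines above rather than taken from documentation, and the function name `log_pids` is hypothetical.

```shell
# Print the distinct values of the fifth field (assumed to be the PID),
# one per line in numeric order. Reads the named files, or stdin if none.
log_pids() {
    awk '{ print $5 }' "$@" | sort -n -u
}
```

Run against the excerpt above, this would list both 8229 and 83896. More than one PID within the same short time window suggests that two slaves were writing to the log at once.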

jakesmith, Mar 03 '23 14:03

@Michael-Gardner - see above comment - can you see why the change to kill by PID is seemingly failing to clear the processes? NB: it is logging the ${slavepids} in the init_thorslave log (also in the artifact).
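
One way to check this from the logged PIDs is to probe each one with signal 0 after the kill has been issued. This is a sketch only; `alive_pids` is a hypothetical helper and not part of init_thorslave.

```shell
# Print each given PID that still exists.
# kill -0 sends no signal; it only checks that the PID can be signalled,
# so a process owned by another user is reported as not alive here.
alive_pids() {
    local pid
    for pid in "$@"; do
        if kill -0 "${pid}" 2>/dev/null; then
            echo "${pid}"
        fi
    done
}
```

Something like `alive_pids ${slavepids}` run shortly after the stop would print any slaves that survived it. Empty output would mean none of the listed PIDs could still be signalled.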

jakesmith, Mar 03 '23 14:03