
FIX: Curator action skipped when lastPID is reused by active process

Open petiepooo opened this issue 1 year ago • 2 comments

ISSUE: A long-running process happened to start on the PID last used by the curator delete action. Because of the simple check at https://github.com/Security-Onion-Solutions/securityonion/blob/c949101d0f77cecee5949411ceb8c8bf52ef9306/salt/curator/files/bin/so-curator-delete#L24 the so-curator-delete script saw that the /proc directory for that PID existed and assumed it was the prior action still running, so all further deletes triggered by space constraints stopped. Updates to curator.log for the action simply stopped, and Elastic index storage started growing without bound. Restarting so-curator did not fix anything, and most output from the launch scripts and cron jobs is redirected to /dev/null, which made for a challenging couple of hours of debugging.
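To make the failure mode concrete, here is a minimal sketch of a directory-existence check of the kind described above. It is not the actual line from so-curator-delete; the function name naive_pid_check is hypothetical, and the current shell's own PID stands in for the unrelated long-running process.

```shell
# Hypothetical stand-in for the kind of check described above: treat the
# previous run as still active if *any* process exists at the saved PID.
naive_pid_check() {
    [ -d "/proc/$1" ]
}

# $$ is this shell, which is certainly not the curator delete action,
# yet the check cannot tell the difference and reports it as running.
if naive_pid_check "$$"; then
    echo "PID $$ is treated as a still-running previous action"
fi
```

Any live process that lands on the stored PID satisfies this test, which is why the lock never clears until that process exits.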

SOLUTION: I would suggest replacing the linked line and its comment with this:

# if the process running at $lastPID is the same script as us, exit
sed -zne 2p /proc/$lastPID/cmdline 2>/dev/null | grep -q $0 && exit

GNU sed's -z allows null-terminated lines (which cmdline uses), so sed just prints the second line, which is the bash script name (first line is /bin/bash). Redirecting stderr avoids output if the proc dir doesn't exist (grep won't match empty output, so that's fine). $0 is the currently running script, so matching that verifies that the process running at the PID in the lockfile is actually the same script, not just any process (-q suppresses the output if it does match).
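The same idea can be wrapped in a small function for experimenting outside the script. This is only a sketch under a few assumptions: the name pid_runs_script is hypothetical, it uses tr to turn the NUL separators into newlines instead of relying on GNU sed's -z, and it compares the second field exactly rather than using grep's pattern match.

```shell
# Hypothetical helper: succeed only if the process at PID $1 was started
# as "<interpreter> <script>" with <script> equal to $2.
pid_runs_script() {
    # No readable cmdline means no such process (or no access to it).
    [ -r "/proc/$1/cmdline" ] || return 1
    # cmdline separates arguments with NUL bytes; turn them into newlines
    # and keep the second field, which is the script path for a plain
    # "/bin/bash /path/to/script" launch.
    second_arg=$(tr '\0' '\n' < "/proc/$1/cmdline" 2>/dev/null | sed -n 2p)
    [ "$second_arg" = "$2" ]
}
```

Inside the script this would be called along the lines of pid_runs_script "$lastPID" "$0" && exit. Like the one-liner above, it looks only at the second field, so a run started as bash -x /usr/sbin/so-curator-delete would not be recognized, because the second field is then -x.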

VALIDATION: To simulate the issue, just paste a long-running process's PID into /tmp/delete-pidLockFile, say by running

echo 1 | sudo tee /tmp/delete-pidLockFile

and then monitor /opt/so/log/curator/curator.log to see the delete action is no longer completing every five minutes when cron starts it. You can also run

sudo bash -x /usr/sbin/so-curator-delete

to see which lines of the script are run. Patching so-curator-delete as above should overcome that block and start running delete actions again. To test the matching works, you can run (as root)

so-curator-delete 2>run1 & sleep 0.1; so-curator-delete 2>run2

run1 should complete, but run2 should be blocked by the check. You can add an echo before the exit to help with validation. Or, for a more natural test, add a sleep before the docker call, wait for cron to launch the script, SIGSTOP the sleep, wait for cron's next launch and verify that it did not run the action, then SIGCONT the sleep so the first launch completes normally.

NOTES: I am running on v2.3.230 installed onto Ubuntu 20.04 Server LTS, but I verified the simple check still exists in the main branch and it is not OS dependent so I assume it exists in v2.4 as well.

This can happen to other curator actions, like close, closed-delete, and cluster-{close,delete,warm}. Please consider patching them at the same time.

petiepooo, May 30 '23 19:05

This may be an explanation for #7648 and other occasional disk usage woes. A reboot would clear /tmp and fix things, but we're not "The IT Crowd" here...

petiepooo, Jun 01 '23 18:06

This may be further complicated by the fact that rebooting may not actually clear /tmp under some circumstances. Because of this, we're considering taking /tmp and the PID out of the equation altogether by using pgrep to ensure only one instance can run at a time.
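One possible shape of the pgrep idea is sketched below. It is an illustration only, not the project's implementation: the function name another_instance_running is hypothetical, and it assumes a procps-style pgrep with the -f option (match against the full command line).

```shell
# Hypothetical helper: succeed if some process other than this shell has
# a command line matching the pattern in $1.
another_instance_running() {
    pidlist=$(mktemp) || return 2
    # Run pgrep as a direct child writing to a file. A pipeline or $(...)
    # around pgrep could fork a short-lived copy of this script that
    # pgrep would then report as a second instance.
    pgrep -f -- "$1" > "$pidlist" || true
    # Select every listed PID that is not our own ($$); grep -q succeeds
    # if at least one such line exists.
    grep -qvx -- "$$" "$pidlist"
    found=$?
    rm -f "$pidlist"
    return "$found"
}
```

A script could then start with something like another_instance_running "$0" && exit 0. Because pgrep -f treats its argument as a pattern matched against the whole command line, any process that merely mentions the script path (an editor with the file open, for example) would also count, so the pattern may need to be tightened in practice.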

In terms of priorities, all of our time and effort right now is focused on finishing up Security Onion 2.4, so I can't provide any estimates for when we might get around to working on this.

dougburks, Jun 01 '23 20:06