
pcp.env: _get_pids_by_name returns all pmcd pids on machine with containers

Open test-account-0 opened this issue 9 years ago • 6 comments

I'm using pmcd on machines that host containers which also run pmcd. These containers are based on OpenVZ and LXC.

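With the current version of /etc/pcp.env I cannot start, stop or restart pmcd on the host (and probably other PCP services), because _get_pids_by_name also returns the pids of the pmcd processes running inside the containers.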

As a workaround I have changed function _get_pids_by_name from:

_get_pids_by_name()
{
    if [ $# -ne 1 ]
    then
        echo "Usage: _get_pids_by_name process-name" >&2
        exit 1
    fi

    # Algorithm ... all ps(1) variants have a time of the form MM:SS
    # or HH:MM:SS or HH:MM.SS before the psargs field, so we're using
    # this as the search anchor.
    #
    # Matches with $1 (process-name) occur if the first psarg is $1
    # or ends in /$1 or starts ($1) (blame Mac OS X for the last one)
    # ... the matching uses sed's regular expressions, so passing a
    # regex into $1 will work.

    $PCP_PS_PROG $PCP_PS_ALL_FLAGS \
    | sed -n \
        -e 's/$/ /' \
        -e 's/[         ][      ]*/ /g' \
        -e 's/^ //' \
        -e 's/^[^ ]* //' \
        -e "/[0-9][:\.][0-9][0-9]  *[^ ]*\/$1 /s/ .*//p" \
        -e "/[0-9][:\.][0-9][0-9]  *$1 /s/ .*//p" \
        -e "/[0-9][:\.][0-9][0-9]  *($1)/s/ .*//p" \
    # end
}

to:

_get_pids_by_name()
{
    if [ $# -ne 1 ]
    then
        echo "Usage: _get_pids_by_name process-name" >&2
        exit 1
    fi

    # Algorithm ... all ps(1) variants have a time of the form MM:SS
    # or HH:MM:SS or HH:MM.SS before the psargs field, so we're using
    # this as the search anchor.
    #
    # Matches with $1 (process-name) occur if the first psarg is $1
    # or ends in /$1 ... the matching uses sed's regular expressions,
    # so passing a regex into $1 will work.

    pids=$($PCP_PS_PROG $PCP_PS_ALL_FLAGS \
    | sed -n \
        -e 's/$/ /' \
        -e 's/[         ][      ]*/ /g' \
        -e 's/^ //' \
        -e 's/^[^ ]* //' \
        -e "/[0-9][:\.][0-9][0-9]  *[^ ]*\/$1 /s/ .*//p" \
        -e "/[0-9][:\.][0-9][0-9]  *$1 /s/ .*//p")

    if [ -n "$pids" ]; then
        for pid in $pids; do
            ppid=$($PCP_PS_PROG -p "$pid" -o ppid --no-headers | awk '{print $1}')
            if [ "$ppid" = 1 ]; then
                echo "$pid"
            fi
        done
    fi
    unset pids
}

It echoes only those pids whose parent is pid 1. I'm not sure whether it is good or compatible enough to make a pull request. It works on wheezy, jessie, precise, trusty and xenial.

test-account-0 avatar Aug 02 '16 11:08 test-account-0
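A minimal sketch of the same parent-pid filter, written as a separate step so that _get_pids_by_name itself stays unchanged. The helper names _ppid_of and _filter_init_parented are hypothetical and not part of pcp.env. It uses the POSIX form `-o ppid=` (an empty column header suppresses the header line) rather than `--no-headers`, which is assumed here to be specific to procps ps.

```shell
# Hypothetical helper (not in pcp.env): print the parent pid of process $1,
# or nothing if the process no longer exists.
_ppid_of()
{
    # "-o ppid=" prints only the ppid column with no header line
    ps -o ppid= -p "$1" 2>/dev/null | tr -d ' '
}

# Hypothetical helper (not in pcp.env): read pids on stdin, one per line,
# and echo only those whose parent is pid 1.
_filter_init_parented()
{
    while read -r pid
    do
        [ -n "$pid" ] || continue
        if [ "$(_ppid_of "$pid")" = 1 ]
        then
            echo "$pid"
        fi
    done
}
```

Assuming _get_pids_by_name prints one pid per line, an init script could then run `_get_pids_by_name pmcd | _filter_init_parented`. Seen from the host, a pmcd inside a container is normally parented by that container's init, whose host-side pid is not 1, so it would be dropped by this filter.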

Can you explain why you're running pmcd servers inside multiple containers? One running on the host should be able to give intra-container data for those other systems (before too long, if not already).

fche avatar Aug 02 '16 17:08 fche

Because I want to monitor metrics from pcp with a nagios script. These are not docker containers but kind of "normal machines" (one can ssh to them, install stuff and such). I already have automation to set up services and monitoring per host. I haven't yet checked monitoring containers from outside with pcp, but it would look weird in nagios - all checks on one machine for all containers on this machine instead of on a specific container. And also it is a lot of work to change/write new puppet modules to make it happen automatically.

Apart from that, why does the function _get_pids_by_name return all pids of a given program and not only those whose parent is pid 1? Is it used anywhere other than in init scripts?

test-account-0 avatar Aug 03 '16 07:08 test-account-0

> Apart from that, why does the function _get_pids_by_name return all pids of a given program and not only those whose parent is pid 1? Is it used anywhere other than in init scripts?

It's used almost exclusively from the init scripts; the one corner case is that it's also used from a number of the test scripts. However, that could be resolved (e.g. via a separate shell function), and it's arguably desirable to restrict the set of processes the init scripts look up in this way too, as they are often being considered as targets for kill(1). In other words, maybe this change would be beneficial anyway.

@kmcdonell are there any other situations you know of where _get_pids_by_name() might not be looking for init-parented processes?

natoscott avatar Aug 04 '16 22:08 natoscott

gentle ping

test-account-0 avatar Nov 17 '16 15:11 test-account-0

Not quite 4 years late ... oops (looking at the issues in reverse order for a change). All of the pmcd, pmproxy, pmie and pmlogger processes will be parented by init/systemd and that's most of the use cases. But there are some other nasty ones that make this a non-starter, or at least mean it requires a clever plan ([tm] Black Adder):

  1. pmsignal
  2. pmcd's rc script when looking for PMDAs
  3. _pmda_cull in pmdaproc.sh, again looking for PMDAs
  4. pmlogger_daily checks for sh(1) (bizarre concurrent execution check)

kmcdonell avatar Jul 14 '20 07:07 kmcdonell

hm, I just ran into this one too with the PCP container (https://src.fedoraproject.org/container/pcp) - once I start the container, I cannot (re)start pmproxy on the host system, because the host system sees two pmproxies running. Restarting pmcd works fine however.

andreasgerstmayr avatar Jul 21 '20 19:07 andreasgerstmayr
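One possible alternative to the parent-pid test, sketched here under stated assumptions and not taken from PCP: on Linux kernels that expose /proc/<pid>/ns/pid, candidate pids could be compared against the caller's own PID namespace, which would also keep processes that are not parented by init (such as PMDAs started by pmcd). The helper names _pid_ns_of and _filter_same_pid_ns are hypothetical, and older OpenVZ kernels may not provide these /proc entries at all.

```shell
# Hypothetical helper (not in pcp.env): print the PID-namespace identifier
# of process $1 (Linux only). Prints nothing if /proc/<pid>/ns/pid is
# missing or not readable by the caller.
_pid_ns_of()
{
    readlink "/proc/$1/ns/pid" 2>/dev/null
}

# Hypothetical helper (not in pcp.env): read pids on stdin, one per line,
# and echo only those in the same PID namespace as the caller.
_filter_same_pid_ns()
{
    my_ns=$(_pid_ns_of self)
    while read -r pid
    do
        [ -n "$pid" ] || continue
        # If our own namespace cannot be determined, pass every pid through
        # unchanged. Otherwise a pid whose namespace link cannot be read
        # compares unequal and is dropped.
        if [ -z "$my_ns" ] || [ "$(_pid_ns_of "$pid")" = "$my_ns" ]
        then
            echo "$pid"
        fi
    done
}
```

Used as `_get_pids_by_name pmproxy | _filter_same_pid_ns` on the host, this would be expected to skip a pmproxy running inside a container that has its own PID namespace. Reading another process's namespace link generally requires sufficient privileges, which is assumed to hold for init scripts run as root.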