puppet-on-openwrt icon indicating copy to clipboard operation
puppet-on-openwrt copied to clipboard

puppet process not terminated when a puppet runs completes

Open m8ram opened this issue 10 years ago • 6 comments

I have disabled the puppet service and I am triggering puppet runs manually for now.

Even after the transaction is reported completed a ruby process remains:

root@OpenWrt:~# ps w|grep pup
 1686 root      1492 S    grep pup
root@OpenWrt:~# 
root@OpenWrt:~# puppet agent --server=puppet --onetime --debug
/usr/lib/ruby/1.9/yaml.rb:56:in `<top (required)>':
It seems your ruby installation is missing psych (for YAML output).
To eliminate this warning, please install libyaml and reinstall your ruby.
Debug: Failed to load library 'selinux' for feature 'selinux'
Debug: Using settings: adding file resource 'confdir': 'File[/etc/puppet]{:path=>"/etc/puppet", :ensure=>:directory, :loglevel=>:debug, :links=>:follow, :backup=>false}'
Debug: Failed to load library 'shadow' for feature 'libshadow'
Debug: Failed to load library 'ldap' for feature 'ldap'
...
Debug: /File[/etc/puppet/state/last_run_summary.yaml]: Autorequiring File[/etc/puppet/state]
Debug: /File[/etc/puppet/state/last_run_report.yaml]: Autorequiring File[/etc/puppet/state]
Debug: /File[/etc/puppet/state/graphs]: Autorequiring File[/etc/puppet/state]
Debug: /File[/var/log/puppet]/ensure: created
Debug: /File[/var/run/puppet]/ensure: created
Debug: Finishing transaction 8656680

root@OpenWrt:~# date; ps w|grep pup; ls -ltr /etc/puppet/state/
Thu Apr 24 21:01:42 CEST 2014
 1757 root     32900 S    ruby /usr/bin/puppet agent --server=puppet --onetime --debug
 1957 root     32900 S    ruby /usr/bin/puppet agent --server=puppet --onetime --debug
 1966 root      1492 S    grep pup
drwxr-xr-x    2 root     root          1024 Apr 20 17:21 graphs
-rw-rw----    1 root     root         27629 Apr 24 20:51 state.yaml
-rw-r--r--    1 root     root           432 Apr 24 20:51 last_run_summary.yaml
-rw-r-----    1 root     root         21197 Apr 24 20:51 last_run_report.yaml
-rw-r--r--    1 root     root             4 Apr 24 21:00 agent_catalog_run.lock
-rw-r-----    1 root     root           231 Apr 24 21:01 resources.txt
-rw-r-----    1 root     root            44 Apr 24 21:01 classes.txt

The lock file is removed several minutes later but even then one process is still running:

root@OpenWrt:~# date; ps w|grep pup; ls -ltr /etc/puppet/state/
Thu Apr 24 21:07:35 CEST 2014
 1957 root     32900 S    ruby /usr/bin/puppet agent --server=puppet --onetime --debug
 1997 root      1492 S    grep pup
drwxr-xr-x    2 root     root          1024 Apr 20 17:21 graphs
-rw-r-----    1 root     root           231 Apr 24 21:01 resources.txt
-rw-r-----    1 root     root            44 Apr 24 21:01 classes.txt
-rw-rw----    1 root     root         27629 Apr 24 21:06 state.yaml
-rw-r--r--    1 root     root           443 Apr 24 21:06 last_run_summary.yaml
-rw-r-----    1 root     root         21206 Apr 24 21:06 last_run_report.yaml

m8ram avatar Apr 24 '14 19:04 m8ram

Hmm. I'm not sure :(

Is there anything sensitive in the debug? If not, can you give me a full pastebin or gist?

Can you see if this happens with an empty catelog? Also does it happen when you just use something like:

puppet apply -e "notify { bla:}"

Lets see if we can narrow it down. Maybe an exec or something that gets orphaned?

solarkennedy avatar Apr 25 '14 03:04 solarkennedy

Hi,

I may have a clue how this happens. It looks like puppet has severe problems on fork. Any exec may leak a puppet instance....

I have a WNDR3800 where this happens all the time... I'm combining the following workaround:

  • Make retry and timeouts work (see https://tickets.puppetlabs.com/browse/PUP-2299 for details)
  • Patch puppet/util.rb to sleep for 0.5 seconds before running the target code block
  • Watching for puppet childs that run longer than 1 minute and auto-kill them with the following script (triggered by cron)
#!/bin/ash

OLDPIDS=$(cat /tmp/puppet.pids)
echo > /tmp/puppet.pids

PIDS=$(grep -H ruby /proc/[0-9]*/cmdline|sed 's:/proc/\([0-9]*\)/.*:\1:'|xargs echo)

for PID in $PIDS; do
  PPID=$(cat /proc/$PID/status |grep PPid|sed 's:[^0-9]*\([0-9]\)[^0-9]*:\1:g')
  if [ -d /proc/$PID ] && echo $PIDS | grep $PPID ; then
    if echo $OLDPIDS | grep $PID; then
      kill -9 $PID
    else
      echo $PID >> /tmp/puppet.pids
    fi
  fi
done

This makes it work ok-ish. I'm still getting zombies, but no full processes.

Those workaround do work somehow and I can 100% confirm that it is related to puppet/util.rb:safe_posix_fork (actual forking) and util/execution.rb:execute waiting for the pid to end (Process.waitpid2). The Process ruby doc has some hints (like setting WNOHANG), but I can't get it to work in a 100% reliable manner. I guess s.o. with better ruby knowledge could fix it??

rtreffer avatar May 05 '14 21:05 rtreffer

That said it should be possible to reproduce with a

export CAT=$(for i in $(seq 20); do echo "exec {'c$i': command => '/sbin/uci get network.loopback.ifname', timeout => '10'}" ; done) ; puppet apply -e "$CAT"

This will cause 20 execs and should trigger the problem.

rtreffer avatar May 05 '14 21:05 rtreffer

Gosh. I'm really not sure what to do about this.

As another workaround, does provider=>shell work better? Or does it just result in even more zombies?

solarkennedy avatar May 07 '14 21:05 solarkennedy

I found provider => shell to be worse....

@solarkennedy can you reproduce the problem?

I can not reproduce the problem on non-openwrt machines (I've just clocked my laptop at 800MHz and everything worked just fine)

I'm currently guessing that it might be related to uclibc... Google reveales several historic problems related to uclibc and forks.

Testing that might take quite some time, though....

rtreffer avatar May 20 '14 20:05 rtreffer

I've tried to reproduce this in a docker image, and failed, but that doesn't really mean much.

I can try to reproduce it on real hardware this weekend.

Even if I did, I'm not sure I could help actually solve this :(

solarkennedy avatar May 22 '14 15:05 solarkennedy