
V3 finish script not executed

okaerin opened this issue 2 years ago • 2 comments

I am trying to set up s6 to check preconditions (e.g. whether a port/socket is connectable), and if they are not met the container should fail/stop. To do so I would like to define a service, ideally a oneshot that retries n times. What I tried so far was setting up a longrun service with a finish script using s6-permafailon. Somehow that is never triggered. So either this is a bug, or it is not clear how to set up finish scripts. To reproduce the issue, see the Dockerfile:

FROM alpine
# configs
# ==> s6-rc.d/myapp/finish <==
# #!/bin/execlineb -P
# #did it fail 5 times in the last 2 seconds with an exit code between 1 and 100
# s6-permafailon 2 5 1-100 

# ==> s6-rc.d/myapp/run <==
# #!/command/execlineb -P
# #check if can connect to rabbit mq
# socat -u /dev/stdin tcp-connect:localhost:5672
# ==> s6-rc.d/myapp/type <==
# longrun

ARG S6_OVERLAY_VERSION=3.1.0.1

RUN apk update && apk add xz socat

ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-noarch.tar.xz /tmp
RUN tar -C / -Jxpf /tmp/s6-overlay-noarch.tar.xz
ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-x86_64.tar.xz /tmp
RUN tar -C / -Jxpf /tmp/s6-overlay-x86_64.tar.xz

#create service
RUN mkdir -p /etc/s6-overlay/s6-rc.d/myapp/
WORKDIR /etc/s6-overlay/s6-rc.d/myapp/

RUN echo -e '#!/bin/execlineb -P\n#did it fail 5 times in the last 2 seconds with an exit code between 1 and 100\ns6-permafailon 2 5 1-100' > finish

RUN echo -e '#!/command/execlineb -P\n#check if can connect to rabbit mq\nsocat -u /dev/stdin tcp-connect:localhost:5672\n' > run

RUN echo -e 'longrun' > type

#add to bundle
RUN mkdir -p /etc/s6-overlay/s6-rc.d/user/contents.d
WORKDIR /etc/s6-overlay/s6-rc.d/user/contents.d

RUN touch myapp

WORKDIR /
ENTRYPOINT ["/init"]

okaerin avatar Sep 14 '22 13:09 okaerin

  • I'm on vacation until the end of the month, so I don't have the infrastructure to reproduce at the moment - and please don't expect fast answers.
  • What exactly is never triggered? Is the service started, can you see it in the s6-rc logs?
  • Please note that version 3.1.2.1 is out and fixes some bugs.
  • s6-overlay installs s6-networking, so you don't have to install socat for your test. s6-tcpclient -4 localhost 5672 true should perform the exact same test.
  • Note that for the container to stop automatically, the CMD should fail. A failing supervised service, even in permanent failure mode, will not trigger a container shutdown. If you want a container shutdown, you need to
    • either have your CMD exit
    • or, if you have no CMD, write the container exit code you want to /run/s6-linux-init-container-results/exitcode then call halt.
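
The second option can be sketched in plain sh. This is an illustrative sketch only: the `RESULTS_DIR` and `HALT` parameters are made up here so the logic can be tried outside a container; in a real s6-overlay v3 container the paths are the fixed ones quoted above.

```shell
#!/bin/sh
# Sketch: write the desired container exit code, then ask s6-linux-init
# to halt. RESULTS_DIR and HALT default to the real s6-overlay v3 paths
# but are overridable purely for illustration/testing.
stop_container_with() {
  printf '%s\n' "$1" > "${RESULTS_DIR:-/run/s6-linux-init-container-results}/exitcode"
  "${HALT:-/run/s6/basedir/bin/halt}"
}
```

With no CMD, calling something like `stop_container_with 42` from inside the container would then make `docker run` exit with status 42.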

skarnet avatar Sep 14 '22 17:09 skarnet

Thank you for your prompt answer! Now the findings in the order of your comments:

  • No need to hurry
  • The service was started but the finish script was never executed.
  • I am now using 3.1.2.1
  • Thank you for the hint, but I will also have to check whether Unix sockets (e.g. rsyslogd's) are connectable, so that's why I am generally using socat
  • Thanks for the hint, I managed to halt and update the exit code of the container.

Now the only thing left is that, even though the service command fails, it reports that the service started successfully. I was assuming that if a service depends on another one and the dependency is not met, it won't start. I think I am missing something. Here is the most recent config:


==> s6-rc.d/user/contents.d/myapp <==

==> s6-rc.d/user/contents.d/secondary <==

==> s6-rc.d/myapp/type <==
longrun
==> s6-rc.d/myapp/run <==
#!/command/execlineb -P
#check if can connect to rabbit mq
socat -u /dev/null tcp-connect:localhost:5672

==> s6-rc.d/myapp/finish <==
#!/command/execlineb -S0
#read the failure count
backtick -D0 -E failcnt {
  pipeline { s6-svdt /run/service/myapp/ }
  awk "BEGIN{cnt=0}{if($3!=0)cnt++}END{print cnt}"
}

# s6-echo "myapp fail count $failcnt"
if -X { test $failcnt -gt 5 }
  foreground { redirfd -w 1 /run/s6-linux-init-container-results/exitcode echo 0 }
  /run/s6/basedir/bin/halt
==> s6-rc.d/secondary/type <==
oneshot
==> s6-rc.d/secondary/dependencies.d/myapp <==

==> s6-rc.d/secondary/up <==

And here the console output:

s6-rc: info: service myapp: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service myapp successfully started
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
2022/09/16 10:05:01 socat[32] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
s6-rc: info: service legacy-services successfully started
/ # 2022/09/16 10:05:02 socat[66] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
2022/09/16 10:05:03 socat[71] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
2022/09/16 10:05:04 socat[76] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
2022/09/16 10:05:05 socat[81] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
2022/09/16 10:05:06 socat[86] E connect(6, AF=2 127.0.0.1:5672, 16): Connection refused
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service myapp: stopping
s6-rc: info: service myapp successfully stopped
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
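
As an aside, the awk failure count from the finish script above can be sanity-checked standalone, outside execline. The sample input lines below are invented for illustration; they only mimic the shape of s6-svdt output (timestamp, event, exit code).

```shell
#!/bin/sh
# Count restart events whose third field (the exit code in s6-svdt-style
# output) is nonzero -- the same awk program the finish script uses.
count_failures() {
  awk 'BEGIN{cnt=0}{if($3!=0)cnt++}END{print cnt}'
}
```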

okaerin avatar Sep 16 '22 10:09 okaerin

I have the same issue: the down script of my oneshot service is not executed.

scuzhanglei avatar Oct 25 '22 08:10 scuzhanglei

@scuzhanglei You'll have to provide more context if you want help. I doubt you're having the exact same issue, given what I've outlined above; please open another issue with details of your problem.

@okaerin Sorry for not getting back to you sooner. So the thing with longruns is, once they're started and have reached readiness at least once, s6-rc considers that starting the service has been successful; if the service dies later on, it's a temporary error, the supervisor is supposed to restart it. And when the service doesn't define readiness (there's no notification-fd file in your service definition directory), it is considered ready as soon as it starts. So, in your case, myapp is considered successfully started as soon as the run script is executed, even if it fails later.

To prevent that, you should make sure myapp is only ready after socat successfully establishes a connection. I don't know how to do that with socat, so here is how I would do it:

==> s6-rc.d/myapp/notification-fd <==
3

==> s6-rc.d/myapp/run <==
#!/command/execlineb -P
redirfd -r 0 /dev/null
redirfd -w 1 /dev/null
s6-tcpclient -DRHl0 localhost 5672
if { fdmove 1 3 s6-echo }
fdclose 3
s6-ioconnect -67

The fdmove 1 3 s6-echo line does it: once s6-tcpclient has established a connection to localhost:5672, it writes a line to fd 3, signaling the supervisor that the service is ready. Then fd 3 is closed and s6-ioconnect maintains a connection between /dev/null and localhost:5672.
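
The readiness handshake itself is simple enough to sketch in plain sh (a toy illustration, not a replacement for the execline script above): the supervisor holds the read end of the fd named in notification-fd, and the service writes a line to that fd once it is ready.

```shell
#!/bin/sh
# Toy sketch of the s6 readiness protocol: write any line to the
# notification fd (3 here, matching notification-fd), then close it.
notify_ready() {
  echo ready >&3   # the supervisor treats a line on fd 3 as "ready"
  exec 3>&-        # close the fd; nothing more should be written to it
}
```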

Note that all this is a pretty expensive way to check for RabbitMQ readiness. If RabbitMQ is started in this container, then you should write a readiness script for it instead. If it is started in another container, then you should probably have a policy that says this container will not start before RabbitMQ is ready, and have a readiness checker outside of this container.
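
Circling back to the original goal of a "oneshot which retries n times": one way to express an external readiness checker is a generic retry wrapper around any check command. This is a sketch; the check command is whatever you would use in practice (e.g. `s6-tcpclient -4 localhost 5672 true`, or a socat invocation for Unix sockets).

```shell
#!/bin/sh
# Sketch: run a check command up to $2 times, sleeping $3 seconds between
# attempts; succeed as soon as the check does, fail if every attempt fails.
retry() {
  check=$1 tries=${2:-5} delay=${3:-0}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $check; then return 0; fi   # word-splitting of $check is intentional
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A precondition script could then let the exit code of `retry "some-check-command" 5 1` decide whether startup proceeds.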

skarnet avatar Oct 25 '22 12:10 skarnet

Thanks for your reply; here are more details. cloud-hypervisor is a longrun service, and virt-prerunner is a oneshot service that depends on cloud-hypervisor. I want the container to exit if any error happens in virt-prerunner. To do this, I added a down script for virt-prerunner that stops the container via /run/s6/basedir/bin/halt, but as far as I can see, the down script is not executed: whether virt-prerunner stopped with an error or not, the container keeps running.

longrun service

/ # cat /etc/s6-overlay/s6-rc.d/cloud-hypervisor/type
longrun
/ # cat /etc/s6-overlay/s6-rc.d/cloud-hypervisor/run
#!/usr/bin/execlineb -P

cloud-hypervisor --api-socket /var/run/ch.sock
/ # cat /etc/s6-overlay/s6-rc.d/cloud-hypervisor/finish
#!/bin/sh

if test "$1" -eq 256 ; then
  e=$((128 + $2))
else
  e="$1"
fi

echo "$e" > /run/s6-linux-init-container-results/exitcode

/run/s6/basedir/bin/halt

oneshot service

/ # cat /etc/s6-overlay/s6-rc.d/virt-prerunner/type
oneshot
/ # cat /etc/s6-overlay/s6-rc.d/virt-prerunner/up
none-exists-command
/ # cat /etc/s6-overlay/s6-rc.d/virt-prerunner/down
/run/s6/basedir/bin/halt
/ # cat /etc/s6-overlay/s6-rc.d/virt-prerunner/dependencies.d/cloud-hypervisor

log

# s6-rc: info: service cloud-hypervisor: starting
# s6-rc: info: service s6rc-oneshot-runner: starting
# s6-rc: info: service cloud-hypervisor successfully started
# s6-rc: info: service s6rc-oneshot-runner successfully started
# s6-rc: info: service fix-attrs: starting
# s6-rc: info: service virt-prerunner: starting
# s6-rc-oneshot-run: fatal: unable to exec none-exists-command: No such file or directory
# s6-rc: warning: unable to start service virt-prerunner: command exited 127
# s6-rc: info: service fix-attrs successfully started
# s6-rc: info: service legacy-cont-init: starting
# s6-rc: info: service legacy-cont-init successfully started

scuzhanglei avatar Oct 26 '22 02:10 scuzhanglei

When the up script for virt-prerunner fails, the service is not started, so it's normal that the down script is never executed.

down scripts are only executed when a service that has been started is being stopped.

As an aside, do not call /run/s6/basedir/bin/halt in a down script. When down scripts are run, it means that the container is already in the process of stopping.

skarnet avatar Oct 26 '22 08:10 skarnet

ENV S6_BEHAVIOUR_IF_STAGE2_FAILS=2

skarnet avatar Oct 26 '22 09:10 skarnet

thanks, it works now.

scuzhanglei avatar Oct 26 '22 09:10 scuzhanglei

@skarnet thanks, it works with the readiness notification

okaerin avatar Oct 31 '22 09:10 okaerin