containerpilot icon indicating copy to clipboard operation
containerpilot copied to clipboard

documentation on stopTimeout field is confusing

Open JeromeGe opened this issue 8 years ago • 12 comments

containerpilot --version Version: 3.3.4 GitHash: f7d6212

lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04.3 LTS Release: 14.04 Codename: trusty

uname -a Linux 0743e8140205 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

{
  consul: "127.0.0.1:8500",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "test-stopTimeout",
      exec: "/root/test",
      stopTimeout: "120s",
      restarts: "unlimited",
    }
  ],
  control: {
    socket: "/var/run/containerpilot.socket"
  }
}

test.go

package main

import "fmt"
import "os"
import "os/signal"
import "syscall"
import "time"

func main() {
        sigs := make(chan os.Signal, 1)
        done := make(chan bool, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
        go func() {
                fmt.Println("start")
                sig := <-sigs
                fmt.Println(sig)
                for i := 0; i < 30; i++ {
                        time.Sleep(1 * time.Second)
                        fmt.Println(i)
                }
                done <- true
        }()
        <-done
        fmt.Println("exiting")
}
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Shutdown global}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopping control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: completed graceful shutdown of control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopped serving at /var/run/containerpilot.socket" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopping test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopped test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="killing all processes in 5 seconds" 
time="2017-08-16T18:02:03+08:00" level=debug msg=test-stopTimeout.term 
time="2017-08-16T18:02:03+08:00" level=debug msg="terminating command 'test-stopTimeout' at pid: 4746" 
time="2017-08-16T18:02:03+08:00" level=info msg=terminated job=test-stopTimeout 
time="2017-08-16T18:02:04+08:00" level=info msg=0 job=test-stopTimeout 
time="2017-08-16T18:02:05+08:00" level=info msg=1 job=test-stopTimeout 
time="2017-08-16T18:02:06+08:00" level=info msg=2 job=test-stopTimeout 
time="2017-08-16T18:02:07+08:00" level=info msg=3 job=test-stopTimeout 
time="2017-08-16T18:02:08+08:00" level=info msg="killing processes for job "test-stopTimeout"" 
time="2017-08-16T18:02:08+08:00" level=debug msg=test-stopTimeout.kill 
time="2017-08-16T18:02:08+08:00" level=debug msg="killing command 'test-stopTimeout' at pid: 4746"

The stopTimeout field is set to 120s, but the test-stopTimeout program is killed after 5 seconds test code: https://github.com/JeromeGe/test-containerpilot-stopTimeout

JeromeGe avatar Aug 16 '17 10:08 JeromeGe

I think there's a little work required around documenting use cases of stopTimeout.

    // these fields interact with 'when' behaviors (see below)
    timeout: "300s",
    stopTimeout: "10s",
    restarts: "unlimited",

@tgross should add to this but stopTimeout really matters when you want a grace period before a job throws it's stopping event (before a watcher job begins), not the time until it is stopped. The documentation only notes this with the following when example...

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

I'll try to get a better use case/example into the docs.

jwreagor avatar Aug 16 '17 20:08 jwreagor

Another way of putting this is that stopTimeout is the maximum amount of time a stopping job will wait for another job that might be watching for the stopping event.

But the job that's watching for the stopping event can take however long it wants to do that work.

So if you want to make sure that other job is going to timeout too, then you need to add the timeout field to that job like this:

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    timeout: "5s",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

tgross avatar Aug 16 '17 20:08 tgross

@tgross I need to set the job graceful shutdown time, 5 seconds is too small, is it supported?

JeromeGe avatar Aug 18 '17 02:08 JeromeGe

We need something like docker stop --time or kubectkubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

inetfuture avatar Aug 18 '17 02:08 inetfuture

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

If user want to guarantee some downstream job gets executed, they should set the kill timeout of upstream job explicitly.

inetfuture avatar Aug 18 '17 02:08 inetfuture

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

That will be problematic in a non-Docker environment though.

tgross avatar Aug 18 '17 12:08 tgross

Released in https://github.com/joyent/containerpilot/releases/tag/3.4.0

tgross avatar Aug 18 '17 14:08 tgross

We need something like docker stop --time or kubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

@inetfuture Some time next week I'll take a look and see what this type of feature would require.

jwreagor avatar Aug 19 '17 01:08 jwreagor

@inetfuture @JeromeGe I had a chance to check out the behavior around the default "5s" timeout and found that we already have a configuration option to change this. It appears ContainerPilot has a global stopTimeout setting across all jobs. I could be wrong but let me know if the following configuration achieves the functionality you're looking for.

Note the second line in the config below...

{
  consul: "localhost:8500",
  stopTimeout: "15",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "main-process",
      exec: "/ubuntu/test-daemon -p 5 -t 30",
      restarts: "unlimited",
    },
    {
      name: "main-process2",
      exec: "/ubuntu/test-daemon -p 2 -t 20",
      restarts: "unlimited",
    },
  ],
  control: {
    socket: "/tmp/cp-simple.socket",
  },
}

Using a similar test program as provided by this issue's original author, I've set my job exec processes to pause for shutdown after 30s and another after 20s. A global stopTimeout of "15" forces all job processes to terminate after 15s.

Here's the log output of this configuration's behavior.
DEBU[0000] loaded config: {"Discovery":{},"LogConfig":{"level":"DEBUG","format":"text","output":"stdout"},"StopTimeout":15,"Jobs":[{"Name":"main-process","Exec":"/ubuntu/test-daemon -p 5 -t 30","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited",
"StopTimeout":"","When":{"Frequency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null},{"Name":"main-process2","Exec":"/ubuntu/test-daemon -p 2 -t 20","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited","StopTimeout":"","When":{"Fr
equency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null}],"Watches":null,"Telemetry":null,"Control":{"SocketPath":"/tmp/cp-simple.socket"}}
DEBU[0000] control: initialized router for control server
DEBU[0000] control: listening to /tmp/cp-simple.socket
DEBU[0000] event: {Startup global}
DEBU[0000] main-process2.Run start
INFO[0000] control: serving at /tmp/cp-simple.socket
DEBU[0000] main-process.Run start
INFO[0002] 01                                            job=main-process2
INFO[0002] start w/ timeout 20                           job=main-process2
INFO[0005] 01234                                         job=main-process
INFO[0005] start w/ timeout 30                           job=main-process
^CDEBU[0007] event: {Shutdown global}
DEBU[0007] event: {Stopping main-process2}
DEBU[0007] event: {Stopped main-process2}
DEBU[0007] main-process2.term
DEBU[0007] terminating command 'main-process2' at pid: 60399
DEBU[0007] control: stopping control server
DEBU[0007] event: {Stopping main-process}
DEBU[0007] event: {Stopped main-process}
DEBU[0007] killing all processes in 15 seconds
DEBU[0007] main-process.term
DEBU[0007] terminating command 'main-process' at pid: 60400
DEBU[0007] control: completed graceful shutdown of control server
INFO[0007] terminated                                    job=main-process2
DEBU[0007] control: stopped serving at /tmp/cp-simple.socket
INFO[0007] terminated                                    job=main-process
INFO[0008] 0                                             job=main-process2
INFO[0008] 0                                             job=main-process
INFO[0009] 1                                             job=main-process
INFO[0009] 1                                             job=main-process2
INFO[0010] 2                                             job=main-process
INFO[0010] 2                                             job=main-process2
INFO[0011] 3                                             job=main-process2
INFO[0011] 3                                             job=main-process
INFO[0012] 4                                             job=main-process2
INFO[0012] 4                                             job=main-process
INFO[0013] 5                                             job=main-process2
INFO[0013] 5                                             job=main-process
INFO[0014] 6                                             job=main-process
INFO[0014] 6                                             job=main-process2
INFO[0015] 7                                             job=main-process
INFO[0015] 7                                             job=main-process2
INFO[0016] 8                                             job=main-process
INFO[0016] 8                                             job=main-process2
INFO[0017] 9                                             job=main-process
INFO[0017] 9                                             job=main-process2
INFO[0018] 10                                            job=main-process
INFO[0018] 10                                            job=main-process2
INFO[0019] 11                                            job=main-process
INFO[0019] 11                                            job=main-process2
INFO[0020] 12                                            job=main-process2
INFO[0020] 12                                            job=main-process
INFO[0021] 13                                            job=main-process
INFO[0021] 13                                            job=main-process2
INFO[0022] killing processes for job "main-process"
DEBU[0022] main-process.kill
DEBU[0022] killing command 'main-process' at pid: 60400
INFO[0022] killing processes for job "main-process2"
DEBU[0022] main-process2.kill
DEBU[0022] killing command 'main-process2' at pid: 60399

I'm re-opening this issue so I'll remember to update the documentation to reflect this use case at the global/application level.

jwreagor avatar Aug 20 '17 15:08 jwreagor

By our test, if we set the global stopTimeout to 120s, containerpilot won't exit until the timeout, even when all jobs have exited before the timeout. This behavior is not ideal, the timeout should be insurance for slowly-terminated jobs, not make containerpilot termination always that long.

inetfuture avatar Aug 21 '17 06:08 inetfuture

I agree that it's not ideal, but it's a bit of a tricky situation. At the point where the top-level stop timeout is hit, we're in the following condition:

  • all jobs have received the shutdown event
  • all job exec have received SIGTERM
  • all jobs have deregistered themselves from the event bus
  • the event bus has no more subscribers and so it says "ok I'm done"

So at this point, we should be able to exit immediately, but in practice SIGTERM can take an arbitrarily long amount of time to complete for any of the exec. (By way of example, if you're running a Consul agent with leaveOnTerminate set, then it actually has to complete a network request during shutdown!) There's no way at that point for the child processes to communicate back to us that SIGTERM has been completed.

I think we might be able to improve this slightly via https://github.com/joyent/containerpilot/issues/476.

tgross avatar Aug 21 '17 12:08 tgross

This is issue is still open because #476 (and PR #512) might bring about the necessary plumbing to support this. When working on #512 I did uncover the area where this could be applied however I never did anything after that. No promises, but worth exploring.

If I have the time I'll open a new proposal issue since this issue is more relevant to how CP works today.

jwreagor avatar Oct 05 '17 15:10 jwreagor