faasd
Stopping faasd and faasd-provider should stop all containers
Expected Behaviour
When the faasd and faasd-provider services are stopped (with systemctl), all containers should be stopped as well.
Current Behaviour
Some container tasks remain in RUNNING state.
Before stop, all tasks are running:
❯ sc-status faasd-provider
● faasd-provider.service - faasd-provider
Loaded: loaded (/lib/systemd/system/faasd-provider.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-03-09 08:55:47 EDT; 52s ago
Main PID: 15197 (faasd)
Tasks: 8 (limit: 4700)
Memory: 11.3M (limit: 500.0M)
CGroup: /system.slice/faasd-provider.service
└─15197 /usr/local/bin/faasd provider
❯ sc-status faasd
● faasd.service - faasd
Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-03-09 08:55:47 EDT; 54s ago
Main PID: 15189 (faasd)
Tasks: 10 (limit: 4700)
Memory: 14.5M (limit: 500.0M)
CGroup: /system.slice/faasd.service
└─15189 /usr/local/bin/faasd up
❯ sudo ctr task ls
TASK PID STATUS
basic-auth-plugin 15234 RUNNING
nats 15333 RUNNING
prometheus 15450 RUNNING
gateway 15577 RUNNING
queue-worker 15698 RUNNING
❯ sudo ctr -n openfaas-fn task ls
TASK PID STATUS
figlet 15957 RUNNING
After service stop:
❯ sc-stop faasd-provider
❯ sc-stop faasd
❯ sc-status faasd
● faasd.service - faasd
Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2020-03-09 08:57:58 EDT; 2min 46s ago
Process: 15189 ExecStart=/usr/local/bin/faasd up (code=exited, status=0/SUCCESS)
Main PID: 15189 (code=exited, status=0/SUCCESS)
❯ sc-status faasd-provider
● faasd-provider.service - faasd-provider
Loaded: loaded (/lib/systemd/system/faasd-provider.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2020-03-09 08:57:23 EDT; 3min 23s ago
Process: 15197 ExecStart=/usr/local/bin/faasd provider (code=killed, signal=TERM)
Main PID: 15197 (code=killed, signal=TERM)
❯ sudo ctr -n openfaas-fn task ls
TASK PID STATUS
figlet 15957 RUNNING
❯ sudo ctr task ls
TASK PID STATUS
prometheus 15450 STOPPED
gateway 15577 RUNNING
queue-worker 15698 RUNNING
As seen above, the deployed function (figlet) and the gateway and queue-worker containers still had their tasks running.
The stop logs are below:
Mar 09 09:16:51 debian10 faasd[16313]: 2020/03/09 09:16:51 Signal received.. shutting down server in 1s
Mar 09 09:16:51 debian10 faasd[16313]: 2020/03/09 09:16:51 [Delete] removing CNI network for: basic-auth-plugin
Mar 09 09:16:51 debian10 faasd[16313]: 2020/03/09 09:16:51 [Delete] removed: basic-auth-plugin from namespace: /proc/16360/ns/net, ID: basic-auth-plugin-16360
Mar 09 09:16:51 debian10 faasd[16313]: Status of basic-auth-plugin is: running
Mar 09 09:16:51 debian10 faasd[16313]: 2020/03/09 09:16:51 Need to kill basic-auth-plugin
Mar 09 09:16:51 debian10 faasd[16313]: 2020/03/09 09:16:51 [Delete] removing CNI network for: nats
Mar 09 09:16:52 debian10 faasd[16313]: 2020/03/09 09:16:52 [Delete] removed: nats from namespace: /proc/16460/ns/net, ID: nats-16460
Mar 09 09:16:52 debian10 faasd[16313]: Status of nats is: running
Mar 09 09:16:52 debian10 faasd[16313]: 2020/03/09 09:16:52 Need to kill nats
Mar 09 09:16:52 debian10 faasd[16313]: [1] 2020/03/09 13:16:52.145386 [INF] STREAM: Shutting down.
Mar 09 09:16:52 debian10 faasd[16313]: [1] 2020/03/09 13:16:52.145588 [INF] Server Exiting..
Mar 09 09:16:52 debian10 faasd[16313]: 2020/03/09 09:16:52 [Delete] removing CNI network for: prometheus
Mar 09 09:16:52 debian10 faasd[16313]: 2020/03/09 09:16:52 [Delete] removed: prometheus from namespace: /proc/16578/ns/net, ID: prometheus-16578
Mar 09 09:16:52 debian10 faasd[16313]: Status of prometheus is: running
Mar 09 09:16:52 debian10 faasd[16313]: 2020/03/09 09:16:52 Need to kill prometheus
Mar 09 09:16:52 debian10 faasd[16313]: level=warn ts=2020-03-09T13:16:52.417Z caller=main.go:501 msg="Received SIGTERM, exiting gracefully..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:526 msg="Stopping scrape discovery manager..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:540 msg="Stopping notify discovery manager..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:562 msg="Stopping scrape manager..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:522 msg="Scrape discovery manager stopped"
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:536 msg="Notify discovery manager stopped"
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=manager.go:814 component="rule manager" msg="Stopping rule manager..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=manager.go:820 component="rule manager" msg="Rule manager stopped"
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.417Z caller=main.go:556 msg="Scrape manager stopped"
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.418Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
Mar 09 09:16:52 debian10 faasd[16313]: level=info ts=2020-03-09T13:16:52.419Z caller=main.go:727 msg="Notifier manager stopped"
Mar 09 09:17:07 debian10 faasd[16313]: Disconnected from nats://nats:4222
Mar 09 09:17:07 debian10 faasd[16313]: Reconnect
Mar 09 09:17:07 debian10 faasd[16313]: Connect: nats://nats:4222
Mar 09 09:17:09 debian10 faasd[16313]: Reconnecting (1/120) to nats://nats:4222 failed
Mar 09 09:17:09 debian10 faasd[16313]: Waiting 2s before next try
Mar 09 09:17:10 debian10 faasd[16313]: 2020/03/09 13:17:10 Disconnected from nats://nats:4222
Mar 09 09:17:10 debian10 faasd[16313]: 2020/03/09 13:17:10 Reconnect
Mar 09 09:17:10 debian10 faasd[16313]: 2020/03/09 13:17:10 Connect: nats://nats:4222
Mar 09 09:17:11 debian10 faasd[16313]: Connect: nats://nats:4222
Mar 09 09:17:12 debian10 faasd[16313]: 2020/03/09 13:17:12 Reconnecting (1/60) to nats://nats:4222 failed
Mar 09 09:17:13 debian10 faasd[16313]: Reconnecting (2/120) to nats://nats:4222 failed
Mar 09 09:17:13 debian10 faasd[16313]: Waiting 4s before next try
Mar 09 09:17:14 debian10 faasd[16313]: 2020/03/09 13:17:14 Connect: nats://nats:4222
Mar 09 09:17:16 debian10 faasd[16313]: 2020/03/09 13:17:16 Reconnecting (2/60) to nats://nats:4222 failed
Mar 09 09:17:18 debian10 faasd[16313]: Connect: nats://nats:4222
Mar 09 09:17:20 debian10 faasd[16313]: Reconnecting (3/120) to nats://nats:4222 failed
Mar 09 09:17:20 debian10 faasd[16313]: Waiting 6s before next try
Mar 09 09:17:20 debian10 faasd[16313]: 2020/03/09 13:17:20 Connect: nats://nats:4222
Mar 09 09:17:22 debian10 faasd[16313]: error deleting container prometheus, prometheus, cannot delete running task prometheus: failed precondition
Mar 09 09:17:22 debian10 faasd[16313]: 2020/03/09 09:17:22 [proxy] Done received
Mar 09 09:17:22 debian10 faasd[16313]: 2020/03/09 13:17:22 Reconnecting (3/60) to nats://nats:4222 failed
Mar 09 09:17:23 debian10 systemd[1]: faasd.service: Succeeded.
Possible Solution
Steps to Reproduce (for bugs)
- Start faasd and faasd-provider
- Deploy a function
- Stop faasd and faasd-provider
- Check container tasks with sudo ctr -n openfaas-fn task ls and sudo ctr task ls.
Context
Your Environment
- OS and architecture: Linux debian10 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux
Latest faasd built from master.
Prometheus appears to be preventing the shutdown from completing. I changed:
File up.go (from line 91):
log.Printf("Signal received.. shutting down server in %s\n", shutdownTimeout.String())
err := supervisor.Remove(services)
if err != nil {
	fmt.Printf("Error removing services: %s\n", err)
}
And got:
Mar 09 09:30:36 debian10 faasd[18159]: Error removing services: error deleting container prometheus, prometheus, cannot delete running task prometheus: failed precondition
What is odd is that Prometheus, when stopped through faasd, never prints its last message, as it does when stopped manually with kill -TERM pid:
Mar 09 10:05:56 debian10 faasd[21350]: level=info ts=2020-03-09T14:05:56.616Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
Mar 09 10:05:56 debian10 faasd[21350]: level=info ts=2020-03-09T14:05:56.616Z caller=main.go:727 msg="Notifier manager stopped"
Mar 09 10:05:56 debian10 faasd[21350]: level=info ts=2020-03-09T14:05:56.616Z caller=main.go:739 msg="See you next time!"
One question that I believe @alexellis may need to weigh in on is how the provider should behave with respect to deployed functions.
Since faasd-provider does not keep state, if it is stopped and the functions are deleted, they would not be recreated when it starts again.
One option would be to only stop the functions (killing their tasks) but keep the containers, so that after a restart faasd-provider would scale them back to 1 on access.
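To make that option concrete, here is a rough in-memory sketch of the idea: stopping removes the running task but leaves the container record, and the first invocation after a restart starts a task again. The container and store types and the StopAll and Invoke methods are hypothetical names for this illustration only. None of this is existing faasd-provider code.

```go
package main

import "fmt"

// container is a hypothetical in-memory model, not a containerd type.
type container struct {
	name    string
	running bool // true while the container has a live task
}

// store models the function containers that survive a provider restart.
type store map[string]*container

// StopAll kills every task but keeps the container records.
func (s store) StopAll() {
	for _, c := range s {
		c.running = false
	}
}

// Invoke starts a task on demand when the container exists but has no task.
func (s store) Invoke(name string) error {
	c, ok := s[name]
	if !ok {
		return fmt.Errorf("function %q not found", name)
	}
	if !c.running {
		c.running = true // scale from zero back to one
	}
	return nil
}

func main() {
	s := store{"figlet": {name: "figlet", running: true}}
	s.StopAll()
	fmt.Println("running after stop:", s["figlet"].running)
	if err := s.Invoke("figlet"); err != nil {
		fmt.Println("invoke failed:", err)
	}
	fmt.Println("running after invoke:", s["figlet"].running)
}
```

The trade-off is that stopped containers remain on disk and visible to ctr, but no separate state store is needed to bring functions back.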
What do you think?
Why do you feel that this change is required? (I may not understand the problem well enough, I'm listening)
/lock: closed