
Graceful container stopping

Open jest opened this issue 6 years ago • 32 comments

On Linux, it is not possible to gracefully stop the container using the Docker CLI. Running docker stop causes the daemon to send a TERM signal to the container process, which ignores it; only a KILL signal stops the server. That stop is abrupt, however, and the next time the container is started it rolls the logs forward.

However, I noticed that the main container process forks additional sqlservr processes, and if I send a TERM signal to one of those processes, the whole container shuts down gracefully and immediately, and no log replay is performed on the next startup.

It looks like a problem with process and signal management.
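
A minimal sketch of the observation above, for anyone who wants to reproduce it. It assumes that docker top prints its default ps -ef style columns (PID in column 2, PPID in column 3). The helper name children_of is hypothetical, and the PIDs it prints are host PIDs, so signalling them needs sufficient privileges on the Docker host.

```shell
# children_of PARENT_PID
# Reads ps -ef style rows on stdin (header line first) and prints the PID
# of every process whose parent is PARENT_PID.
children_of() {
    awk -v parent="$1" 'NR > 1 && $3 == parent { print $2 }'
}

# Hypothetical usage on the Docker host (placeholders in angle brackets):
#   docker top <container-name> | children_of <main-sqlservr-pid>
#   kill -TERM <one-of-the-printed-pids>
```

If the container then exits within a second or so, the TERM signal itself works and the problem lies in how it reaches the server from PID 1.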

jest avatar Oct 03 '17 15:10 jest

+1

sokomishalov avatar Oct 03 '17 18:10 sokomishalov

I got stuck too. Here is what I did:

version: '3'
services:
    db:
        image: microsoft/mssql-server-linux
        environment:
            ACCEPT_EULA: Y
            SA_PASSWORD: "xyz"

This runs successfully. I then published the port on my host:

#...
        ports:
            - "1433:1433"

And updated the container:

$ docker-compose up -d db
ERROR: for my_db_1  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Now it can be neither stopped nor killed:

$ docker-compose stop db
# Same timeout error

$ docker-compose kill db
# Same timeout error

$ docker-compose kill -s TERM db
# Gets stuck

I can't even stop the docker service anymore (I had to restart the computer). Current version:

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:04:27 2017
 OS/Arch:      linux/amd64

Glideh avatar Oct 26 '17 08:10 Glideh

It worked after a computer restart but happened again after a docker-compose up refresh. A running mssql container doesn't seem to like being updated with up.

Glideh avatar Oct 27 '17 13:10 Glideh

I've been having this issue too, the container freezes and a computer restart is the only way to stop it.

Symbianx avatar Nov 07 '17 12:11 Symbianx

+1 Same here

edsonmedina avatar Nov 14 '17 17:11 edsonmedina

+1 Same here.

helenwilliamson avatar Nov 15 '17 09:11 helenwilliamson

+1 same here

akas84bg avatar Nov 20 '17 10:11 akas84bg

We think what's going on here is that SQL Server is gracefully shutting down so that when it starts back up there is no recovery time needed. We could change it to a fast shut down on CTRL+C but that could mean a longer period of time for recovery on start up depending on what was going on in the database(s) prior to the CTRL+C.

I used to run into this issue too, but then I stopped running docker-compose up and docker run interactively. In other words use -d on docker run and docker-compose up so that the containers are always started in the background and you can get your terminal prompt back. Then you can stop your containers with docker stop or docker-compose down.

Also, when you press CTRL+C in an interactive docker-compose up session, you can hit CTRL+C again to force an immediate stop. I don't recommend that in anything except a dev/test environment where you just don't care.

A few questions to better help us understand how to improve here:

  • How long are you waiting after CTRL+C before you do a more drastic shutdown?
  • Is there a lot of activity going on in the database before you CTRL+C - like a perf/scale test run or something like that?
  • Do you have a general preference of fast shutdown/slow startup vs. slow shutdown/fast startup?

twright-msft avatar Nov 20 '17 20:11 twright-msft

Sorry, but this is not a graceful shutdown. No matter how much time is given with docker stop -t time, SQL Server never stops within this time period.
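
One way to check whether a stop was graceful or forced is the container's exit status. The sketch below relies on the common convention that a process terminated by signal N is reported with status 128 + N, so 137 means KILL and 143 means TERM. The helper name describe_exit is hypothetical, and the thread does not confirm which status sqlservr reports after a clean shutdown.

```shell
# describe_exit STATUS
# Interprets a container exit status using the 128 + signal-number convention.
describe_exit() {
    if [ "$1" -eq 0 ]; then
        echo "clean exit"
    elif [ "$1" -gt 128 ]; then
        echo "terminated by signal $(( $1 - 128 ))"
    else
        echo "exited with status $1"
    fi
}

# Hypothetical usage after docker stop -t <seconds> <container-name>:
#   describe_exit "$(docker inspect -f '{{ .State.ExitCode }}' <container-name>)"
```

A result of "terminated by signal 9" after the timeout expires would match the behaviour described here: TERM was ignored and Docker fell back to KILL.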

It has nothing to do with CTRL+C, I never used it. My containers are started with docker-compose up -d and stopped with docker-compose stop -t time.

Please also note what I wrote about sending a TERM signal to one of the forked processes: it leads to a gracefully stopped container within 1 second! If I were to guess, I'd start by checking that signals are handled correctly in the parent process.

jest avatar Nov 20 '17 22:11 jest

I didn't use Ctrl+C either (the first one is supposed to shut down gracefully anyway). I'm also using docker-compose stop (after up -d) with really nothing big going on in the database (tested with only one database and 4 empty tables, actually).

@twright-msft I used to start/stop many different services (like nginx, apache, php, python, mysql, postgresql, redis, memcached, etc...), they always stop gracefully within 4sec max. I might still have a preference for the fast startup/slow shutdown but the slow should be within 4sec. Anyway, as @jest says, sometimes it never stops, I already tried leaving the graceful stop running for at least 30min.

Glideh avatar Nov 21 '17 02:11 Glideh

I've seen both cases, both intermittently. Sometimes they work, sometimes they don't.

Using CTRL+C just hangs forever (I've waited more than 10 minutes) and the container becomes unresponsive (can't exec into it).

Using docker-compose stop/down instead gives me a timeout. The container never dies.

This makes it useless until it's fixed.

edsonmedina avatar Nov 21 '17 10:11 edsonmedina

OK, I dug a bit and here's a solution.

The problem is this line in the Dockerfile:

CMD /opt/mssql/bin/sqlservr

According to the Docker docs, this "shell form" of CMD causes the Docker daemon to run the container with the command:

/bin/sh -c /opt/mssql/bin/sqlservr

This makes the shell (/bin/sh) the "PID 1" process, which causes a lot of problems, including with signal handling and the reaping of child processes. The issue on tini describes it pretty well.

The solution is to modify the Dockerfile and either make sqlservr "PID 1" itself by using the exec form of CMD:

CMD ["/opt/mssql/bin/sqlservr"]

or better yet, to use some other "process manager", like the mentioned tini:

# with tini next to Dockerfile...
COPY tini /
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]
CMD ["/opt/mssql/bin/sqlservr"]

As a workaround until new images are available, use command: [ "/opt/mssql/bin/sqlservr" ] in your docker-compose.yml to override the image's CMD.
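
The workaround can be sketched as a small compose override file, written here from the shell. This is an illustration only: the service name db is borrowed from the compose example earlier in the thread, and the function name write_compose_override is hypothetical. docker-compose merges a docker-compose.override.yml that sits next to docker-compose.yml automatically.

```shell
# write_compose_override PATH
# Writes a compose override that replaces the image's shell-form CMD with the
# exec form, so that sqlservr runs without /bin/sh -c in front of it.
write_compose_override() {
    cat > "$1" <<'EOF'
version: '3'
services:
    db:
        command: [ "/opt/mssql/bin/sqlservr" ]
EOF
}

# Hypothetical usage, run from the directory that holds docker-compose.yml:
#   write_compose_override docker-compose.override.yml
#   docker-compose up -d --force-recreate db
```

The container has to be recreated for the new command to take effect, because the command of an existing container cannot be changed.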

jest avatar Nov 21 '17 19:11 jest

@twright-msft Any idea how this will be solved? Do you need a PR?

jest avatar Nov 24 '17 12:11 jest

Any news on this? We're using the container for testing in a CI pipeline and have to restart our server practically every day because of this. Neither overriding the command with CMD ["/opt/mssql/bin/sqlservr"] nor adding tini as suggested helps with the problem.

woylie avatar Jan 23 '18 07:01 woylie

We're likely going to switch to CMD ["/opt/mssql/bin/sqlservr"] in a near-future release. We'll see if that helps fix it for at least some people.

twright-msft avatar Jan 23 '18 07:01 twright-msft

Well, for us it didn't. Any more ideas?

woylie avatar Jan 23 '18 07:01 woylie

Probably a different issue?

jest avatar Jan 23 '18 09:01 jest

The workaround command: [ "/opt/mssql/bin/sqlservr" ] did not work for me either.

simdevmon avatar Jan 31 '18 12:01 simdevmon

I use the following workaround in our CI environment:

  • I use the option command: [ "/opt/mssql/bin/sqlservr" ] in docker compose
  • I just kill the process before I call docker-compose stop: docker exec <container-name> kill 1 || :

simdevmon avatar Feb 01 '18 08:02 simdevmon

Did you destroy the old containers and create new ones with the command: workaround? Once created, a container can't change the command it executes. What does docker inspect -f '{{ .Config.Cmd }}' <container-name> say?
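
As a rough aid for reading that output: a shell-form CMD is stored as /bin/sh, -c and the command string, so the template prints it as [/bin/sh -c ...], while an exec-form CMD prints only the program and its arguments. The helper name cmd_form below is hypothetical.

```shell
# cmd_form CMD_STRING
# Classifies the text printed by docker inspect -f '{{ .Config.Cmd }}'.
cmd_form() {
    case "$1" in
        "[/bin/sh -c "*) echo "shell" ;;  # sqlservr is started through /bin/sh
        *)               echo "exec" ;;   # sqlservr is started directly
    esac
}

# Hypothetical usage:
#   cmd_form "$(docker inspect -f '{{ .Config.Cmd }}' <container-name>)"
```

A result of "shell" would mean the container still has the image's original CMD and needs to be recreated with the override.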

jest avatar Feb 01 '18 09:02 jest

@jest The output is [/opt/mssql/bin/sqlservr]

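And yes, since it is only a CI environment, I destroy everything completely on each build:

# Send TERM to PID 1 in the container; "|| :" ignores the error if it is not running
docker exec <mssql-container-name> kill 1 || :
docker-compose stop
# Remove the stopped containers so the next build starts from scratch
docker-compose rm -f
docker-compose build
docker-compose up -d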

simdevmon avatar Feb 01 '18 09:02 simdevmon

We started running into issues with the MS SQL Server containers hanging around on our Jenkins instance after builds completed (or didn't). It eventually got bad enough that the servers would lock up and de-provisioning them would take up to 30 minutes.

The solution for killing process 1 seems to solve the issue for us: https://github.com/Microsoft/mssql-docker/issues/171#issuecomment-362193062

kevin-brown avatar Feb 28 '18 03:02 kevin-brown

Update: overriding the command within a Dockerfile, or through specifying it when running, did not solve the problem of zombie processes and MS SQL Server.

We are seeing a problem very similar to #181, which has the same behaviour as the issue described in this ticket, after using a SQL Server instance (CU2, CU4, GA tested) for a short period of time and then trying to shut it down. I'd put the odds of it hanging at 50/50 every time we spin up a new container. Sending the TERM or KILL signals to the container or to the sqlservr processes does not solve the issue for us; the processes refuse to die unless the system is de-provisioned.

Note that we are not using Docker Compose on our build servers, and we are seeing this issue when running the containers through the Docker engine directly.

kevin-brown avatar Mar 08 '18 01:03 kevin-brown

@kevin-brown So this issue is not the one you are experiencing. This issue is about the image's incorrect CMD construction, where signals are not propagated to child processes.

Sending signals directly to child processes is the same as correcting CMD in Dockerfile.

jest avatar Mar 08 '18 11:03 jest

@kevin-brown we are facing probably the same issue and we use tini but no luck. Do you believe that "-g" option on tini to kill the whole process group could make a difference? We are going to try it

hdimitriou avatar Mar 14 '18 12:03 hdimitriou

So this issue is not the one you are experiencing. This issue is about the image's incorrect CMD construction, where signals are not propagated to child processes.

We're seeing signs of the signals not propagating both when we send them to the Docker containers and when we attempt to send them directly to the process. The behaviour we're seeing in #181 makes it really difficult to verify that the signals are making it to sqlservr, because if it hangs for too long it completely locks up Docker and the host system.

I'm willing to accept that there are two different issues at play in #171 and #181, but the fact that both of them deal with zombie processes forming within the container gives me hope that there may be a common solution to both issues.

we are facing probably the same issue and we use tini but no luck. Do you believe that "-g" option on tini to kill the whole process group could make a difference? We are going to try it

We have not yet tried using tini to work around this issue, but if you're not currently killing the right process (but instead are killing a parent process) that might work.

kevin-brown avatar Mar 19 '18 02:03 kevin-brown

Anyone having problems with CTRL+C that are not solved by correcting ENTRYPOINT (as described in comment https://github.com/Microsoft/mssql-docker/issues/171#issuecomment-346133376), please test 2017-CU5. According to https://support.microsoft.com/en-us/help/4093805/fix-can-t-stop-sql-server-linux-docker-container-via-docker-stop it's solved there.

jest avatar Apr 16 '18 09:04 jest

@jest CU5 seems to fix this for me. But with CU6 the same problem occurs again.

jschaefer-pott avatar Sep 19 '18 05:09 jschaefer-pott

I am using CU12 and I am seeing the same issue.

kichalla avatar Apr 23 '19 18:04 kichalla

I've faced some issues with this as well. I have been using a version which does not spawn mssql inside a shell (i.e. a sufficiently recent version that contains addd8374e7ff488a916e4ed1ec634b364b649209), but I still can't shut down the container. docker kill hangs and I can't even restart the daemon; I can only restart the machine.

The logs indicate that a signal was received, but it apparently entered some weird state afterwards.

[...]
2019-06-18 09:05:39.68 spid6s      Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
2019-06-18 09:05:39.68 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
2019-06-18 09:05:39.78 spid22s     Service Broker manager has shut down.
2019-06-18 09:05:43.43 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:05:43.43 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]
2019-06-18 09:05:48.61 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:05:48.61 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]
2019-06-18 09:10:53.49 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:10:53.49 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]

A normal shutdown, by contrast, looks like the following:

[...]
2019-06-18 10:33:05.68 spid6s      Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
2019-06-18 10:33:05.68 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
2019-06-18 10:33:06.11 spid23s     Service Broker manager has shut down.
2019-06-18 10:33:11.29 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

badeball avatar Jun 18 '19 10:06 badeball