codechecker icon indicating copy to clipboard operation
codechecker copied to clipboard

CodeChecker server --stop does not stop the server cleanly

Open jimis opened this issue 4 years ago • 6 comments

I have written a systemd unit to start and stop CodeChecker. It contains the following lines:

Type=simple
ExecStart=/home/codechecker/codechecker/build/CodeChecker/bin/CodeChecker server --verbose info
ExecStop=/home/codechecker/codechecker/build/CodeChecker/bin/CodeChecker server --stop

Unfortunately most times I do systemctl stop CodeChecker I get errors in the logs:

systemd[1]: Stopping CodeChecker server...
CodeChecker[990]: [INFO 2021-11-23 13:14] - Shutting down the WEB server on [localhost:8001]
CodeChecker[1417]: ----------------------------------------
CodeChecker[1417]: Exception happened during processing of request from ('127.0.0.1', 58116)
CodeChecker[1417]: Traceback (most recent call last):
CodeChecker[1417]:   File "/usr/lib/python3.8/socketserver.py", line 316, in _handle_request_noblock
CodeChecker[1417]:     self.process_request(request, client_address)
CodeChecker[1417]:   File "/usr/lib/python3.8/socketserver.py", line 347, in process_request
CodeChecker[1417]:     self.finish_request(request, client_address)
CodeChecker[1417]:   File "/usr/lib/python3.8/socketserver.py", line 360, in finish_request
CodeChecker[1417]:     self.RequestHandlerClass(request, client_address, self)
CodeChecker[1417]:   File "/home/codechecker/codechecker/build/CodeChecker/lib/python3/codechecker_server/server.py", line 84, in __init__
CodeChecker[1417]:     BaseHTTPRequestHandler.__init__(self,
CodeChecker[1417]:   File "/usr/lib/python3.8/socketserver.py", line 747, in __init__
CodeChecker[1417]:     self.handle()
CodeChecker[1417]:   File "/usr/lib/python3.8/http/server.py", line 427, in handle
CodeChecker[1417]:     self.handle_one_request()
CodeChecker[1417]:   File "/usr/lib/python3.8/http/server.py", line 415, in handle_one_request
CodeChecker[1417]:     method()
CodeChecker[1417]:   File "/home/codechecker/codechecker/build/CodeChecker/lib/python3/codechecker_server/server.py", line 269, in do_GET
CodeChecker[1417]:     SimpleHTTPRequestHandler.do_GET(self)
CodeChecker[1417]:   File "/usr/lib/python3.8/http/server.py", line 654, in do_GET
CodeChecker[1417]:     self.copyfile(f, self.wfile)
CodeChecker[1417]:   File "/usr/lib/python3.8/http/server.py", line 853, in copyfile
CodeChecker[1417]:     shutil.copyfileobj(source, outputfile)
CodeChecker[1417]:   File "/usr/lib/python3.8/shutil.py", line 208, in copyfileobj
CodeChecker[1417]:     fdst_write(buf)
CodeChecker[1417]:   File "/usr/lib/python3.8/socketserver.py", line 826, in write
CodeChecker[1417]:     self._sock.sendall(b)
CodeChecker[1417]: BrokenPipeError: [Errno 32] Broken pipe
CodeChecker[1417]: ----------------------------------------
CodeChecker[1419]: ----------------------------------------
CodeChecker[1419]: Exception happened during processing of request from ('127.0.0.1', 58106)
CodeChecker[1419]: Traceback (most recent call last):
CodeChecker[1419]:   File "/usr/lib/python3.8/socketserver.py", line 316, in _handle_request_noblock
CodeChecker[1419]:     self.process_request(request, client_address)
CodeChecker[1419]:   File "/usr/lib/python3.8/socketserver.py", line 347, in process_request
CodeChecker[1419]:     self.finish_request(request, client_address)
CodeChecker[1419]:   File "/usr/lib/python3.8/socketserver.py", line 360, in finish_request
CodeChecker[1419]:     self.RequestHandlerClass(request, client_address, self)
CodeChecker[1419]:   File "/home/codechecker/codechecker/build/CodeChecker/lib/python3/codechecker_server/server.py", line 84, in __init__
CodeChecker[1419]:     BaseHTTPRequestHandler.__init__(self,
CodeChecker[1419]:   File "/usr/lib/python3.8/socketserver.py", line 747, in __init__
CodeChecker[1419]:     self.handle()
CodeChecker[1419]:   File "/usr/lib/python3.8/http/server.py", line 427, in handle
CodeChecker[1419]:     self.handle_one_request()
CodeChecker[1419]:   File "/usr/lib/python3.8/http/server.py", line 415, in handle_one_request
CodeChecker[1419]:     method()
CodeChecker[1419]:   File "/home/codechecker/codechecker/build/CodeChecker/lib/python3/codechecker_server/server.py", line 269, in do_GET
CodeChecker[1419]:     SimpleHTTPRequestHandler.do_GET(self)
CodeChecker[1419]:   File "/usr/lib/python3.8/http/server.py", line 654, in do_GET
CodeChecker[1419]:     self.copyfile(f, self.wfile)
CodeChecker[1419]:   File "/usr/lib/python3.8/http/server.py", line 853, in copyfile
CodeChecker[1419]:     shutil.copyfileobj(source, outputfile)
CodeChecker[1419]:   File "/usr/lib/python3.8/shutil.py", line 208, in copyfileobj
CodeChecker[1419]:     fdst_write(buf)
CodeChecker[1419]:   File "/usr/lib/python3.8/socketserver.py", line 826, in write
CodeChecker[1419]:     self._sock.sendall(b)
CodeChecker[1419]: BrokenPipeError: [Errno 32] Broken pipe
CodeChecker[1419]: ----------------------------------------
systemd[1]: CodeChecker.service: Main process exited, code=exited, status=143/n/a
CodeChecker[149175]: [INFO 2021-11-23 13:14] - Stopped CodeChecker server running on port 8001 in workspace /home/cc-server/.codechecker (PID: 990)
  • Why are the processes exiting uncleanly? Systemd manual mentions that the ExecStop command should be synchronous, i.e. all processes should have exited by the time it returns, otherwise they are killed. Is that the case?
  • Even if the processes die cleanly, the main process exit code is 143. Shouldn't it be zero?

jimis avatar Nov 23 '21 13:11 jimis

ExecStop behaves as intended. CodeChecker is a multiprocessing program. (Multithreading would invoke the Global Interpreter Lock which would mean the execution wouldn't actually be multithreaded...) When you tell it to stop, the main process will receive the appropriate signal, and will begin to destroy the process pool under it. But right when you do this, one of the children is still processing a request:

systemd[1]: Stopping CodeChecker server...
# Systemd received your command and begun the killing

CodeChecker[990]: [INFO 2021-11-23 13:14] - Shutting down the WEB server on [localhost:8001]
# CodeChecker's main instance received the kill signal

# Process 0 killed (clean)
# Process 1 killed (clean)
# ...

CodeChecker[1417]: Exception happened during processing of request from ('127.0.0.1', 58116)
# Process X killed unclean.

CodeChecker[1419]: Exception happened during processing of request from ('127.0.0.1', 58106)
# Process X+Y killed unclean. (Note different PID and the port it is trying to reply to is also different.)

# ... (more handlers dying cleanly)

systemd[1]: CodeChecker.service: Main process exited, code=exited, status=143/n/a
# The main process (PID 990) exited.

CodeChecker[149175]: [INFO 2021-11-23 13:14] - Stopped CodeChecker server running on port 8001 in workspace /home/cc-server/.codechecker (PID: 990)
# ^ This last line is the output of the "ExecStop" command that you called, not the running server's.

I'm not sure why it is BrokenPipe. Most programming languages have a specific exception type for threads/processes that were killed from the outside.


~~I also can't seem to figure out where 143 is coming from. Might be coincidence, but 143 - 32 (errno broken pipe) is actually 111, which would correspond to -17 (128+(-17) = 111), and signal 17 is SIGCHLD. But this seems too much of a coincidence.~~

Edit: Snip. Invalid reasoning.

whisperity avatar Dec 06 '21 14:12 whisperity

Alright, I figured out what 143 is. It's 128+15. So SIGTERM. From POSIX manuals:

Exit Status for Commands

The exit status of a command that terminated because it received a signal shall be reported as greater than 128.

I haven't tried writing a systemd script, but running manually on a clean state (not interrupting ongoing responses), the following happens.

Serverside

[INFO 2021-12-06 15:52] - Server waiting for client requests on [localhost:8001]
# Signal receives here.
[INFO 2021-12-06 15:52] - Shutting down the WEB server on [localhost:8001]
$ echo $?
143

in verbose log

[INFO][2021-12-06 15:58:57] {server} [23513] <140096969881408> - server.py:1048 signal_handler() - Shutting down the WEB server on [localhost:8001]

"Clientside" (from the invocation perspective of server --stop)

[INFO 2021-12-06 15:52] - Stopped CodeChecker server running on port 8001 in workspace /home/$USER/.codechecker (PID: 22587)
$ echo $?
0

whisperity avatar Dec 06 '21 14:12 whisperity

When you tell it to stop, the main process will receive the appropriate signal, and will begin to destroy the process pool under it.

Will it wait for the children to exit before exiting itself?

A typical termination model to imitate for graceful exit, would be that of Apache httpd. In prefork mode it also uses pool of processes. More in the apache docs. Unfortunately I don't have an apache instance running to check the daemon exit code but I would guess it's zero if all children exit cleanly upon the TERM request.

But right when you do this, one of the children is still processing a request

I am quite sure that I got the same error code on the server exit even when the server was not processing any request. I hope I'm not missing some background connection from my browser.

Alright, I figured out what 143 is. It's 128+15. So SIGTERM.

Aha! That might mean that CodeChecker server is not handling the TERM signal.

From POSIX manuals:

Exit Status for Commands

The exit status of a command that terminated because it received a signal shall be reported as greater than 128.

Link? This is probably for unhandled signals or ungraceful exit. If CodeChecker server receives the "stop" command from the client by installing a signal handler on the TERM signal, and waits for the subprocesses to exit or verifiably kills them, then it should exit(0).

It should exit with error if it fails to fulfill the requested task (killing its children), for example if one child process wouldn't die after 2 seconds and the server had to give up.

jimis avatar Dec 06 '21 17:12 jimis

Link?

https://tldp.org/LDP/abs/html/exitcodes.html

If CodeChecker server receives the "stop" command from the client by installing a signal handler on the TERM signal, and waits for the subprocesses to exit or verifiably kills them, then it should exit(0).

It is handling the signal, and exiting with the hand-crafted signal number:

https://github.com/Ericsson/codechecker/blob/b559dd33ce0a83dd890144ac7dbc8e5642208750/web/server/codechecker_server/server.py#L1041-L1055

whisperity avatar Dec 07 '21 09:12 whisperity

The behaviour was changed in #2205. In the comment message it describes the reason why it is changed.

In systemd service you can use the SuccessExitStatus settings to set this exit code as a successful termination and it will solve your problem: https://www.freedesktop.org/software/systemd/man/systemd.service.html#SuccessExitStatus=

[Service]
Type=simple
ExecStart=/home/codechecker/codechecker/build/CodeChecker/bin/CodeChecker server
ExecStop=/home/codechecker/codechecker/build/CodeChecker/bin/CodeChecker server --stop
SuccessExitStatus=143

csordasmarton avatar Dec 10 '21 09:12 csordasmarton

Thanks for the references. FWIW I am quite sure that if CodeChecker handles the TERM signal gracefully, it should exit(0). Any exit code different than zero, means error! Even according to the page you linked to, exit code 143 means "Fatal error signal 15". Running "codechecker stop" shouldn't cause a fatal error. :-)

Anyway I have updated my systemd unit file to include SuccessExitStatus=143. Feel free to close the issue.

EDIT: Found this question with the most upvoted answer referencing the POSIX standard and answering:

If your program receives a signal then successfully exits, this is considered a successful termination, so your program should return EXIT_SUCCESS.

jimis avatar Dec 21 '21 04:12 jimis