nng Can we provide a way for applications to know when system calls were interrupted (the EINTR case)

Some applications would like to catch SIGINT to interrupt a blocking nng_recv() call, but if a user installs a signal handler for SIGINT, nng_recv() never returns. I haven't determined the exact cause yet, but from running a process under gdb my concern is that it's due to a pthread_cond_wait, which according to POSIX cannot return EINTR, so we can't just let the EINTR percolate up. This is causing a problem in pynng (https://github.com/codypiersall/pynng/issues/49): the Python runtime catches SIGINT, so the user can't press Ctrl+C to make pynng stop whatever it's doing.

I'm really not sure what the best way forward here is. I wouldn't propose changing the default behavior for nng_recv or nng_send, but I would love for them to grow a flag for something like NNG_NORETRY_SIGINT (the name doesn't matter to me), which would allow nng_recv() and nng_send() to return NNG_EINTR.

I'll work on an implementation, but I'm not confident the approach I plan to try is a good one. My idea is that if the pthread_cond_wait wakes up spuriously, I'll return NNG_EINTR and let that percolate up, but I'm pretty sure you can't tell the difference between a spurious wakeup and another thread just beating you to setting the flag on the condition variable.

Any better ideas are welcome!

Reproducing snippet:

#include <assert.h>
#include <signal.h>
#include <stdio.h>

#include <nng.h>
#include <protocol/pair0/pair.h>

void sig_handler(int signo)
{
  if (signo == SIGINT)
    printf("received SIGINT\n");
}

int main() {
    size_t msg_size;
    void *msg;
    nng_socket sock;
    if (signal(SIGINT, sig_handler) == SIG_ERR)
        printf("\ncan't catch SIGINT\n");
    assert(nng_pair0_open(&sock) == 0);
    assert(nng_recv(sock, &msg, &msg_size, NNG_FLAG_ALLOC) == 0);
    return 0;
}

Backtrace after interrupting process:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000417f6b in nni_pthread_cond_wait (c=0x655aa0, m=0x655a78) at ../src/platform/posix/posix_thread.c:120
#2  0x00000000004180d5 in nni_plat_cv_wait (cv=0x655aa0) at ../src/platform/posix/posix_thread.c:181
#3  0x0000000000415437 in nni_cv_wait (cv=0x655aa0) at ../src/core/thread.c:51
#4  0x0000000000415166 in nni_task_wait (task=0x655a40) at ../src/core/taskq.c:216
#5  0x0000000000407526 in nni_aio_wait (aio=0x655860) at ../src/core/aio.c:350
#6  0x0000000000406cd5 in nng_aio_wait (aio=0x655860) at ../src/nng.c:1157
#7  0x0000000000402a74 in nng_recvmsg (s=..., msgp=0x7fffffffd9c8, flags=0) at ../src/nng.c:137
#8  0x00000000004028ab in nng_recv (s=..., buf=0x7fffffffda10, szp=0x7fffffffda08, flags=1) at ../src/nng.c:93
#9  0x000000000040272a in main () at something.c:21

Oct 26 '19 18:10 codypiersall

You’re going about this the wrong way.

What you should do is probably set a flag to wake up a thread in your user program that just closes the socket in question.

We don’t want to force user apps to deal with interrupted system calls and really there is no portable way to do that. If you’re just trying to shut down properly then close the socket.

Closing the socket will wake any threads blocked in receive or send operations.
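To make the "close wakes blocked receivers" idea concrete, here is a toy model using a mutex and condition variable. It is not nng's implementation: `toy_sock`, `toy_recv`, `toy_close`, `closer_thread`, and `TOY_ECLOSED` are hypothetical names. It only shows why a close from another thread can unblock a receiver that a signal cannot.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a socket; NOT nng's real data structure. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool            has_msg;
    bool            closed;
    int             msg;
} toy_sock;

#define TOY_ECLOSED (-1) /* hypothetical error code */

static void toy_init(toy_sock *s)
{
    pthread_mutex_init(&s->mtx, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->has_msg = false;
    s->closed  = false;
    s->msg     = 0;
}

/* Blocks until a message arrives or the socket is closed. */
static int toy_recv(toy_sock *s, int *out)
{
    int rv = 0;

    pthread_mutex_lock(&s->mtx);
    while (!s->has_msg && !s->closed) {
        pthread_cond_wait(&s->cv, &s->mtx);
    }
    if (s->has_msg) {
        *out       = s->msg;
        s->has_msg = false;
    } else {
        rv = TOY_ECLOSED; /* woken by toy_close() */
    }
    pthread_mutex_unlock(&s->mtx);
    return rv;
}

/* Safe to call from another thread: marks the socket closed and wakes
 * every thread blocked in toy_recv(). */
static void toy_close(toy_sock *s)
{
    pthread_mutex_lock(&s->mtx);
    s->closed = true;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mtx);
}

/* Thread entry point that just closes the socket it is given. */
static void *closer_thread(void *arg)
{
    toy_close((toy_sock *) arg);
    return NULL;
}
```

With nng itself, the analogous move would be calling nng_close() on the socket from a helper thread; the blocked nng_recv() should then return with an error rather than a message.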

Oct 26 '19 20:10 gdamore

What you should do is probably set a flag to wake up a thread in your user program that just closes the socket in question.

The problem is that this breaks all the convention that Python applications are used to; whenever the user hits Ctrl+C, they expect to receive a KeyboardInterrupt.

I'll check to see what pyzmq does; I'm pretty sure that zmq does the same thing nng does in terms of retrying system calls, but pyzmq does the expected thing for a Python application, which is to throw a KeyboardInterrupt exception whenever SIGINT is received.

We don’t want to force user apps to deal with interrupted system calls

I don't want to force user apps to have to deal with it, but I would like apps to be able to opt in to dealing with interrupted system calls, probably by growing an extra flag in nng_send and nng_recv.

If you’re just trying to shut down properly then close the socket.

It's pretty standard, at least in a lot of Python applications I've seen (and written!), to just launch a program and expect it to die on Ctrl+C. Whenever that doesn't work it's pretty disappointing.

I don't think it makes sense for pynng to install a signal handler for SIGINT, because applications shouldn't have to deal with a library doing that.

Oct 26 '19 22:10 codypiersall

I ran into this blog post by Martin Sustrik about EINTR: http://250bpm.com/blog:12

It mentions that pyzmq used to act like pynng is acting now, but it looks like zmq started returning EINTR and pyzmq was then satisfied.

Here are a few of the relevant quotes from the post by Sustrik:

To give you a real world example of incorrectly implemented blocking function, here's a problem we encountered with ZeroMQ couple of years ago: Ctrl+C did not work when ZeroMQ library was used from Python (via pyzmq language binding). After some investigation, it turned out that Python runtime works more or less like the examples above. If Ctrl+C signal is caught, it sets a variable in the handler and continues the execution until it gets to a point where signal-induced conditions are checked.

However, ZeroMQ library used to have a blocking recv function, that (oops!) haven't returned EINTR and rather ignored the signals.

What happened was that user called ZeroMQ's recv function from Python, which started waiting for incoming data. Then the user pressed Ctrl+C. Python's signal handler handled the signal by marking down that the process should be terminated as soon as possible. However, the execution was blocked inside ZeroMQ's recv function which never returned back to the Python runtime and thus the termination never happened.

Exiting the recv function with EINTR in case of signal solved the problem.

Additionally, he despairs for the case of signals on Windows, and mentions "to use sem_wait (which returns EINTR) instead of pthread_cond_wait." I'll look into both of these some more, and also check out zmq to see what it does. Legacy nanomsg (if I read the source right) has a compile option to automatically retry EINTR syscalls, but otherwise will return NN_EINTR.

I'm also changing the title of the issue from "Can we provide a way for applications to catch SIGINT," which doesn't actually even make sense, to "Can we provide a way for applications to know when system calls were interrupted (the EINTR case)", which I think makes more sense.

Oct 27 '19 01:10 codypiersall

Okay, on Windows it turns out that pyzmq and the legacy nanomsg bindings do the same thing as pynng, but that is just due to how Windows and POSIX differ on restartable system calls, as far as I can tell. There was an interesting discussion on pyzmq and also a workaround that pyzmq created for dealing with the lack of EINTR on Windows. So I guess on Windows, pynng can't do any better, but may be able to reuse the same hack.

I'm still holding on to hope for coming to a cleaner solution in the POSIX case :-)

Oct 27 '19 01:10 codypiersall

A little spelunking in libzmq's git history reveals that EINTR was allowed to percolate to callers in commit https://github.com/zeromq/libzmq/commit/91ea20464439b5359a5.

Additionally, and more importantly, it turns out that it is not possible for a Python application (so any application using nng in Python, whether my bindings or someone else's) to register a signal handler to solve this problem due to the way the Python runtime runs signal handlers. Here's the relevant quote from the Python signal docs:

Python signal handlers are always executed in the main Python thread, even if the signal was received in another thread. This means that signals can’t be used as a means of inter-thread communication. You can use the synchronization primitives from the threading module instead.

I found this out after I started trying to figure out what pynng can do without upstream changing to re-enable KeyboardInterrupt; turns out, not anything, while staying inside the comfy confines of the Python runtime.

Oct 27 '19 02:10 codypiersall

Sounds like this is a bug in python then. If you can’t wake another thread then your options are seriously limited.

Doing the EINTR thing is dirty as heck and non portable to boot. We could do it but I strongly dislike the approach of relying on magically getting an errno that says an interrupt arrived.

It’s notably true that there are other interrupts besides SIGINT. You don’t have that context with EINTR. The only way to get that context is from the signal handler. If python is not letting you run your own handler then it’s preventing this and it’s a serious deficiency.

Honestly if I was faced with this problem I would probably just put the tty in raw mode and handle the control c myself because clearly python is not being helpful here.

Oct 27 '19 04:10 gdamore

Sounds like this is a bug in python then.

I think it was a design decision more than a bug, maybe due to some limitations of the Python VM. I'll try to find some history for why it's this way. I imagine it's because of the Python VM being single-threaded, and things happening in other threads could break Python's guarantees, but that's conjecture.

Doing the EINTR thing is dirty as heck and non portable to boot.

Yeah, realizing that this could never work on Windows was sad for me, and it looks like pyzmq's example for how to catch Ctrl+C on Windows is basically a hack. I say that without having actually looked at the implementation yet though.

If python is not letting you run your own handler then it’s preventing this and it’s a serious deficiency.

Python does let you run your own handler, but it's limited, because internally the Python VM sets a flag in its signal handler and then calls the signal handler you registered in the main thread at a later time. To quote the docs:

A Python signal handler does not get executed inside the low-level (C) signal handler. Instead, the low-level signal handler sets a flag which tells the virtual machine to execute the corresponding Python signal handler at a later point (for example at the next bytecode instruction).
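The deferred-handler scheme the docs describe can be modeled in a few lines of C. This is only an illustration of the pattern, not CPython's code; `run_steps`, `low_level_handler`, `pending`, and the step functions are hypothetical names.

```c
#include <signal.h>
#include <stddef.h>

/* Set by the low-level handler, consumed by the "interpreter" loop. */
static volatile sig_atomic_t pending = 0;

static void low_level_handler(int signo)
{
    (void) signo;
    pending = 1; /* only record the signal; act on it later */
}

typedef void (*step_fn)(void);

/* Runs steps in order and checks for a pending signal only BETWEEN
 * steps. Returns the index of the step after which the signal was
 * noticed, or -1 if every step ran without one. */
static int run_steps(step_fn *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        steps[i]();
        if (pending) {
            pending = 0;
            return (int) i; /* this is where a high-level handler would run */
        }
    }
    return -1;
}

static void quiet_step(void)
{
}

/* Simulates Ctrl+C arriving while this step is executing. */
static void step_that_gets_sigint(void)
{
    raise(SIGINT);
}
```

The weakness is visible in the loop: if a step never returns, as with a blocking receive that swallows the interruption, the check after it is never reached.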

Oct 27 '19 18:10 codypiersall

Well, that's probably a good thing then.

You should still be able to close the socket at that point then.

One main point, in case it has not already occurred to you, is that single-threaded Python is going to be incompatible with blocking nng calls. But I think you're using the AIO framework anyway, right?

Oct 27 '19 21:10 gdamore

So I still think closing the socket is the best way to handle this. Is there some reason that won't work?

Nov 03 '19 18:11 gdamore

Sorry, I think my last post was a bit stream-of-consciousness.

Unfortunately, the Python runtime will never actually call any installed signal handlers when a blocking call is made in pynng, because the Python main loop only calls its signal handlers between bytecode evaluations: (link to where they're called in CPython source). So in the at-least-somewhat-normal case where the main thread is blocked on nng_recv() or nng_send(), the only way to kill the application is to kill -9 it.

So a very common use case where this bites me is at the read-eval-print loop (REPL). I'm pretty often experimenting, or trying out changes really quickly, and I'll have a session like this:

[08:06PM] (py37) cody@compster ~/dev/cpython
± % python
>>> from pynng import Pair0
>>> sock = Pair0(listen='tcp://127.0.0.1:31313')
>>> sock.recv()
^C
^C^C^C^C
^C^C^C^C^C^C


^C
^C
^Z
[1]  + 23218 suspended  python

tl;dr it won't work because the runtime never stops executing the current bytecode instruction, so can't ever run the Python-level installed signal handlers.

Nov 05 '19 02:11 codypiersall

Ah, ok, I see. This is unfortunate indeed. What I'd propose then is to have some special property on the socket that disables restart of an interrupted system call. NNG_NORESTART.

A way to make this work for Windows is probably to arrange for the C code to either use callbacks, or to establish its own interrupt handler. I'm not sure how easy or difficult that is in the presence of the Python runtime.

Nov 05 '19 14:11 gdamore

Btw, I've seen that same sort of "don't handle control C" from many other programs. I usually do the control-Z and kill %1 trick to get past them.

Nov 05 '19 14:11 gdamore

What I'd propose then is to have some special property on the socket that disables restart of an interrupted system call. NNG_NORESTART.

Ah yeah, a socket option seems like the right approach. Better than a flag to nng_recv. I'll work on a PR for that.

A way to make this work for Windows is probably to arrange for the C code to either use callbacks, or to establish its own interrupt handler. I'm not sure how easy or difficult that is in the presence of the Python runtime.

I still haven't looked into how the pyzmq folks fixed this on Windows, but it looked more like a workaround than a nice fix. At any rate, I think Windows users are used to Ctrl+C not working to interrupt things.

Nov 09 '19 03:11 codypiersall

Did you want to try to put together a PR, or should I add it to my backlog?

Feb 25 '20 08:02 gdamore

Going back and looking at this, I want to punch Python in the nose. Again.

Basically, the fact that they have elected to rob control from user applications for handling signals means that applications are at the mercy of others here.

To be honest, using SIGINT is 100% the wrong way to rely on CTRL-C. It's the lazy way of unsophisticated applications.

The right way (that would work on Windows as well) is to install a keyboard handler that monitors incoming key presses and does something sensible with them. If CTRL-C is meant to abort the application then it should do that.
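On a POSIX terminal, the keyboard-handler approach might start like the sketch below: turn off signal generation in the tty so Ctrl+C arrives as an ordinary byte (0x03) that the application reads itself. This covers the POSIX side only, since Windows would need its console API, and `disable_tty_signals`, `enter_key_mode`, `is_abort_key`, and `KEY_CTRL_C` are hypothetical names.

```c
#include <stdbool.h>
#include <termios.h>
#include <unistd.h>

#define KEY_CTRL_C 0x03 /* byte delivered for Ctrl+C when ISIG is off */

/* Edits terminal settings so Ctrl+C is delivered as input rather than
 * raising SIGINT, with unbuffered, unechoed one-byte reads. */
static void disable_tty_signals(struct termios *t)
{
    t->c_lflag &= ~(tcflag_t) (ISIG | ICANON | ECHO);
    t->c_cc[VMIN]  = 1; /* read() returns after one byte */
    t->c_cc[VTIME] = 0; /* no inter-byte timeout */
}

/* Applies the settings to fd and stores the previous ones in *saved so
 * the caller can restore them on exit. Returns -1 if fd is not a tty. */
static int enter_key_mode(int fd, struct termios *saved)
{
    struct termios t;

    if (tcgetattr(fd, saved) != 0) {
        return -1;
    }
    t = *saved;
    disable_tty_signals(&t);
    return tcsetattr(fd, TCSANOW, &t) == 0 ? 0 : -1;
}

static bool is_abort_key(unsigned char byte)
{
    return byte == KEY_CTRL_C;
}
```

A thread reading stdin would call is_abort_key() on each byte and, on a match, close the socket or otherwise tell the blocked thread to stop. The saved settings must be restored with tcsetattr() before exit, or the user's shell is left in raw mode.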

Signals are one of the biggest mistakes in UNIX history, and it's a good thing that Windows didn't repeat that mistake.

Jul 27 '20 00:07 gdamore

Going back and looking at this, I want to punch Python in the nose. Again.

:joy:

Did you want to try to put together a PR, or should I add it to my backlog?

I would actually like to put a PR together, but probably won't have time for a month or two. Just moved into a new house that requires a lot of work to be done on it. If you're not keen on working on this, I'd be happy to do it, but it may not make it in your next release. I forgot about this, apparently, back in November.

Jul 27 '20 15:07 codypiersall

Ok, that's fine. It's not bothering me lol.

Jul 28 '20 03:07 gdamore

Is there a fix yet? The current version still seems to have this problem.

Feb 05 '21 05:02 creativesands

Stopping by from above pynng issue, it would really be nice if this could be addressed/changed so ^c would work as expected :)

Dec 04 '22 21:12 Sec42

I have not done a fix for this.

I'm still a bit confused by this. Perhaps the problem is that use of our blocking calls creates a problem for Python users. A workaround in NNG is to use non-blocking variants, but that might be very difficult to resolve. (Did I mention how much I despise Python?)

It would be nice if someone else would put together a PR for this that added an option saying "don't use SA_RESTART", which pynng or pynng users could set.

Python has done its own users a disservice here, because any caller could run into this.

Feb 05 '23 23:02 gdamore
