thread_signal/2 throws an existence error for threads that terminated but are not yet joined

Open pmoura opened this issue 1 year ago • 11 comments

Consider:

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.0-17-g4d781a64e)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- thread_create(true, _, [alias(t)]).
true.

?- thread_property(t, P).
P = id(3) ;
P = alias(t) ;
P = status(true) ;
P = detached(false) ;
P = debug(true) ;
P = engine(false) ;
false.

?- thread_signal(t, throw(e)).
ERROR: thread `t' does not exist
ERROR: In:
ERROR:   [12] thread_signal(t,throw(e))
ERROR:   [11] toplevel_call(user:user: ...) at /Users/pmoura/lib/swipl/boot/toplevel.pl:1317
?- thread_property(t, P).
P = id(3) ;
P = alias(t) ;
P = status(true) ;
P = detached(false) ;
P = debug(true) ;
P = engine(false) ;
false.

?- thread_join(t, S).
S = true.

?- thread_create((repeat,fail), _, [alias(w)]).
true.

?- thread_signal(w, throw(e)).
true.

?- thread_join(w, S).
S = exception(e).

The exception is arguably misleading and this behavior forces wrapping thread_signal/2 calls using catch/3 as a thread may terminate between checking that it's running and calling the predicate.
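A minimal sketch of the kind of catch/3 wrapper this forces, assuming the error term has the shape error(existence_error(thread, _), _) (inferred from the message above; worth verifying on your version). The name signal_if_alive/2 is hypothetical and not part of SWI-Prolog:

```prolog
%% signal_if_alive(+Thread, +Goal) is det.
%  Hypothetical helper: deliver Goal as a signal to Thread, treating
%  "thread no longer exists" (including terminated but not yet joined)
%  as a silent no-op. Any other error is propagated unchanged.
signal_if_alive(Thread, Goal) :-
    catch(thread_signal(Thread, Goal),
          error(existence_error(thread, _), _),
          true).
```

With this wrapper, ?- signal_if_alive(t, throw(e)). would be expected to succeed quietly in the session above instead of printing the error.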

pmoura avatar Feb 09 '24 00:02 pmoura

Same problem with thread_send_message/2:

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.0-17-g4d781a64e)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- thread_create(true, _, [alias(t)]).
true.

?- thread_send_message(t, foo).
ERROR: thread `t' does not exist
ERROR: In:
ERROR:   [12] thread_send_message(t,foo)
ERROR:   [11] toplevel_call(user:user: ...) at /Users/pmoura/lib/swipl/boot/toplevel.pl:1317

Again this behavior forces using a catch/3 wrapper.

pmoura avatar Feb 09 '24 00:02 pmoura

What else do you want? Existence is a bit misleading, but a completed thread is what Unix calls a zombie process: the thing is gone, but there is still an entry in the thread/process table that allows for join/wait. It is in no way capable of processing the signal or message. We could consider another exception (permission error?), but IMO that makes things worse as it would require catching two different exceptions. If the misleading error message is your (only) concern we could add a comment to the 2nd argument of the error term?

I'm also no fan of the requirement to use catch/3. I see little alternative though. It is a bit like opening a file. In fairly static environments testing the access first may be defensible, but in a dynamic environment you must use catch/3 because the file may disappear or change permissions between the two calls.

JanWielemaker avatar Feb 09 '24 08:02 JanWielemaker

What else do you want? Existence is a bit misleading, but a completed thread is what Unix calls a zombie process: the thing is gone, but there is still an entry in the thread/process table that allows for join/wait. It is in no way capable of processing the signal or message.

I'm also no fan of the requirement to use catch/3. I see little alternative though. It is a bit like opening a file. In fairly static environments testing the access first may be defensible, but in a dynamic environment you must use catch/3 because the file may disappear or change permissions between the two calls.

If the thread message queue (the one used by thread_send_message/2) and the signal queue are only reclaimed when thread_join/2 is called, then there would be no exceptions and thus no need for a catch/3 wrapper to account for the often unpredictable cases where a thread terminates between checking that it's running and calling thread_signal/2 or thread_send_message/2. Of course, if the thread terminated by the time those calls are processed, they would be no-ops. At least in the particular case of thread_signal/2, which is used mainly to stop or debug a thread, that should not be an issue.

P.S. This implementation choice is found (from my limited testing) in ECLiPSe and Trealla Prolog. It's also how it's implemented in LVM.

pmoura avatar Feb 09 '24 08:02 pmoura

In SWI-Prolog at least, the entire thread structure is cleared when the thread terminates. So, there is no place to deliver a signal or a message. There is also no point as it would not be processed anyway.

I agree that a signal intended to tear down the thread could be ignored if it is already dead. The only candidate for that seems thread_signal(Target, abort) though. Pretty much any other signal may have other intents. For thread messages the situation is a little more difficult as, while signals are always handled if the thread is still alive, thread messages may in general be handled or not and unhandled messages are silently discarded when the thread dies (possibly not a good idea as I think about it). The sender basically never knows unless some form of report-back is implemented. On the other hand, if we have a thread that is designed to process messages forever (a very common case) and it stops doing so due to a failure or exception it is quite nice to get an exception.

Do you have documentation from the other systems on how this is handled? I'm happy to discuss the topic with other developers.

JanWielemaker avatar Feb 09 '24 09:02 JanWielemaker

In SWI-Prolog at least, the entire thread structure is cleared when the thread terminates.

How difficult would it be for that to happen only for detached threads, postponing it for attached threads until they are joined? Also, how likely is it that this change in semantics/behavior would break existing applications?

So, there is no place to deliver a signal or a message. There is also no point as it would not be processed anyway.

Indeed they would be no-ops (as I mentioned above) but that would avoid the need for catch/3 wrappers.

I agree that a signal intended to tear down the thread could be ignored if it is already dead. The only candidate for that seems thread_signal(Target, abort) though. Pretty much any other signal may have other intents. For thread messages the situation is a little more difficult as, while signals are always handled if the thread is still alive, thread messages may in general be handled or not and unhandled messages are silently discarded when the thread dies (possibly not a good idea as I think about it). The sender basically never knows unless some form of report-back is implemented. On the other hand, if we have a thread that is designed to process messages forever (a very common case) and it stops doing so due to a failure or exception it is quite nice to get an exception.

A possible alternative in the last scenario would be to use thread_property/2 to check that the thread is still running. Not exactly the same thing, I agree.
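A rough sketch of such a check, assuming thread_property/2 reports status(running) for live threads as in the session at the top of this issue. The name thread_running/1 is hypothetical:

```prolog
%% thread_running(+Thread) is semidet.
%  Hypothetical helper: true if Thread currently reports status(running).
%  Fails for terminated-but-unjoined threads and for threads that no
%  longer exist at all.
thread_running(Thread) :-
    catch(thread_property(Thread, status(running)),
          error(existence_error(thread, _), _),
          fail).
```

The result is only a snapshot: the thread may terminate right after the check succeeds, so a later thread_send_message/2 or thread_signal/2 call can still raise the existence error.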

Do you have documentation from the other systems on how this is handled? I'm happy to discuss the topic with other developers.

I don't think this level of implementation details is explicit in the documentation of other open-source systems. At least not that I could find in a quick search. I'm part of the team developing LVM, but this is a commercial system and its documentation is not (currently) publicly available.

A discussion between developers would be welcome. My idea (if I ever find the time) is to update the threads draft standardization proposal (which Trealla Prolog is currently using as a guide) and add a test set to the Logtalk distribution. It would be great to minimize the differences between systems for better portability of multi-threading applications.

pmoura avatar Feb 09 '24 10:02 pmoura

How difficult would it be for that to happen only for detached threads, postponing it for attached threads until they are joined? Also, how likely is it that this change in semantics/behavior would break existing applications?

It is probably easier to silently ignore messages and signals when we detect that the thread is in a zombie state. I don't really expect that to break properly functioning applications.

I expect that silently ignoring signals and messages that cannot be delivered is more a cause of problems than a way to avoid them. Notably you typically send a message to a thread if you want it to be processed. For signals the story is a bit different. Most signals are for aborting or debugging. I have also used signals to actually make threads do something though. For sending messages we have an option list that we could use to avoid an error (like close/2). We do not have that for thread_signal/2. One could also consider a high-level interface for aborting and joining a thread. The debug usage is mostly interactive and controlled by higher-level utilities.

A discussion between developers would be welcome.

If you organize one, I'm happy to join. You've done a lot of good work for the standard and I still regret that it didn't continue. The SWI-Prolog thread API evolved quite a bit since then.

JanWielemaker avatar Feb 09 '24 12:02 JanWielemaker

Using an option in thread_send_message/3 and in an upcoming thread_signal/3 predicate to decide behavior when the thread is no longer running sounds like a good way forward without introducing backwards compatibility issues. The option could be named e.g. errors(Action) with the possible values for Action being throw, fail, succeed.
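To make the proposal concrete, here is a sketch of how an errors(Action) option could behave, written as a user-level wrapper around thread_signal/2. Everything here is an assumption for illustration: thread_signal_opt/3 and dead_thread_action/2 are hypothetical names, the default of throw is a guess that preserves current behavior, and the error term shape is inferred from the messages earlier in this issue.

```prolog
:- use_module(library(option)).
:- use_module(library(error)).

%% thread_signal_opt(+Thread, +Goal, +Options) is semidet.
%  Hypothetical stand-in for the proposed thread_signal/3.
%  Options: errors(Action) with Action one of throw (assumed default),
%  fail or succeed. Only the "thread does not exist" error is affected.
thread_signal_opt(Thread, Goal, Options) :-
    option(errors(Action), Options, throw),
    must_be(oneof([throw, fail, succeed]), Action),
    catch(thread_signal(Thread, Goal),
          Error,
          dead_thread_action(Action, Error)).

% Map the selected action onto the caught error.
dead_thread_action(succeed, error(existence_error(thread, _), _)) :- !.
dead_thread_action(fail,    error(existence_error(thread, _), _)) :- !, fail.
dead_thread_action(_, Error) :-
    throw(Error).               % errors(throw), or an unrelated error
```

For example, thread_signal_opt(t, throw(e), [errors(succeed)]) would succeed silently for a terminated thread, while omitting the option keeps the current exception.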

pmoura avatar Feb 09 '24 15:02 pmoura

The option could be named e.g. errors(Action) with the possible values for Action being throw, fail, succeed.

Or copy ISO close/2, which implements force(true) to ignore any error. I'm no fan, but if such a thing is acceptable to some relevant Prolog implementations, I'm happy with the compromise. One still would need to define what needs to happen if the thread is already joined or, if it is a detached thread, terminated and vanished completely.

JanWielemaker avatar Feb 09 '24 15:02 JanWielemaker

My experience from ~20 years ago, using POSIX threads on a non-Unix real-time OS (VxWorks, IIRC) is that if you don't do things exactly right,(*) all kinds of weird things can happen -- and I don't see how (or why) SWI-Prolog should deal with those situations. There's only so much you can do when the underlying system is buggy or badly designed. (In the case of VxWorks, my recollection is that it had its own threading model and provided a POSIX API that was either not quite compliant or buggy or both.) So, your proposal might fix the problem on one OS but not on another - and possibly might make things worse on another OS. (Maybe the API for pthreads has improved over the years, but when I encountered it, I did not enjoy the experience.)

(*) Where "exactly right" was often undefined in the documentation.

kamahen avatar Feb 09 '24 19:02 kamahen

The implementation is not really a problem. Linux pthreads is rock solid. MacOS has a few tweaks I managed to work around. The Windows implementation has some limits one can work around mostly by using native Windows alternatives for some. NetBSD and OpenBSD had some flaws in the past, but seem stable now as well.

The simple question is what to do if you talk to a thread that terminated, but is not yet joined. It seems some systems silently ignore the signals and messages while SWI-Prolog raises an exception. I still think that is what should happen. The alternative is much harder. Checking it is still alive before sending a message is no guarantee it is alive when you send the message. I'm more tempted to add a warning, similar to the one for detached threads not exiting cleanly, for threads that have pending messages in their input queue when they are joined or (for detached threads) die.

JanWielemaker avatar Feb 10 '24 11:02 JanWielemaker

From this discussion and my own experience, it seems clear that, depending on the application, we ideally want either to silently succeed or to throw an exception when sending a message or a signal to a terminated (but not yet joined attached) thread. My preference is to be able to select the desired behavior using an option. For systems like LVM and Trealla Prolog, where the implementation of multi-threading features is a work-in-progress, this is an ideal time to sync on a common solution. I will draw the attention of ECLiPSe and YAP developers to this discussion. Thanks for all the feedback.

pmoura avatar Feb 10 '24 12:02 pmoura