Revisiting gracefulClose with STM racing
First of all, please read Implementing graceful-close in Haskell network library. This PR revisits "Approach 4: callbacks of the IO/Timer manager".
What we know so far:
-
timeoutcannot be used in the resource-release clause ofbracket. This is because an asynchronous exception, which is used intimeout, cannot be delivered in the clause. - Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.
The approach 4 uses MVar. Even we use tryPutMVar, the IO manager is sometime blocked. This results in so many CLOSE_WAITs on Linux. This name is confusing but it appeared that they represent entries of the listen queue, not established connections waiting close on Linux. Anyway, it is a clear evidence of the blocked IO manager.
@khibino suggested me to use STM actions both for timeout and recv-ready. Since they are a single action for earch, they are not blocked. And they can be composed because they are STM actions. This PR implements his idea instead of using MVar.
I deployed this network library on my test server. So far, I don't see any CLOSE_WAIT.
Cc: @larskuhtz @snoyberg @fosskers @nh2
@kazu-yamamoto There are stilla couple things I don't understand, you need to elaborate:
timeoutcannot be used in the resource-release clause ofbracket. This is because an asynchronous exception, which is used intimeout, cannot be delivered in the clause.
Where is this timeout interaction in bracket? Which bracket are we talking about?
It is only shortly mentioned in your post, but I don't really understand what scenario/implementation that's referring to.
Thus, I do not currently understand why we can't just do
timeout lingerTimeout (forever (recv s))
- Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.
Can you outline the (problematic) Approach 4 with sample code, so that I can understand it, and its problem?
I also don't understand what's wrong with the current code (which is Approach 3).
Is that explained somewhere? What's the current problem, is there an open issue I missed?
The commit message should mention what the problem is that the commit solves.
Thank you -- I'm eager to participate in this!
timeoutcannot be used in the resource-release clause ofbracket. This is because an asynchronous exception, which is used intimeout, cannot be delivered in the clause.
Where is this
timeoutinteraction inbracket? Whichbracketare we talking about?
Sorry for your confusion. This is about approach 2.
- Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.
Can you outline the (problematic) Approach 4 with sample code, so that I can understand it, and its problem?
77ee187d8eb00755213d61fc6bfa85dae6816fd8 is the commit to remove the final code of approach 4.
I also don't understand what's wrong with the current code (which is Approach 3).
Unnecessary delay happens.
For instance, if we delay 30 us but a packet arrives after 10us, we wait for unnecesary 20 us.
The commit message should mention what the problem is that the commit solves.
Thank you for the suggestion. I will.
Sorry for your confusion. This is about approach 2.
Yes, that I understand, but in the post it only says:
Of course, a very easy way is combine the
timeoutfunction and the originalrecvfunction which may trigger the IO manager. This does not work if used withbracketbecausetimeoutthrows an asynchronous exception but it does not reach the resource-release action inbracket.
I don't understand which bracket is being referred to here. What's the proposed example code that doesn't work? It it something like
timeout lingerTimeout (forever (recvUntilEof s))
Why doesn't that work?
Suppose we define:
gracefulClose t s = timeout t (recv s ...)
We cannot use this function as follows:
bracket (openSocket ...) gracefulClose $ \sock ...
bracket (openSocket ...) gracefulClose $ \sock ...
Maybe we (in warp, and users in general) should not use the function that way anyway!
Because then you get an uninterruptible cleanup action that runs for a long time. You keep pressing Ctrl+C and the program does not terminate.
Shouldn't we write the function so that the second async exception immediately calls close and stops the graceful-close loop?
For example, inlining the definition of bracket (also described in docs of mask and modifying it a bit:
-- | The first async exception will initiate graceful closing,
-- a second one will close the socket (which may result in
-- TCP connection resets of the other side).
withInterruptibleGracefulClose :: (Socket -> IO a) -> IO a
withInterruptibleGracefulClose action = do
mask $ \restore -> do
s <- openSocket
let gracefulCloseOnExceptionClose = (gracefulClose s) `onException` (close s)
r <- restore (action s) `onException` (restore gracefulCloseOnExceptionClose)
_ <- restore gracefulCloseOnExceptionClose
return r
That said, I do see the appeal of having a function that can be used as you say, with
bracket (openSocket ...) gracefulClose $ \sock ...
To make terminology clear, we would then have to declare that gracefulClose is an "interruptible" function, as referred to by these docs:
Interruptible functions
Docs of bracket say:
Bracket wraps the release action with mask, which is sufficient to ensure that the release action executes to completion when it does not invoke any interruptible actions, even in the presence of asynchronous exceptions. For example, hClose is uninterruptible when it is not racing other uses of the handle. Similarly, closing a socket (from "network" package) is also uninterruptible under similar conditions. An example of an interruptible action is killThread. Completion of interruptible release actions can be ensured by wrapping them in uninterruptibleMask_, but this risks making the program non-responsive to Control-C, or timeouts.
And Applying mask to an exception handler says more:
Some operations are interruptible, which means that they can receive asynchronous exceptions even in the scope of a mask. Any function which may itself block is defined as interruptible; this includes takeMVar (but not tryTakeMVar), and most operations which perform some I/O with the outside world. ... It is useful to think of mask not as a way to completely prevent asynchronous exceptions, but as a way to switch from asynchronous mode to polling mode. The main difficulty with asynchronous exceptions is that they normally can occur anywhere, but within a mask an asynchronous exception is only raised by operations that are interruptible (or call other interruptible operations). In many cases these operations may themselves raise exceptions, such as I/O errors, so the caller will usually be prepared to handle exceptions arising from the operation anyway. To perform an explicit poll for asynchronous exceptions inside mask, use allowInterrupt.
Idea
So, could we use interruptible?
For example, could we do
gracefulClose s tmout0 = sendRecvFIN `E.finally` close s
where
sendRecvFIN = do
-- Sending TCP FIN.
ex <- E.try $ shutdown s ShutdownSend
case ex of
Left (E.SomeException _) -> return ()
Right () -> do
-- Giving CPU time to other threads hoping that
-- FIN arrives meanwhile.
yield
-- Waiting TCP FIN.
-- Make the graceful-close waiting `interruptible`
-- even when `gracefulClose` is called in `mask`, e.g. as
--
-- bracket (openSocket ...) gracefulClose $ \sock ...
--
-- If the timeout part gets interrupted by an async exception,
-- the `finally` above will close the socket.
interruptible (timeout lingerTimeout recvUntilEof)
(Note interruptible has the same implementation as the restore function of mask; both use unsafeUnmask.)
Your research is great!
Sorry but it seems to me that I misunderstand timeout.
Through some experiments, timeout works well in Control.Exception.bracket.
I need to understand why it works.
The reason why I misunderstand is probably that I use UnliftIO.Exception.bracket.
timeout does not work in it.
import Control.Exception
-- import UnliftIO.Exception
import Control.Concurrent
import System.Timeout
main :: IO ()
main = bracket newEmptyMVar (timeout 3000000 . takeMVar) $ \_ -> return ()
Since timeout works in bracket, recvEOFloop is now based on timeout.
recvEOFevent is nicer than recvEOFloop since recvEOFevent does not spawn a thread.
The reason timeout does not work with the unliftio version of bracket is that it runs the cleanup handler using uninterruptibleMask_, so even interruptible operations won't be interruptible anymore (see https://hackage.haskell.org/package/unliftio-0.2.25.0/docs/src/UnliftIO.Exception.html#bracket). I must admit that personally I don't like the design of unliftio; it uses functions with the same name (bracket, catch), but with subtly different semantics (uninterruptibleMask versus mask in bracket, only catching synchronous exceptions in catch). Not only are these differences subtle, they are very difficult to debug and can likely result in non-deterministic failures.
I have already removed the dependency for unliftio from my packages on github.
I'm now trying to remove it from warp.
I have already removed the dependency for
unliftiofrom my packages on github. I'm now trying to remove it fromwarp.
@kazu-yamamoto Not sure that's a good idea, see below:
I must admit that personally I don't like the design of
unliftio; it uses functions with the same name (bracket,catch), but with subtly different semantics (uninterruptibleMaskversusmaskinbracket, only catching synchronous exceptions incatch). Not only are these differences subtle, they are very difficult to debug and can likely result in non-deterministic failures.
@edsko That behaviour is arguably the correct thing to do in most cases, and the reason why using unliftio makes applications drastically less buggy.
Most uses of bracket do not intend to handle (and then not re-throw) async exceptions, but accidentally do, which makes the programs wrong (usually deadlock).
This is quite nicely explained in https://tech.fpcomplete.com/haskell/tutorial/exceptions/#sync-vs-async
An asynchronous exception is totally different. It is a demand from outside of our control to shut down as soon as possible. If we were to catch such an exception and recover from it, we would be breaking the expectations of the thread that tried to shut us down. Instead, with asynchronous exceptions, exception handling best practices tell us we're allowed to clean up, but not recover. For example, the timeout function uses asynchronous exceptions. What should the expected behavior here be? [Example]
The example shows exactly the case discussed in this issue: "timeout + delay-on-interruption"
In most application code, the additional delay should not happen; an async exception should reliably kill the entire thread. That should be the default; thus unliftio's timeout is good.
But gracefulClose has special semantics, where we want only the second exception to kill the thread. In that case, we need to use the normal timeout with it's pitfalls (but it's OK to use here, because we're implementing a very special case).
@kazu-yamamoto Thus my recommendation is be to stick to unliftio as the default, and use Control.Exception only for special cases such as "timeout + delay-on-interruption".
@kazu-yamamoto Thus my recommendation is be to stick to
unliftioas the default, and useControl.Exceptiononly for special cases such as "timeout + delay-on-interruption".
As a heavy user of unliftio, I should admit the following two:
- With
unliftio, some clean-up functions cannot work. Examples are not onlygracefulClosebut alsobyeoftls. When these functions are used,Control.Exceptionmust be used. - We cannot ask all package maintainers to introduce
unliftio. If we could useunliftioonly over packages relating our project, it would give us sense of security. However, the reality is not. In the real world, our project based on bothunliftioandControl.Exception. This introduces new anxiety.
That said, I would like to go with only Control.Exception and carefully write up exception handlers to rethrow asynchronous exceptions.
I'm going to adopt the asynchronous-exception-less programming with Control.Exception.
An example is gracefulClose in the case where EventManager is available.
If not available, timeout is used.
I would like to go with only
Control.Exceptionand carefully write up exception handlers to rethrow asynchronous exceptions.
I think any choice you make for network in this regard is fine -- as long as the person that writes the code is aware what happens (and you are), both approaches are OK. I just wanted to support the idea of unliftio in general, because I think it's the saner default.
I found no problems in my field testing. So, I have merged this PR and released a new version. If we find problems, let's open a new issue.
Unfortunately, I saw CLOSE_WAIT today.
I will open a new issue.