Thread safety of Datagram send and receive
The package documentation claims that send and receive operations on the same socket are not thread safe:
> The proper programming model is that one Socket is handled by a single thread. If multiple threads use one Socket concurrently, unexpected things would happen. There is one exception for multiple threads vs a single Socket: one thread reads data from a Socket only and the other thread writes data to the Socket only.
In a StackOverflow discussion, however, it was pointed out that POSIX defines the underlying send, sendto, recv and recvfrom calls as atomic operations, which would make the corresponding library functions thread-safe at least for Datagram sockets.
With this library mostly being a thin wrapper around the Berkeley sockets API: are operations on the same Datagram socket atomic? If not, why not?
I'm not sure that these APIs for UDP are thread-safe because they would involve the IO manager.
> because they would involve the IO manager.
Can you elaborate on this? AFAIK the GHC IO manager is built around epoll, notifying the runtime once new data is available on a socket so that operations blocked on it can be resumed.
So my guess is that if the OS API is still designed to make sendTo and recvFrom atomic, epoll notifications will only be triggered once each of these atomic operations has finished.
The reason I'm bringing this up is that I'd like to handle many UDP requests concurrently.
While an accept on Stream sockets always creates a new socket that can be passed to a new thread, with Datagram sockets I have to decide between passing the same socket to each forked thread and creating a short-lived socket per thread (which causes more context switches and makes sent packets have a different source port, which can be a problem with firewalls).
From my understanding, just using a single sender and a single receiver connected by a TChan won't be sufficient, as the single sender might still be bound by the latency of the socket. Or does the "fire and forget" approach mitigate socket latency issues?
@kazu-yamamoto I found a discussion on that topic that suspects the reason behind the non-thread-safety remark in the documentation to be the close behaviour of fdSockets, introduced in 30b7d7971b5e60da13dcb3c58244c66f8a308380.
So that would imply that re-using sockets across threads is fine as long as they are not fdSockets.
While digging through some Haskell project code dealing with UDP I've found several code bases where multiple threads use the same Datagram socket concurrently.
All of them are from earlier than 2018 and thus predate the commit introducing that note into the docs. Still, it looks like a significant change, and it is odd that the reasoning behind it is so hard to find.
@schmittlauch First of all, if you would like to use multiple workers after accepting on a so-called listen socket for UDP, please have a look at my blog article.
https://kazu-yamamoto.hatenablog.jp/entry/2020/02/18/145038
The "New connections" section describes how to use UDP like TCP.
Second, I don't worry about close this time. It's an issue when sockets are reused.
Third, the IO manager.
When recvfrom(2) in recvFrom succeeds, there is no problem.
Otherwise, recvFrom asks the IO manager to wake it up when data is available. Suppose that another instance of recvFrom asks the same thing of the IO manager. When data becomes available, what happens? I'm not sure. I guess a race happens: the winner can read the data, while the loser's recvfrom(2) fails again. I'm not sure that this failure is safely handled. We need to check.
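To make the concern concrete, the receive path can be pictured roughly like this (a simplified sketch, not the actual network internals; tryRecvFrom is a hypothetical stand-in for the package's non-blocking recvfrom(2) wrapper):

import Control.Concurrent (threadWaitRead)
import Data.ByteString (ByteString)
import System.Posix.Types (Fd)

-- Hypothetical non-blocking wrapper around recvfrom(2): returns Nothing
-- when the call would block (EWOULDBLOCK). Stubbed here; the real call
-- lives inside the network package's internals.
tryRecvFrom :: Fd -> IO (Maybe ByteString)
tryRecvFrom = undefined

-- Simplified shape of the receive path: try first, and if no datagram
-- is queued, park on the IO manager until the fd becomes readable.
-- With several threads parked on the same fd, readiness wakes all of
-- them; one wins the race, the others see EWOULDBLOCK and park again.
recvLoop :: Fd -> IO ByteString
recvLoop fd = do
    r <- tryRecvFrom fd
    case r of
        Just bs -> return bs
        Nothing -> threadWaitRead fd >> recvLoop fd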
@vdukhovni might be interested in this topic.
@schmittlauch Note that the technique described in my blog is NAT-friendly.
NAT-friendly code is available here: https://github.com/kazu-yamamoto/quic/blob/master/Network/QUIC/Socket.hs#L29
Of course, this approach is fragile to NAT-rebinding.
Network.Run.UDP.runUDPServerFork is the utility for this purpose.
https://hackage.haskell.org/package/network-run-0.2.2/docs/Network-Run-UDP.html
> @vdukhovni might be interested in this topic.
I'm willing to comment.
Indeed for UDP sockets with multiple polling threads, without some sort of locking (and consequent overhead) you get a thundering herd if you just naïvely share the UDP server socket. Nothing breaks, but it scales very poorly when there are many idle threads waiting for a message: all the threads get woken up, one reads the message, and the rest go back to epoll()/kqueue() when they discover that there's no message to read.
The solution is to give each thread its own socket with SO_REUSEPORT.
Below is an example. While the naïve code (not shown) makes ~n^2 calls to recvfrom() when doing n round trips of ping -> pong, the code below makes ~2n calls (one immediate, and a second via kqueue()/epoll() when the data is not immediately available). The sockets are kept unshared for performance reasons, not correctness reasons. If you want poor performance, go with the naïve approach. :-)
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Concurrent
import Control.Monad
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as LB
import Network.Socket
import Network.Socket.ByteString

main :: IO ()
main = do
    let hints = defaultHints { addrFlags = [AI_NUMERICHOST, AI_NUMERICSERV]
                             , addrSocketType = Datagram }
    ais <- getAddrInfo (Just hints) (Just "127.0.0.1") (Just "12345")
    forM_ ais \ai -> do
        -- 64 pong sockets, each bound to the same address via ReusePort,
        -- each owned by exactly one thread
        forM_ [1..64] \i -> do
            socket (addrFamily ai) (addrSocketType ai) (addrProtocol ai) >>= \s -> do
                setSocketOption s ReusePort 1
                bind s $ addrAddress ai
                forkIO (pong s i)
        ping ais
    threadDelay 1000000

ping :: [AddrInfo] -> IO ()
ping ais = do
    forM_ ais \ai -> do
        socket (addrFamily ai) (addrSocketType ai) (addrProtocol ai) >>= \s -> do
            forM_ [1..64] \i -> do
                threadDelay 5000
                let addr = addrAddress ai
                    obuf = LB.toStrict $ B.toLazyByteString $ B.byteString "ping " <> B.intDec i
                void $ sendTo s obuf addr
                recvFrom s 32 >>= B.putStrLn . fst

pong :: Socket -> Int -> IO ()
pong s i = do
    -- each pong thread serves exactly one datagram, then closes its socket
    (ibuf, addr) <- recvFrom s 32
    void $ sendTo s (reply ibuf) addr
    close s
  where
    reply ibuf =
        LB.toStrict $ B.toLazyByteString
            $ B.byteString ibuf
           <> B.byteString ", pong "
           <> B.intDec i
Bottom line, I think this issue could be closed. Yes, send, sendTo, recv and recvFrom are thread-safe, but, given the IO-manager glue, racing multiple threads to read from the same socket performs poorly. You get much better results with ReusePort (on servers).
I don't know of a good reason for UDP clients to share a socket, you generally want each client to see just the responses to the query it sent, so a dedicated socket for each client thread also makes more sense.
Perhaps the documentation could be changed to discuss performance, rather than safety, maybe that'd be less confusing?
Finally, if the processing for each client request is expensive enough (high latency), ReusePort may lead to higher latency for client requests when a request lands on a thread currently processing a request. In such situations it makes sense to have a small number of threads handling the network, and placing requests in a shared queue, with a larger pool of worker threads processing the requests. It all depends on the ratio between packet arrival spacing and request service latency, context switch costs, ...
The right design may depend on the application; there may not be a single right answer. At C10M loads you may be writing C code or assembly to run on bare metal (no OS, no Haskell RTS, ...) and polling the interface for packets rather than taking interrupts and context switches.
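A minimal sketch of that queue-based design, assuming a handler function and an illustrative pool size of 8 (runQueuedServer is a made-up name, not a library API); one receiver and one sender own the socket, matching the one-reader/one-writer exception from the docs, and replying through the receiving socket keeps the source port stable:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_, void)
import Data.ByteString (ByteString)
import Network.Socket (Socket)
import Network.Socket.ByteString (recvFrom, sendTo)

runQueuedServer :: Socket -> (ByteString -> IO ByteString) -> IO ()
runQueuedServer sock handle = do
    inQ  <- newTQueueIO
    outQ <- newTQueueIO
    -- worker pool: drain requests, compute replies off the socket threads
    replicateM_ 8 $ forkIO $ forever $ do
        (msg, peer) <- atomically (readTQueue inQ)
        resp <- handle msg
        atomically (writeTQueue outQ (resp, peer))
    -- single sender: the only thread writing to the socket
    _ <- forkIO $ forever $ do
        (resp, peer) <- atomically (readTQueue outQ)
        void $ sendTo sock resp peer
    -- single receiver: the only thread reading from the socket
    forever $ do
        (msg, peer) <- recvFrom sock 4096
        atomically (writeTQueue inQ (msg, peer))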
Thanks to both of you for your helpful comments.
> Perhaps the documentation could be changed to discuss performance, rather than safety, maybe that'd be less confusing?
Yes, that'd be helpful.
> I don't know of a good reason for UDP clients to share a socket, you generally want each client to see just the responses to the query it sent, so a dedicated socket for each client thread also makes more sense.
Forking a thread for each received message (maybe with a semaphore limiting the maximum number) allows for more dynamic scaling: instead of having a fixed number of worker threads running independently of the actual application load, threads are spawned according to demand.
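A minimal sketch of that pattern (forkPerMessage and the limit of 64 are illustrative, not library API); the semaphore is taken before the fork, so the receive loop applies backpressure once the cap is reached, and the concurrent sendTo calls from the handlers rely on datagram sends being atomic, as discussed above:

import Control.Concurrent (forkIO)
import Control.Concurrent.QSem
import Control.Exception (finally)
import Control.Monad (forever, void)
import Data.ByteString (ByteString)
import Network.Socket (Socket)
import Network.Socket.ByteString (recvFrom, sendTo)

forkPerMessage :: Socket -> (ByteString -> IO ByteString) -> IO ()
forkPerMessage sock handle = do
    sem <- newQSem 64                 -- cap on in-flight handler threads
    forever $ do
        (msg, peer) <- recvFrom sock 4096
        waitQSem sem                  -- blocks once the cap is reached
        void $ forkIO $
            (handle msg >>= \resp -> void (sendTo sock resp peer))
                `finally` signalQSem sem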
> it makes sense to have a small number of threads handling the network, and placing requests in a shared queue, with a larger pool of worker threads processing the requests.
With this approach, sending the response can be an issue: once a Datagram socket has been connected to a remote address, it only accepts datagrams from that same IP-port combination. So if messages are accepted through socket A bound to port X, then passed to a pool of worker threads, and finally sent by another sender thread through socket B bound to port Y, clients might not accept the response because of the different source port.
I ran into this when I tried to test my server using nc -u, because that uses connected UDP sockets.
But when writing both client and server yourself, it is possible to use unconnected sockets only.
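For illustration, a minimal connected-UDP client along those lines (address and port are placeholders); after connect, the kernel delivers only datagrams from that exact IP:port pair, which is why a reply from a different source port never arrives:

import qualified Data.ByteString.Char8 as B
import Network.Socket
import Network.Socket.ByteString (recv, send)

udpClient :: IO ()
udpClient = do
    let hints = defaultHints { addrSocketType = Datagram }
    ai:_ <- getAddrInfo (Just hints) (Just "127.0.0.1") (Just "12345")
    s <- socket (addrFamily ai) (addrSocketType ai) (addrProtocol ai)
    connect s (addrAddress ai)        -- fixes the expected peer IP:port
    _ <- send s (B.pack "ping")
    recv s 4096 >>= B.putStrLn        -- only the connected peer's reply arrives
    close s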
> Perhaps the documentation could be changed to discuss performance, rather than safety, maybe that'd be less confusing?
> Yes, that'd be helpful.
Would you like to suggest specific text in a pull request? Perhaps Yamamoto-san (@kazu-yamamoto) would be willing to review and merge less confusing text. You don't need to explain every detail, just take care of anything that's outright confusing or misleading, keeping the explanations reasonably concise.
Basically, racing multiple readers on the same socket should be avoided for efficiency reasons, even with datagram transports, where the atomicity of messages generally means that you don't sacrifice safety. You could briefly mention ReusePort as a more performant (recommended) way of scaling a UDP service to multiple CPUs.
@schmittlauch Would you like to improve the doc or just close this issue?
@kazu-yamamoto I am still tinkering with sockets and networking, so I still need some time to gather experience. After that, I am willing to improve the docs.
@vdukhovni One more question: is there any reason against sending and receiving in parallel on the same socket in two different threads, i.e. receiving (and blocking) in one thread while sending in exactly one other thread at the same time?
As far as I understand, the main reason against using a socket across multiple threads is that performance degrades when blocking on receive, which doesn't happen when sending.
Otherwise I'd need to abort the blocking receiveFrom regularly with a timeout, then check whether there are new messages to be sent in a TQueue, and then receiveFrom again in the same thread.
@schmittlauch
> One more question: is there any reason against sending and receiving in parallel on the same socket in two different threads, i.e. receiving (and blocking) in one thread while sending in exactly one other thread at the same time?
This approach is recommended, as written in the manual of network. See below:
> There is one exception for multiple threads vs a single Socket: one thread reads data from a Socket only and the other thread writes data to the Socket only.
> Otherwise I'd need to abort the blocking receiveFrom regularly with a timeout, then check whether there are new messages to be sent in a TQueue, and then receiveFrom again in the same thread.
We can use two threads to separate sending and receiving safely. :-)
> We can use two threads to separate sending and receiving safely. :-)
So long as neither thread closes the socket while the other is still active, separate reader/writer threads are fine, one of each per socket.
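A minimal sketch of that one-reader/one-writer split (sendRecvSplit is an illustrative name, not a library function); one thread blocks in recvFrom while the other drains a queue of outgoing messages, and neither closes the socket while the other is active:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, void)
import Data.ByteString (ByteString)
import Network.Socket (SockAddr, Socket)
import Network.Socket.ByteString (recvFrom, sendTo)

sendRecvSplit :: Socket
              -> TQueue (ByteString, SockAddr)      -- outgoing messages
              -> ((ByteString, SockAddr) -> IO ())  -- handler for incoming
              -> IO ()
sendRecvSplit sock outQ onMessage = do
    -- writer: the only thread sending on this socket
    _ <- forkIO $ forever $ do
        (msg, peer) <- atomically (readTQueue outQ)
        void $ sendTo sock msg peer
    -- reader: the only thread receiving on this socket
    forever $ recvFrom sock 4096 >>= onMessage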
Just to give an estimate of when I'm going to do this: I hope to find time in October.