
TCP Send Mangles ByteStrings

Open andrewthad opened this issue 7 years ago • 3 comments

I'm using socket-0.7 with GHC-8.0 and building on NixOS. The changelog for 0.8 does not indicate that this was addressed in the new release, but if upgrading will fix this, let me know. The application we are writing sends data to carbon-cache (part of the graphite technology stack) using a plaintext protocol. Basically, we open a TCP socket, make a large number of sendAll calls (no recvs), and then close the socket. What goes wrong is that the haskell-socket library occasionally mangles the bytes it sends across the wire. I have confirmed this using tcpflow. I cannot give an actual example because the data contains confidential information, but here is something similar to what is happening to the TCP stream.

Expected:

foo.bar.baz.node1.metric 1.98 1490804745
foo.bar.baz.node2.metric 2.16 1490804745
foo.bar.baz.node3.metric 2.04 1490804745

Actual:

foo.bar.baz.node1.metric 1.98 1490804745
foo.bar.baz.node2.metfoo.bar.baz.node56 2.43 14908047360804745
foo.bar.baz.node3.metric 2.04 1490804745

Just for extra clarity, the second line has had a fragment deleted from it, and another line has replaced that fragment:

foo.bar.baz.node2.met[[[foo.bar.baz.node56 2.43 1490804736]]]0804745

Basically, another line from somewhere else in the TCP stream shows up in the middle of the line. Here is some additional information that may be helpful:

  • This happens regardless of whether we use sendAll or sendAllBuilder.
  • This only happens when using a real network interface. This issue is never manifested when using the loopback interface. In the application my team works on, we have a TCP connection to localhost and to a remote host. We send the same metrics to both. Only the metrics going to the remote host get mangled. This makes me suspect that there is a subtle concurrency issue. The loopback interface is probably fast enough to hide it.
  • The application is multi-threaded and makes concurrent calls to sendAll.
  • Mangled lines occur about thirty times per hour (out of the 5 million lines sent every hour).

That's everything I know. I've looked through the code a little, and I cannot see any obvious issues. If there's any additional information that I could provide, let me know.

andrewthad avatar Mar 29 '17 18:03 andrewthad

@mckeankylej

andrewthad avatar Mar 29 '17 18:03 andrewthad

Hi Andrew,

thanks for the report. I think the problem is that sendAll and the related functions are not thread-safe. The behavior you're observing can occur if the TCP send buffer is nearly full and a single send system call returns an n where 0 < n < bytestring_len. In this case sendAll waits until the socket becomes writable again and then tries to transmit the remaining bytes of the string. If another thread sends on the same socket during that wait, its data ends up between the two parts.
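To make the partial-write case concrete, here is a minimal model of a sendAll-style loop. It is only a sketch and not the library's actual code: `sendAllModel`, `sendSome` and `demo` are hypothetical names, Strings stand in for ByteStrings, and the "kernel" is an in-memory IORef that accepts at most 4 bytes per call.

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Model of a sendAll-style loop. `sendSome` stands in for a single send
-- system call: it accepts some prefix of the input and returns its length
-- (assumed to be at least 1 for non-empty input).
sendAllModel :: (String -> IO Int) -> String -> IO ()
sendAllModel sendSome = go
  where
    go []    = pure ()
    go bytes = do
      n <- sendSome bytes
      -- After a partial write, a real implementation waits here for the
      -- socket to become writable. Another thread's send can run in this
      -- gap, so its bytes land between our prefix and our remainder.
      go (drop n bytes)

-- Toy "kernel" that accepts at most 4 bytes per call and appends them
-- to an in-memory wire.
demo :: IO String
demo = do
  wire <- newIORef ""
  let sendSome bs = do
        let chunk = take 4 bs
        modifyIORef wire (++ chunk)
        pure (length chunk)
  sendAllModel sendSome "node2.metric 2.16\n"
  readIORef wire
```

With a single caller the wire ends up with the whole line intact. With two threads running `sendAllModel` against the same wire, nothing prevents their chunks from alternating, which matches the shape of the mangled line in the report.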

I confess this is a little counter-intuitive, and this behavior is not explicitly documented. It is safe, though, to read and write simultaneously using two threads. The solution is therefore to serialize the writes (for example by using an MVar and one extra thread).
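One way to serialize the writes could look like the sketch below. Note that it swaps in a simpler variant of the suggestion: an MVar used as a plain lock around the whole send, rather than an MVar plus a dedicated writer thread. `serializeSends` is a hypothetical helper, not part of the socket library, and it takes the raw send action as a parameter so that it makes no assumption about the library's exact signatures.

```haskell
import Control.Concurrent.MVar (newMVar, withMVar)

-- Hypothetical helper (not part of the socket library): wrap any
-- "send the whole payload" action so only one thread runs it at a time.
serializeSends :: (payload -> IO ()) -> IO (payload -> IO ())
serializeSends rawSendAll = do
  lock <- newMVar ()
  -- withMVar holds the lock for the full duration of rawSendAll,
  -- including any waits after partial writes, and releases it on exceptions.
  pure (\payload -> withMVar lock (\_ -> rawSendAll payload))
```

The wrapper would be created once per socket, from a partially applied sendAll call (the exact arguments depend on the library version), and the resulting function shared between all writer threads. Because the lock lives in the application and not on the socket, reading from or closing the socket in another thread is not blocked by it.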

A little more explanation: The lock on the socket only protects single system calls. If the send syscall signals a partial write, one has to wait until the socket becomes writable again. It is a design decision that the socket must not be locked while waiting. Otherwise the socket couldn't be closed or read from by a different thread.

lpeterse avatar Mar 29 '17 19:03 lpeterse

Thanks! That totally makes sense. Now that you point that out, I'm not sure why I even expected that sendAll would be thread-safe. I'll try to PR some documentation warning about this.

andrewthad avatar Mar 29 '17 20:03 andrewthad