
Flaky tests on OSX

Open danburkert opened this issue 9 years ago • 17 comments

Many of the server.rs tests will fail eventually if run in a loop on OSX. The problem manifests as failed calls to stream_shutdown or client_connected. The test may also hang indefinitely, which is a side effect of how we are currently testing for shutdown in stream_shutdown.

It's unclear at this point whether this is due to misuse of the MIO APIs in server.rs, because of a bug in MIO, or even kqueue itself.

Potentially incomplete list of flaky tests:

  • test_invalid_client_message
  • test_invalid_accept
  • test_unreachable_peer_reconnect
  • test_peer_accept

danburkert avatar Jul 16 '15 03:07 danburkert

same for peer_connected:

server::tests::test_invalid_peer_message:

thread 'server::tests::test_invalid_peer_message' panicked at 'assertion failed: peer_connected(&server, peer_id)', src/server.rs:742

Unfortunately calling this "flaky" is currently an understatement. I'm hoping that's just on my machine, but I maybe get a green run 5% of the time now.

Disclaimer: I don't really know what I'm doing. But I poked a little bit and this bit of code seems suspicious:

    /// Reconnects to the given peer ID and sends the preamble, advertising the
    /// given local address to the peer.
    pub fn reconnect_peer(&mut self, id: ServerId, local_addr: &SocketAddr) -> Result<()> {
        scoped_assert!(self.kind.is_peer());
        self.stream = Some(MessageStream::new(try!(TcpStream::connect(&self.addr)),
                                              ReaderOptions::new()));
        scoped_trace!("{:?}: reconnected to {:?} with {:?}", self, self.addr, self.stream);
        try!(self.send_message(messages::server_connection_preamble(id, local_addr)));
        Ok(())
    }

It opens a new connection and then immediately calls send_message, which calls write on the new stream. I have this toy code in my server.rs (pardon my french):

    #[test]
    fn test_shit() {
        use std::io::Write;
        use std::net::TcpListener;
        use std::thread;
        use std::time::Duration;
        use mio::tcp::TcpStream;
        setup_test!("shit");

        let peer_listener = TcpListener::bind("127.0.0.1:53840").unwrap();
        let addr = peer_listener.local_addr().unwrap().clone();
        let child = thread::spawn(move || {
            let (mut stream_a, _) = peer_listener.accept().unwrap();
            scoped_debug!("accepted");
            stream_a.write(b"hi").unwrap();
            stream_a.flush().unwrap();
            scoped_debug!("wrote");
        });
        thread::sleep(Duration::new(1, 0));
        scoped_debug!("connecting");
        let mut stream = TcpStream::connect(&addr).unwrap();
        //thread::sleep(Duration::new(1, 0)); // the magic sleep
        scoped_debug!("writing");
        stream.write(b"foo bar baz").unwrap();
        scoped_debug!("wrote");
    }

Without the "magic" sleep,

---- server::tests::test_shit stdout ----
    thread 'server::tests::test_shit' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 57, message: "Socket is not connected" } }', ../src/libcore/result.rs:688

With the magic sleep, it passes. Err 57 is also what I'm seeing in the failing tests above. I'm thinking this may be a simple case of asynchronous connect() having to wait for an event of some sort (or take the error as a hint to queue the message and retry later). capnp-nonblock has an outbound queue, but a cursory check suggests that it just calls write and bails on that error. @danburkert?

tbg avatar Dec 13 '15 09:12 tbg

Ah interesting. Yes, part of my change with capnp-nonblock introduced more aggressive writing behavior. Before when we wrote a message to a connection with an empty write queue, we would register interest in writable events on the socket and wait for the interest to fire. capnp-nonblock instead immediately starts writing the message if the write-queue is empty, under the assumption that the connection almost certainly has write buffer space available, and it's wasteful to wait for the callback (latency-wise). I don't have anything specific to point to, but I'm pretty sure this is common behavior on epoll systems. Perhaps this isn't kosher on kqueue, at least when the connection is new.
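(For readers skimming the thread: a minimal toy sketch of the aggressive-write strategy described above. Names and signatures here are hypothetical, not the actual capnp-nonblock code.)

```rust
use std::collections::VecDeque;
use std::io::{self, Write};

/// Sketch of "aggressive write": if the outbound queue is empty,
/// optimistically write the message right away; only queue the unsent
/// portion and fall back to waiting for a writable event when the kernel
/// send buffer has no room.
///
/// Returns Ok(true) if the message was fully written, Ok(false) if the
/// caller should register writable interest with the event loop.
fn aggressive_send<W: Write>(
    stream: &mut W,
    queue: &mut VecDeque<Vec<u8>>,
    msg: Vec<u8>,
) -> io::Result<bool> {
    if queue.is_empty() {
        match stream.write(&msg) {
            // Whole message fit in the send buffer: no event-loop round trip.
            Ok(n) if n == msg.len() => return Ok(true),
            // Partial write: queue the unsent tail, wait for writable.
            Ok(n) => queue.push_back(msg[n..].to_vec()),
            // Send buffer full: queue the whole message, wait for writable.
            Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => queue.push_back(msg),
            Err(e) => return Err(e),
        }
    } else {
        // Earlier messages are still pending; preserve ordering.
        queue.push_back(msg);
    }
    Ok(false)
}
```

The open question in this thread is what happens in the `Err` arm when the socket is freshly connected on OS X: instead of `WouldBlock`, the write fails with errno 57.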

danburkert avatar Dec 13 '15 20:12 danburkert

This is probably easiest to diagnose at the capnp-nonblock level, it should be possible to create a unit test there.

danburkert avatar Dec 13 '15 20:12 danburkert

Err, well, maybe not. capnp-nonblock doesn't have any MIO-based tests, but it does have a MIO example. Maybe we can recreate this with that simpler example.

danburkert avatar Dec 13 '15 20:12 danburkert

Yeah, capnp-nonblock is probably where this should live in the end. For now, this is already super condensed, though:

    let mut stream = TcpStream::connect(&addr).unwrap();
    stream.write(b"foo bar baz").unwrap();

and this is what capnp-nonblock will do (or at least, once we understand the above, we should be able to sort the rest out and write a proper repro there).

It just seems wrong to me to immediately start writing to a socket that has O_NONBLOCK set (at least that's what this example indicates). What would be the expected behavior there? EAGAIN or EWOULDBLOCK? Maybe the right thing to do is to keep state in capnp-nonblock for the first time it becomes writable, and to queue before that happens. Are we even hitting kqueue at the point the above snippet fails?
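The "queue before the first writable event" idea could look roughly like this (a toy sketch with a `Vec` standing in for the socket; `Conn` and its fields are hypothetical names, not capnp-nonblock's):

```rust
/// Sketch: buffer all outgoing messages until the poller has reported the
/// socket writable at least once, i.e. until the nonblocking connect has
/// actually completed.
struct Conn {
    connected: bool,     // set on the first writable event from the poller
    queue: Vec<Vec<u8>>, // messages buffered until then
    sent: Vec<u8>,       // stand-in for the socket, for illustration only
}

impl Conn {
    fn new() -> Conn {
        Conn { connected: false, queue: Vec::new(), sent: Vec::new() }
    }

    fn send(&mut self, msg: Vec<u8>) {
        // Before the first writable event, always queue.
        self.queue.push(msg);
        if self.connected {
            self.flush();
        }
    }

    /// Called by the event loop when the socket becomes writable.
    fn on_writable(&mut self) {
        self.connected = true;
        self.flush();
    }

    fn flush(&mut self) {
        for m in self.queue.drain(..) {
            self.sent.extend_from_slice(&m);
        }
    }
}
```

This trades the latency win of the aggressive write for never touching the socket before the kernel says the connect has finished.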

Some more stuff that doesn't matter:

connect goes down to nix::connect(io.as_raw_fd(), addr) via the following chain (so the socket will have nix::SOCK_NONBLOCK | nix::SOCK_CLOEXEC, no surprises there).

    fn new(family: nix::AddressFamily) -> io::Result<TcpSocket> {
        net::socket(family, nix::SockType::Stream, true)
            .map(|fd| From::from(Io::from_raw_fd(fd)))
    }

    pub fn socket(family: nix::AddressFamily, ty: nix::SockType, nonblock: bool) -> io::Result<RawFd> {
        let opts = if nonblock {
            nix::SOCK_NONBLOCK | nix::SOCK_CLOEXEC
        } else {
            nix::SOCK_CLOEXEC
        };

        nix::socket(family, ty, opts)
            .map_err(super::from_nix_error)
    }

    pub fn connect(fd: RawFd, addr: &SockAddr) -> Result<()> {
        let res = unsafe {
            let (ptr, len) = addr.as_ffi_pair();
            ffi::connect(fd, ptr, len)
        };

        from_ffi(res) // converts error int to Result<()>
    }

tbg avatar Dec 14 '15 02:12 tbg

It just seems wrong to me to immediately start writing to a socket that has O_NONBLOCK set (at least that's what this example indicates). What would be the expected behavior there? EAGAIN or EWOULDBLOCK?

The thinking here is that the socket has a write buffer, and only once that write buffer is full should it return EWOULDBLOCK. So in the normal case you can write to the socket and start sending data immediately. The whole message may not fit in the socket buffer (EWOULDBLOCK), in which case the connection will register an interest in writable and continue sending at a later time. This basically cuts out a potential spin of the event loop, thus cutting a small bit of latency.

Maybe the right thing to do is to keep state in capnp-nonblock for the first time it becomes writable, and to queue before that happens.

That's a good workaround if it turns out to not be possible to do the above, but my understanding was that it was a common thing to do with MIO.

Are we even hitting kqueue at the point the above snippet fails?

No, I think the socket has not yet been registered with kqueue. That should be OK, though. kqueue is just responsible for alerting when the socket is readable or writable; you can still call read/write on it at other times, you just risk getting EWOULDBLOCK immediately.

danburkert avatar Dec 14 '15 03:12 danburkert

OK it definitely seems like this is invalid on OS X. I changed one of the MIO unit tests to immediately write to a socket, and it now fails on OS X (but still passes on Linux). I don't really want to give up the aggressive write optimization, so we need to have another way to know when the socket is connected.

danburkert avatar Dec 14 '15 03:12 danburkert

Alternatively, capnp-nonblock can swallow the not-connected error when doing the aggressive write. This is probably the most foolproof way, just need to make sure it won't swallow real errors.
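A sketch of what that filter might look like (hypothetical helper, not the actual fix; ENOTCONN is errno 57 on OS X, which matches the failures above):

```rust
use std::io;

/// Sketch: classify write errors from the optimistic first write. "Would
/// block" and "socket is not connected" (ENOTCONN, errno 57 on OS X) are
/// treated as retriable -- the message stays queued and the event loop
/// retries once the socket is registered -- while anything else is a real
/// error that should surface.
fn is_retriable_write_error(e: &io::Error) -> bool {
    matches!(
        e.kind(),
        io::ErrorKind::WouldBlock | io::ErrorKind::NotConnected
    )
}
```

The hard part, as noted above, is being sure this never masks a genuine connection failure; the hypothesis is that a real failure still surfaces as an error event once the socket is registered with the event loop.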

danburkert avatar Dec 14 '15 03:12 danburkert

just brought this up on carllerche/mio#307.

danburkert avatar Dec 14 '15 03:12 danburkert

wouldn't that just swallow the real error of not being able to connect? In that case I would expect to get 57 back indefinitely. Curious what the mio guys will say.

tbg avatar Dec 14 '15 03:12 tbg

@tschottdorf I'm proposing the error only be swallowed when doing the aggressive write. At that point the connection would be registered with the event loop. If it's a real connection issue it will get an error event at that point. At least that's my hypothesis, I'll test it out.

danburkert avatar Dec 14 '15 03:12 danburkert

Just pushed a new version of capnp-nonblock that seems to have fixed the issue. Still getting random errors on OS X, but I think it's back to the original issue.

danburkert avatar Dec 14 '15 04:12 danburkert

For future reference: https://github.com/danburkert/capnp-nonblock/commit/47343ce3b2cc5128a7bf2e738a1cd24b3819131d

danburkert avatar Dec 14 '15 04:12 danburkert

:+1:. Are you pushing a new version & updating this crate?

tbg avatar Dec 14 '15 04:12 tbg

Yep, capnp-nonblock has been updated to 0.2.1, it should get pulled in automatically by raft once you cargo update.

danburkert avatar Dec 14 '15 04:12 danburkert

works like a charm (and by that I mean, it's back to the expected flakiness).

tbg avatar Dec 14 '15 04:12 tbg

:)

danburkert avatar Dec 14 '15 04:12 danburkert