old-raft-rs
Flaky tests on OSX
Many of the server.rs tests will fail eventually if run in a loop on OSX. The problem manifests as failed calls to `stream_shutdown` or `client_connected`.

The test may also hang indefinitely, which is a side effect of how we are currently testing for shutdown in `stream_shutdown`.
It's unclear at this point whether this is due to misuse of the MIO APIs in server.rs, because of a bug in MIO, or even kqueue itself.
Potentially incomplete list of flaky tests:

- `test_invalid_client_message`
- `test_invalid_accept`
- `test_unreachable_peer_reconnect`
- `test_peer_accept`

Same for `peer_connected`, in `server::tests::test_invalid_peer_message`:

```
thread 'server::tests::test_invalid_peer_message' panicked at 'assertion failed: peer_connected(&server, peer_id)', src/server.rs:742
```
Unfortunately, calling this "flaky" is currently an understatement. I'm hoping that's just on my machine, but I get a green run maybe 5% of the time now.
Disclaimer: I don't really know what I'm doing. But I poked a little bit and this bit of code seems suspicious:
```rust
/// Reconnects to the given peer ID and sends the preamble, advertising the
/// given local address to the peer.
pub fn reconnect_peer(&mut self, id: ServerId, local_addr: &SocketAddr) -> Result<()> {
    scoped_assert!(self.kind.is_peer());
    self.stream = Some(MessageStream::new(try!(TcpStream::connect(&self.addr)),
                                          ReaderOptions::new()));
    scoped_trace!("{:?}: reconnected to {:?} with {:?}", self, self.addr, self.stream);
    try!(self.send_message(messages::server_connection_preamble(id, local_addr)));
    Ok(())
}
```
It opens a new connection and then immediately calls `send_message`, which calls `write` on the new stream. I have this toy code in my server.rs (pardon my French):
```rust
#[test]
fn test_shit() {
    use std::thread;
    use mio::tcp::TcpStream;
    setup_test!("shit");
    let peer_listener = TcpListener::bind("127.0.0.1:53840").unwrap();
    let addr = peer_listener.local_addr().unwrap().clone();
    let child = thread::spawn(move || {
        let (mut stream_a, _) = peer_listener.accept().unwrap();
        scoped_debug!("accepted");
        stream_a.write(b"hi").unwrap();
        stream_a.flush().unwrap();
        scoped_debug!("wrote");
    });
    thread::sleep(Duration::new(1, 0));
    scoped_debug!("connecting");
    let mut stream = TcpStream::connect(&addr).unwrap();
    //thread::sleep(Duration::new(1, 0)); // the magic sleep
    scoped_debug!("writing");
    stream.write(b"foo bar baz").unwrap();
    scoped_debug!("wrote");
    child.join().unwrap(); // make sure the accept side ran
}
```
Without the "magic" sleep:

```
---- server::tests::test_shit stdout ----
thread 'server::tests::test_shit' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 57, message: "Socket is not connected" } }', ../src/libcore/result.rs:688
```
With the magic sleep, it passes. Err 57 is also what I'm seeing in the failing tests above.
I'm thinking this may be a simple case of asynchronous `connect()` having to wait for an event of some sort (or take the error as a hint to queue the message and retry later). capnp-nonblock has an outbound queue, but a cursory check suggests that it just calls `write` and bails on that error. @danburkert?
Ah interesting. Yes, part of my change with capnp-nonblock introduced more aggressive writing behavior. Before, when we wrote a message to a connection with an empty write queue, we would register interest in writable events on the socket and wait for the interest to fire. capnp-nonblock instead immediately starts writing the message if the write queue is empty, under the assumption that the connection almost certainly has write buffer space available, and it's wasteful (latency-wise) to wait for the callback. I don't have anything specific to point to, but I'm pretty sure this is common behavior on epoll systems. Perhaps this isn't kosher on kqueue, at least when the connection is new.
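To make that concrete, here is a minimal sketch of the two strategies; `Connection` and `register_writable` are illustrative stand-ins, not capnp-nonblock's actual types:

```rust
use std::collections::VecDeque;
use std::io::{self, Write};

// Illustrative only: not capnp-nonblock's real API.
struct Connection<W: Write> {
    stream: W,
    queue: VecDeque<Vec<u8>>,
}

impl<W: Write> Connection<W> {
    // Aggressive path: with an empty queue, write immediately and only fall
    // back to queueing (plus a writable-interest registration) when the
    // kernel pushes back.
    fn send(&mut self, msg: Vec<u8>) -> io::Result<()> {
        if self.queue.is_empty() {
            match self.stream.write(&msg) {
                // Everything fit into the socket's write buffer; done.
                Ok(n) if n == msg.len() => return Ok(()),
                // Partial write: queue the unsent tail.
                Ok(n) => self.queue.push_back(msg[n..].to_vec()),
                // Buffer full: queue the whole message.
                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                    self.queue.push_back(msg)
                }
                Err(e) => return Err(e),
            }
        } else {
            self.queue.push_back(msg);
        }
        // Conservative path: wait for the event loop to say the socket is
        // writable before draining the queue.
        self.register_writable()
    }

    fn register_writable(&mut self) -> io::Result<()> {
        Ok(()) // placeholder for the event-loop registration
    }
}
```

Before the change, every send took the wait-for-writable path; after it, the first write goes out immediately, and that first write is exactly the one that trips over the half-connected socket on OS X.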
This is probably easiest to diagnose at the capnp-nonblock level; it should be possible to create a unit test there.

Err, well, maybe not: capnp-nonblock doesn't have any MIO-based tests, but it does have a MIO example. Maybe we can recreate this with that simpler example.
Yeah, capnp-nonblock is probably where this should live in the end. For now, this is already super condensed, though:

```rust
let mut stream = TcpStream::connect(&addr).unwrap();
stream.write(b"foo bar baz").unwrap();
```

and this is what capnp-nonblock will do (or at least, once we understand the above, we should be able to sort the rest out and write a proper repro there).
It just seems wrong to me to immediately start writing to a socket that has `O_NONBLOCK` set (at least that's what this example indicates). What would be the expected behavior there? `EAGAIN` or `EWOULDBLOCK`? Maybe the right thing to do is to keep state in capnp-nonblock for the first time it becomes writable, and to queue before that happens. Are we even hitting kqueue at the point the above snippet fails?
Some more stuff that doesn't matter: `connect` goes down to `nix::connect(io.as_raw_fd(), addr)` via the following chain (so the socket will have `nix::SOCK_NONBLOCK | nix::SOCK_CLOEXEC`, no surprises there).
```rust
fn new(family: nix::AddressFamily) -> io::Result<TcpSocket> {
    net::socket(family, nix::SockType::Stream, true)
        .map(|fd| From::from(Io::from_raw_fd(fd)))
}

pub fn socket(family: nix::AddressFamily, ty: nix::SockType, nonblock: bool) -> io::Result<RawFd> {
    let opts = if nonblock {
        nix::SOCK_NONBLOCK | nix::SOCK_CLOEXEC
    } else {
        nix::SOCK_CLOEXEC
    };
    nix::socket(family, ty, opts)
        .map_err(super::from_nix_error)
}
```
```rust
pub fn connect(fd: RawFd, addr: &SockAddr) -> Result<()> {
    let res = unsafe {
        let (ptr, len) = addr.as_ffi_pair();
        ffi::connect(fd, ptr, len)
    };
    from_ffi(res) // converts error int to Result<()>
}
```
> It just seems wrong to me to immediately start writing to a socket that has `O_NONBLOCK` set (at least that's what this example indicates). What would be the expected behavior there? `EAGAIN` or `EWOULDBLOCK`?
The thinking here is that the socket has a write buffer, and only once that write buffer is full should it return `EWOULDBLOCK`. So in the normal case you can write to the socket and start sending data immediately. The whole message may not fit in the socket buffer (`EWOULDBLOCK`), in which case the connection will register an interest in writable and continue sending at a later time. This basically cuts out a potential spin of the event loop, shaving off a small bit of latency.
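In code, the drain side of that scheme looks roughly like this (a sketch, not capnp-nonblock's actual implementation):

```rust
use std::collections::VecDeque;
use std::io::{self, Write};

// On a writable event, keep flushing queued bytes until the queue is empty
// or the socket buffer fills up again.
fn on_writable<W: Write>(stream: &mut W, queue: &mut VecDeque<Vec<u8>>) -> io::Result<bool> {
    while let Some(buf) = queue.pop_front() {
        match stream.write(&buf) {
            // Whole buffer went out; move on to the next queued message.
            Ok(n) if n == buf.len() => continue,
            // Partial write: keep the unsent tail at the front of the queue
            // and loop; the next write will most likely hit EWOULDBLOCK.
            Ok(n) => queue.push_front(buf[n..].to_vec()),
            // Socket buffer is full again: stay registered for writable and
            // resume on the next event.
            Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                queue.push_front(buf);
                return Ok(false); // not yet drained
            }
            Err(e) => return Err(e),
        }
    }
    Ok(true) // drained; caller can drop its writable interest
}
```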
> Maybe the right thing to do is to keep state in capnp-nonblock for the first time it becomes writable, and to queue before that happens.
That's a good workaround if it turns out not to be possible to do the above, but my understanding was that this is a common thing to do with MIO.
> Are we even hitting kqueue at the point the above snippet fails?
No, I think the socket has not yet been registered with kqueue. That should be OK, though: kqueue is just responsible for alerting when the socket is readable or writable; you can still call read/write on it at other times, you just risk getting `EWOULDBLOCK` immediately.
OK it definitely seems like this is invalid on OS X. I changed one of the MIO unit tests to immediately write to a socket, and it now fails on OS X (but still passes on Linux). I don't really want to give up the aggressive write optimization, so we need to have another way to know when the socket is connected.
Alternatively, capnp-nonblock can swallow the not-connected error when doing the aggressive write. This is probably the most foolproof way; we just need to make sure it won't swallow real errors.
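Roughly like this (a sketch of the proposal, not the actual fix; std maps `ENOTCONN`, errno 57 on OS X, to `io::ErrorKind::NotConnected`):

```rust
use std::io::{self, Write};

// Treat "not connected" on the optimistic first write like WouldBlock:
// report zero bytes written so the message stays queued, and let the event
// loop surface a genuine connect failure later.
fn aggressive_write<W: Write>(stream: &mut W, buf: &[u8]) -> io::Result<usize> {
    match stream.write(buf) {
        Ok(n) => Ok(n),
        // Socket buffer full: nothing sent, retry on the next writable event.
        Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => Ok(0),
        // connect() hasn't completed yet (ENOTCONN): swallow it for now.
        Err(ref e) if e.kind() == io::ErrorKind::NotConnected => Ok(0),
        Err(e) => Err(e),
    }
}
```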
Just brought this up on carllerche/mio#307.
Wouldn't that just swallow the real error of not being able to connect? In that case I would expect to get 57 back indefinitely. Curious what the mio guys will say.
@tschottdorf I'm proposing the error only be swallowed when doing the aggressive write. At that point the connection would be registered with the event loop. If it's a real connection issue it will get an error event at that point. At least that's my hypothesis; I'll test it out.
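If that hypothesis holds, the handler side would look roughly like this (a sketch against the mio 0.5-era `Handler` API; treat the details as assumptions, not the actual raft code):

```rust
extern crate mio;

use mio::{EventLoop, EventSet, Handler, Token};

struct Server;

impl Handler for Server {
    type Timeout = ();
    type Message = ();

    fn ready(&mut self, _event_loop: &mut EventLoop<Self>, _token: Token, events: EventSet) {
        // A connect that genuinely failed shows up here as an error/hup
        // event once the socket is registered, even though the earlier
        // aggressive write swallowed ENOTCONN.
        if events.is_error() || events.is_hup() {
            // Tear down the connection / schedule a reconnect.
            return;
        }
        if events.is_writable() {
            // Connection established: safe to drain the write queue now.
        }
    }
}
```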
Just pushed a new version of capnp-nonblock that seems to have fixed the issue. Still getting random errors on OS X, but I think it's back to the original issue.

For future reference: https://github.com/danburkert/capnp-nonblock/commit/47343ce3b2cc5128a7bf2e738a1cd24b3819131d
:+1:. Are you pushing a new version & updating this crate?
Yep, capnp-nonblock has been updated to 0.2.1; it should get pulled in automatically by raft once you `cargo update`.
Works like a charm (and by that I mean it's back to the expected flakiness).
:)