`EPOLLERR` does not wake up `AsyncFd::readable()`
Version
```text
└── tokio v1.15.0
└── tokio-macros v1.7.0 (proc-macro)
├── tokio v1.15.0 (*)
```
Platform
```text
Linux user 5.15.5-zen1-1-zen #1 ZEN SMP PREEMPT Thu, 25 Nov 2021 22:09:35 +0000 x86_64 GNU/Linux
```
Description
The Linux kernel raises `EPOLLERR` when a FUSE session's connection is severed (usually by umount/fusermount). A non-blocking FUSE daemon can post a read interest via epoll to receive request packets and eventually shut down on `EPOLLERR`. However, the Tokio stack fails to handle this condition properly. A real example would involve the convoluted dance for unprivileged fusermount, but it should be exactly the same situation as the following:
```rust
use tokio::io::{unix::AsyncFd, Interest};
use nix::{
    fcntl::{fcntl, open, FcntlArg, OFlag},
    mount::{mount, MsFlags},
    sys::stat::Mode,
    unistd::{read, write},
};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let fuse = open("/dev/fuse", OFlag::O_RDWR, Mode::empty()).unwrap();
    let fcntl_arg = FcntlArg::F_SETFL(OFlag::O_NONBLOCK | OFlag::O_LARGEFILE);
    fcntl(fuse, fcntl_arg).unwrap();

    let options = format!(
        "fd={},user_id=0,group_id=0,allow_other,rootmode=40000",
        fuse
    );
    mount(
        Some("test"),
        "/mnt",
        Some("fuse"),
        MsFlags::empty(),
        Some(options.as_str()),
    )
    .unwrap();

    let async_fd = AsyncFd::with_interest(fuse, Interest::READABLE).unwrap();
    let mut buffer = [0; 8192];

    loop {
        let result = async_fd
            .readable()
            .await
            .unwrap()
            .try_io(|_| read(fuse, &mut buffer).map_err(Into::into));

        match result {
            Err(_) => continue,
            Ok(result) => {
                result.unwrap();
            }
        }

        let opcode = u32::from_ne_bytes((&buffer[4..8]).try_into().unwrap());
        println!("opcode: {}", opcode);

        let out = match opcode {
            26 => OUT_INIT,  // handshake
            38 => break,     // destroy
            _ => OUT_ENOSYS, // return ENOSYS for everything else
        };

        (&mut buffer[..8]).copy_from_slice(&out[..8]);
        (&mut buffer[16..out.len()]).copy_from_slice(&out[16..]);
        write(fuse, &buffer[..out.len()]).unwrap();
    }
}

const OUT_INIT: &[u8] = &[
    80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 176, 31, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
];

const OUT_ENOSYS: &[u8] = &[16, 0, 0, 0, 218, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0];
```
The previous program mounts a filesystem and then responds with `-ENOSYS` to all requests except init and destroy. Once you umount the filesystem, `EPOLLERR` happens (which corresponds to a destroy request) and the daemon hangs on `readable().await`. This is the strace log upon umount:
```text
epoll_wait(3, [{events=EPOLLIN, data={u32=0, u64=0}}], 1024, -1) = 1
write(4, "\1\0\0\0\0\0\0\0", 8) = 8
read(6, "8\0\0\0\3\0\0\0\10\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 56
write(1, "opcode: 3\n", 10opcode: 3
) = 10
write(6, "\20\0\0\0\332\377\377\377\10\0\0\0\0\0\0\0", 16) = 16
read(6, 0x7ffdc43d6e55, 8192) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{events=EPOLLIN, data={u32=2147483648, u64=2147483648}}], 1024, -1) = 1
epoll_wait(3, [{events=EPOLLERR, data={u32=0, u64=0}}], 1024, -1) = 1
epoll_wait(3, <deadlocks>
```
This issue seems to be related to #2218. Note that this other case with `writable()` works as expected (main exits immediately), so I think it has something to do with read interests and not with the general handling of `EPOLLERR`:
```rust
use tokio::io::unix::AsyncFd;
use nix::unistd::{pipe, write, close};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (rx, tx) = pipe().unwrap();
    let async_tx = AsyncFd::new(tx).unwrap();

    // Fill the kernel buffer so that epoll_wait() brings up EPOLLERR without EPOLLOUT
    write(tx, &[0; 65536]).unwrap();
    close(rx).unwrap();

    // Properly wakes up on EPOLLERR
    async_tx.writable().await.unwrap().retain_ready();
}
```
Thoughts? @Thomasdezeeuw
@Darksonn my first instinct would be that Tokio isn't checking `Event::is_error` when handling read events, but I don't really know where the related code is, so it's hard for me to say. If you can point to the related code I can look into it further.
Looking at `Ready::from_mio`, it doesn't seem to consider `mio::event::Event::is_error`. So it already goes wrong in the call here: https://github.com/tokio-rs/tokio/blob/78e0f0b42a4f7a50f3986f576703e5a3cb473b79/tokio/src/io/driver/mod.rs#L174
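For illustration, here is a simplified, self-contained sketch of that kind of mapping with a hypothetical `is_error` check folded in. The `Ready` type below is a stand-in, and this is not Tokio's actual driver code or the approach any PR settled on:

```rust
// Simplified stand-in for Tokio's readiness bitset (not the real type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ready(u8);

impl Ready {
    const EMPTY: Ready = Ready(0);
    const READABLE: Ready = Ready(0b0001);
    const WRITABLE: Ready = Ready(0b0010);
    const READ_CLOSED: Ready = Ready(0b0100);
    const WRITE_CLOSED: Ready = Ready(0b1000);

    fn from_mio(event: &mio::event::Event) -> Ready {
        let mut ready = Ready::EMPTY;
        if event.is_readable() {
            ready.0 |= Ready::READABLE.0;
        }
        if event.is_writable() {
            ready.0 |= Ready::WRITABLE.0;
        }
        if event.is_read_closed() {
            ready.0 |= Ready::READ_CLOSED.0;
        }
        if event.is_write_closed() {
            ready.0 |= Ready::WRITE_CLOSED.0;
        }
        // Hypothetical addition discussed in this thread: surface an error
        // condition by also waking pending readers and writers.
        if event.is_error() {
            ready.0 |= Ready::READABLE.0 | Ready::WRITABLE.0;
        }
        ready
    }
}
```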
What should the fix do? Should `async_fd.readable().await` return an error whenever `EPOLLERR` happens? Or should it return success (and the user will get the error when attempting a read)?

The first one makes more sense to me, but I want to hear what you think before working on an implementation.
> What should the fix do? Should `async_fd.readable().await` return an error whenever `EPOLLERR` happens? Or should it return success (and the user will get the error when attempting a read)? The first one makes more sense to me, but I want to hear what you think before working on an implementation.
If `EPOLLERR` is returned, something went wrong, so the most logical path would be to return the error to the user. Getting the error can be a bit tricky though. For sockets we have `SO_ERROR`, but I'm not sure an fd-generic version of that exists.

If the error can't be retrieved, we could pretend the fd is readable and let the user perform a read operation; that should also return the error. However, I'm not sure that's guaranteed. For example, I remember `fsync` losing errors in the past (on Linux), but I'm a bit fuzzy on it.
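As a hedged illustration of the socket case (not something the thread settles on), a caller holding a socket fd could recover the pending error via `SO_ERROR` with the same RawFd-era nix API the reproduction above uses:

```rust
use std::os::unix::io::RawFd;
use nix::sys::socket::{getsockopt, sockopt::SocketError};

/// Returns the pending error code for a socket (0 means no error pending).
/// Reading SO_ERROR also clears the pending error on the socket.
fn pending_socket_error(fd: RawFd) -> nix::Result<i32> {
    getsockopt(fd, SocketError)
}
```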
> If `EPOLLERR` is returned, something went wrong, so the most logical path would be to return the error to the user. Getting the error can be a bit tricky though. For sockets we have `SO_ERROR`, but I'm not sure an fd-generic version of that exists.
Could we not return an error to the user that basically just says "`EPOLLERR` happened", and let them retrieve the actual underlying error in whatever way they find appropriate? So, for instance, they could decide to read `SO_ERROR` if it's sockets they are dealing with, or they would attempt a `read` if they don't have better options.
> What should the fix do? Should `async_fd.readable().await` return an error whenever `EPOLLERR` happens? Or should it return success (and the user will get the error when attempting a read)? The first one makes more sense to me, but I want to hear what you think before working on an implementation.
The current behavior of `async_fd.writable().await` is to succeed (try the last test in my original post; it `unwrap()`s without panicking). So I think it would be quite inconsistent for `readable().await` not to do exactly the same. On the other hand, this behavior does not seem to be documented at all, so the more correct way of returning an "`EPOLLERR` happened" error is feasible as long as the current `writable()` behavior is treated as another bug.
> Could we not return an error to the user that basically just says "`EPOLLERR` happened", and let them retrieve the actual underlying error in whatever way they find appropriate? So, for instance, they could decide to read `SO_ERROR` if it's sockets they are dealing with, or they would attempt a `read` if they don't have better options.
Possibly, but then it would have to be mapped to `io::Error`, and the documentation would have to spell out that this error is not really an error itself (or at least not the source of the error).
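A minimal sketch of what such a mapping might look like, purely as an illustration of the idea being discussed (the message and choice of `ErrorKind` are assumptions, not an agreed-on design):

```rust
use std::io;

/// Hypothetical placeholder error signalling that EPOLLERR was reported.
/// It deliberately does not claim to know the underlying cause; the caller
/// is expected to query the fd (e.g. SO_ERROR, or a follow-up read) for that.
fn epollerr_reported() -> io::Error {
    io::Error::new(
        io::ErrorKind::Other,
        "EPOLLERR was reported for this file descriptor",
    )
}
```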
> The current behavior of `async_fd.writable().await` is to succeed (try the last test in my original post; it `unwrap()`s without panicking).
This can be traced down to:

https://github.com/tokio-rs/mio/blob/dca2134ef355b3c0d00e8e338e44e7d9ed63edac/src/sys/unix/selector/epoll.rs#L180-L187
https://github.com/tokio-rs/mio/blob/05009e4f60335fa00e9ea6a118595548afee0607/src/event/event.rs#L116-L123

An event with `EPOLLERR` is considered `is_write_closed`, which wakes up awaiters with a write interest. However, `EPOLLERR` does not affect `is_read_closed`.

Maybe, given this issue, it should? Or would that mask errors that users would want to handle explicitly? If so (which sounds reasonable), why does that same logic not apply to `is_write_closed`?
If `EPOLLERR` affected `is_read_closed` in mio, that would lead (I think, haven't checked) to @3442's `readable().await` being woken up, and then the next `read` would fail.
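To make that concrete, here is a hedged sketch of how the reproduction's loop could shut down once `readable().await` is woken and the subsequent `read` fails. The exact errno after unmount (commonly ENODEV for FUSE) is an assumption of this sketch, not something the thread confirms:

```rust
use std::os::unix::io::RawFd;
use tokio::io::unix::AsyncFd;
use nix::unistd::read;

// Serve requests until the FUSE connection is severed.
async fn serve(async_fd: &AsyncFd<RawFd>, fuse: RawFd) {
    let mut buffer = [0u8; 8192];
    loop {
        let mut guard = async_fd.readable().await.unwrap();
        match guard.try_io(|_| read(fuse, &mut buffer).map_err(Into::into)) {
            Err(_would_block) => continue, // spurious readiness, poll again
            Ok(Ok(_len)) => {
                // Dispatch the FUSE request exactly as main() does above.
            }
            Ok(Err(_err)) => break, // read failed: the session is gone, shut down
        }
    }
}
```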
> This can be traced down to:
>
> https://github.com/tokio-rs/mio/blob/dca2134ef355b3c0d00e8e338e44e7d9ed63edac/src/sys/unix/selector/epoll.rs#L180-L187
> https://github.com/tokio-rs/mio/blob/05009e4f60335fa00e9ea6a118595548afee0607/src/event/event.rs#L116-L123
>
> An event with `EPOLLERR` is considered `is_write_closed`, which wakes up awaiters with a write interest. However, `EPOLLERR` does not affect `is_read_closed`.
`EPOLLERR` also triggers `is_error`. It triggers `is_write_closed` because of the epoll manual, which states:

> Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.

See https://github.com/tokio-rs/mio/pull/1350.
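For readers following along, here is a rough, hedged approximation (not the actual mio source, which lives at the epoll.rs link above and differs in detail) of how those epoll bits relate to the accessors being discussed, using `libc` flag constants:

```rust
fn is_error(events: i32) -> bool {
    events & libc::EPOLLERR != 0
}

fn is_write_closed(events: i32) -> bool {
    // Per the epoll(7) note quoted above, EPOLLERR is also reported for the
    // write end of a pipe whose read end has been closed, so a hang-up or a
    // bare EPOLLERR is treated as "write closed".
    events & libc::EPOLLHUP != 0 || events == libc::EPOLLERR
}

fn is_read_closed(events: i32) -> bool {
    // EPOLLERR deliberately does not factor in here, which is why the read
    // awaiter in the FUSE reproduction is never woken.
    events & libc::EPOLLHUP != 0 || events & libc::EPOLLRDHUP != 0
}
```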
> Maybe, given this issue, it should? Or would that mask errors that users would want to handle explicitly? If so (which sounds reasonable), why does that same logic not apply to `is_write_closed`?
No, `is_read_closed` and `is_error` are two different things. Tokio should simply check `is_error`; mio is not to blame here.
> If `EPOLLERR` affected `is_read_closed` in mio, that would lead (I think, haven't checked) to @3442's `readable().await` being woken up, and then the next `read` would fail.
> `EPOLLERR` also triggers `is_error`. It triggers `is_write_closed` because of the epoll manual, which states:
>
> > Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.
Got it.
I'll close this as solved by #5781. But let me know if I've missed something.