tokio `EPOLLERR` does not wake up `AsyncFd::readable()`

Version

└── tokio v1.15.0
    └── tokio-macros v1.7.0 (proc-macro)
├── tokio v1.15.0 (*)

Platform Linux user 5.15.5-zen1-1-zen #1 ZEN SMP PREEMPT Thu, 25 Nov 2021 22:09:35 +0000 x86_64 GNU/Linux

Description The Linux kernel raises EPOLLERR when a FUSE session's connection is severed (usually by umount/fusermount). A non-blocking FUSE daemon can post a read interest via epoll to receive request packets and eventually shutdown on EPOLLERR. However, the Tokio stack fails to properly handle this condition. A real example would involve the convoluted dance for unprivileged fusermount, but it should be exactly the same situation as what follows.

use tokio::io::{unix::AsyncFd, Interest};

use nix::{
    fcntl::{fcntl, open, FcntlArg, OFlag},
    mount::{mount, MsFlags},
    sys::stat::Mode,
    unistd::{read, write},
};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let fuse = open("/dev/fuse", OFlag::O_RDWR, Mode::empty()).unwrap();
    let fcntl_arg = FcntlArg::F_SETFL(OFlag::O_NONBLOCK | OFlag::O_LARGEFILE);
    fcntl(fuse, fcntl_arg).unwrap();

    let options = format!(
        "fd={},user_id=0,group_id=0,allow_other,rootmode=40000",
        fuse
    );

    mount(
        Some("test"),
        "/mnt",
        Some("fuse"),
        MsFlags::empty(),
        Some(options.as_str()),
    )
    .unwrap();

    let async_fd = AsyncFd::with_interest(fuse, Interest::READABLE).unwrap();

    let mut buffer = [0; 8192];
    loop {
        let result = async_fd
            .readable()
            .await
            .unwrap()
            .try_io(|_| read(fuse, &mut buffer).map_err(Into::into));

        match result {
            Err(_) => continue,
            Ok(result) => {
                result.unwrap();
            }
        }

        let opcode = u32::from_ne_bytes((&buffer[4..8]).try_into().unwrap());
        println!("opcode: {}", opcode);

        let out = match opcode {
            26 => OUT_INIT,  // handshake
            38 => break,     // destroy
            _ => OUT_ENOSYS, // return ENOSYS for everything else
        };

        (&mut buffer[..8]).copy_from_slice(&out[..8]);
        (&mut buffer[16..out.len()]).copy_from_slice(&out[16..]);
        write(fuse, &buffer[..out.len()]).unwrap();
    }
}

const OUT_INIT: &[u8] = &[
    80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 176, 31, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
];

const OUT_ENOSYS: &[u8] = &[16, 0, 0, 0, 218, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0];

The previous program mounts and then responds with -ENOSYS to all requests except init and destroy. Once you umount the filesystem, EPOLLERR happens (which corresponds to a destroy request) and the daemon hangs on readable().await. This is the strace log upon umount:

epoll_wait(3, [{events=EPOLLIN, data={u32=0, u64=0}}], 1024, -1) = 1
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
read(6, "8\0\0\0\3\0\0\0\10\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 56
write(1, "opcode: 3\n", 10opcode: 3
)             = 10
write(6, "\20\0\0\0\332\377\377\377\10\0\0\0\0\0\0\0", 16) = 16
read(6, 0x7ffdc43d6e55, 8192)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{events=EPOLLIN, data={u32=2147483648, u64=2147483648}}], 1024, -1) = 1
epoll_wait(3, [{events=EPOLLERR, data={u32=0, u64=0}}], 1024, -1) = 1
epoll_wait(3, <deadlocks>

This issue seems to be related to #2218. Note that this other case with writable() works as expected (main exits immediately), so I think it has something to do with read interests and not general handling of EPOLLERR:

use tokio::io::unix::AsyncFd;
use nix::unistd::{pipe, write, close};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (rx, tx) = pipe().unwrap();
    let async_tx = AsyncFd::new(tx).unwrap();

    // Fill kernel buffer so that epoll_wait() brings up EPOLLERR without EPOLLOUT
    write(tx, &[0; 65536]).unwrap();
    
    close(rx).unwrap();

    // Properly wakes up on EPOLLERR
    async_tx.writable().await.unwrap().retain_ready();
}

Dec 27 '21 13:12 3442

Thoughts? @Thomasdezeeuw

Dec 27 '21 13:12 Darksonn

@Darksonn my first instinct would be that Tokio isn't checking Event::is_error when handling read events, but I don't really know where the related code is so it's hard for me to say. If you can point to the related code I can look into it further.

Dec 27 '21 14:12 Thomasdezeeuw

Well, it is getting the events here. The set_readiness call goes here.

Dec 27 '21 15:12 Darksonn

Looking at Ready::from_mio it doesn't seem to consider mio::Error::is_error. So it already goes wrong in the call here: https://github.com/tokio-rs/tokio/blob/78e0f0b42a4f7a50f3986f576703e5a3cb473b79/tokio/src/io/driver/mod.rs#L174

Dec 27 '21 16:12 Thomasdezeeuw

What should the fix do? Should async_fd.readable().await return an error whenever the EPOLLERR happens? Or should it return success (and the user will get the error when attempting a read)?

The first one makes more sense to me, but I want to hear what you think before working on an implementation.

Jan 04 '22 14:01 BraulioVM

What should the fix do? Should async_fd.readable().await return an error whenever the EPOLLERR happens? Or should it return success (and the user will get the error when attempting a read)?

The first one makes more sense to me, but I want to hear what you think before working on an implementation.

If EPOLLERR is returned went something went wrong, so the most logical path would be to return the error to the user. Getting the error can be a bit tricky though. For sockets we have SO_ERROR, but I'm not sure a (fd) generic version of that exists.

If the error can't be retrieved we could pretend the fd is readable and let the user perform a read operation, that should also return the error. However I'm not sure it's guaranteed, for example I remember fsync in the past loosing errors (on Linux) but I'm a bit fussy on it.

Jan 04 '22 14:01 Thomasdezeeuw

If EPOLLERR is returned went something went wrong, so the most logical path would be to return the error to the user. Getting the error can be a bit tricky though. For sockets we have SO_ERROR, but I'm not sure a (fd) generic version of that exists.

Could we not return an error to the user that basically just says "EPOLLERR happened", and let them retrieve the actual underlying error in the way the find appropriate? So, for instance, they could decide to read SO_ERROR if it's sockets they are dealing with. Or they would attempt a read if they don't have better options

Jan 04 '22 14:01 BraulioVM

What should the fix do? Should async_fd.readable().await return an error whenever the EPOLLERR happens? Or should it return success (and the user will get the error when attempting a read)?

The first one makes more sense to me, but I want to hear what you think before working on an implementation.

The current behavior of async_fd.writable().await is to succeed (try the last test in my original post, it unwrap()s without panic). So I think it would be quite inconsistent for readable().await to not do exactly the same. On the other hand, this behavior does not seem to be documented at all, so the more correct way of returning a "EPOLLERR happened" error is feasible as long as the current writable() behavior is treated as another bug.

Jan 04 '22 14:01 3442

Could we not return an error to the user that basically just says "EPOLLERR happened", and let them retrieve the actual underlying error in the way the find appropriate? So, for instance, they could decide to read SO_ERROR if it's sockets they are dealing with. Or they would attempt a read if they don't have better options

Possibly, but then it would have to be mapped to io::Error and the documentation spell out that itself is not really an error (or at least not the source of the error).

Jan 04 '22 14:01 Thomasdezeeuw

The current behavior of async_fd.writable().await is to succeed (try the last test in my original post, it unwrap()s without panic).

This can be traced down to https://github.com/tokio-rs/mio/blob/dca2134ef355b3c0d00e8e338e44e7d9ed63edac/src/sys/unix/selector/epoll.rs#L180-L187 https://github.com/tokio-rs/mio/blob/05009e4f60335fa00e9ea6a118595548afee0607/src/event/event.rs#L116-L123

An event with EPOLLERR is considered is_write_closed, which wakes up awaiters with a write interest. However, EPOLLERR does not affect is_read_closed.

Maybe, given this issue, it should? Or would that mask errors that users would want to handle explicitly? If so, which sounds reasonable, why does that same logic not apply to is_write_closed?

If EPOLLER affected is_read_closed in mio, that would lead (I think, haven't checked) to @3442's readable().await being woken up and then the next read would fail.

Jan 04 '22 15:01 BraulioVM

This can be traced down to https://github.com/tokio-rs/mio/blob/dca2134ef355b3c0d00e8e338e44e7d9ed63edac/src/sys/unix/selector/epoll.rs#L180-L187 https://github.com/tokio-rs/mio/blob/05009e4f60335fa00e9ea6a118595548afee0607/src/event/event.rs#L116-L123

An event with EPOLLERR is considered is_write_closed, which wakes up awaiters with a write interest. However, EPOLLERR does not affect is_read_closed.

EPOLLERR also triggers is_error. It trigger is_write_closed because of the epoll manual, which states:

Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.

See https://github.com/tokio-rs/mio/pull/1350.

Maybe, given this issue, it should? Or would that mask errors that users would want to handle explicitly? If so, which sounds reasonable, why does that same logic not apply to is_write_closed?

No is_read_closed and is_error are two different things. Tokio should simply check is_error, Mio is not to blame here.

If EPOLLER affected is_read_closed in mio, that would lead (I think, haven't checked) to @3442's readable().await being woken up and then the next read would fail.

Jan 04 '22 15:01 Thomasdezeeuw

EPOLLERR also triggers is_error. It trigger is_write_closed because of the epoll manual, which states:

Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.

Got it

Jan 04 '22 15:01 BraulioVM

I'll close this as solved by #5781. But let me know if I've missed something.

Nov 25 '23 12:11 Darksonn

tokio tokio copied to clipboard

`EPOLLERR` does not wake up `AsyncFd::readable()`

tokio
tokio copied to clipboard