glommio DmaFile returns EOPNOTSUP when reading (error code 95)

I'm sure I must be doing something wrong but I've been staring at this a few days and can't figure out what it might be. I'm consistently failing to read from DmaFile for some reason:

thread 'tests::basic' panicked at 'Failed to read what we just wrote: EnhancedIoError { source: Os { code: 95, kind: Uncategorized, message: "Operation not supported" }, op: "Reading", path: Some("./myfile2CyhlQ.test"), fd: Some(8) }', lib.rs:90:18

Here's the test:

    #[test]
    fn basic() {
        let normal_tmpfile = tempfile::Builder::new()
            .prefix("myfile")
            .suffix(".test")
            .tempfile_in(std::env::current_dir().unwrap())
            .unwrap();
        eprintln!("Running on {:?}", normal_tmpfile.path());

        let ex = glommio::LocalExecutorBuilder::new(glommio::Placement::Fixed(0))
            .make()
            .unwrap();

        ex.run(async move {
            let file = glommio::io::OpenOptions::new()
                .read(true)
                .write(true)
                .create(true)
                .dma_open(normal_tmpfile.path())
                .await
                .unwrap();

            let initial_read = file
                .read_at_aligned(0, 4096)
                .await
                .expect("Failed to read from empty file");
            assert_eq!(initial_read.len(), 0);

            let mut buf = file.alloc_dma_buffer(4096);
            eprintln!("Allocated {} bytes", buf.len());
            buf.as_bytes_mut().fill(0);
            eprintln!("Writing [0, {}]", buf.len());
            file.write_at(buf, 0)
                .await
                .expect("Failed to write [0, 4096)");

            eprintln!("Reading [0, 4096)");
            file.read_at(0, 4096)
                .await
                .expect("Failed to read what we just wrote");
        });
    }

Tested configurations: Kernels tested: 6.5.4-1-MANJARO 6.5.4-2-ck 6.1.55-1-MANJARO I've also tried pinning the test to P cores (13900K) in case that was some kind of issue

Filesystem	Logical Block Size	Physical Block Size	Test Passes
tmpfs	?	?	[x]
btrfs	4096	4096	[x]
ext4	4096	4096	[]
xfs	4096	4096	[]
xfs	512	512	[]
f2f	4096	4096	[]

Will try a different kernel version & update shortly if this turns out to be some kind of kernel bug with io_uring + O_DIRECT.

Sep 23 '23 22:09 vlovich

You mean that this passes with tmpfs and btrfs? Or did you mean to imply the opposite by mistake? I'm surprised this passes with tmpfs, tmpfs doesn't support O_DIRECT.

Sep 23 '23 23:09 glommer

Correct. It passes with tmpfs and btrfs and the read fails everywhere else (writes are fine & I've verified using hexdump & breaking before we attempt the read that the file is materialized correctly)

Sep 23 '23 23:09 vlovich

Ok. We manually disable TMPFS, so this is really just the file falling back to non-O_DIRECT. Code is:

        let pollable = if (fstype.0 as u64) == (libc::TMPFS_MAGIC as u64) {
            PollableStatus::NonPollable(DirectIo::Disabled)
        }

I have no experience with BTRFS, so I don't really know what they support.

I am very puzzled about the others, this seems like a run-of-the-mill write+read pattern with the most conservative buffer sizes possible.

Sep 23 '23 23:09 glommer

Is there any debugging tips you can share? I haven't figured out how to figure out why the kernel is responding in this way.

Sep 23 '23 23:09 vlovich

I'd try to write the equivalent in C and take glommio out of the picture. This will tell you right away if this is a kernel issue. Assuming it is not a kernel issue (always the last one I'd blame), I'd try to add code deep into the call stack, as close as possible to the underlying io_uring sqe preparation to make sure nothing is getting corrupted in the way.

Sep 23 '23 23:09 glommer

You don't happen to have a a basic C example handy do you? Is https://unixism.net/2020/04/io-uring-by-example-part-1-introduction/ a good starting point?

Sep 23 '23 23:09 vlovich

thinking again, there's an even more basic thing that we can do, which is run the unit test suite from liburing. Reading the tests is also a great way to see a lot of examples in action.

Sep 24 '23 02:09 glommer

Haven't tried the repro yet, but one thing I've narrowed it down to is this only happens when the file is created and written to. Then reads start failing. If I touch the file first then read after write works fine.

Sep 24 '23 18:09 vlovich

~~Tracking it down further, it has to be a write of 4096 to position 0 followed by an equivalent read. If I write 8192 for example, this issue doesn't repro. So I think this might be something subtle within the io_uring layer.~~

NVM. Must have been running the test code incorrectly. Read after write fails regardless of the positions/length as long as the file didn't exist prior to opening.

@glommer this doesn't repro for you?

Sep 25 '23 02:09 vlovich

Here's some test cases of initial write followed by a read. No pattern is immediately discernible for me. The last two test cases in particular seem fairly weird...

write pos	write len	read pos	read len	passed
0	4096	0	4096	N
0	8192	0	4096	N
0	8192	4096	4096	N
4096	8192	0	4096	Y
4096	8192	4096	4096	N
4096	8192	4096	8192	N
4096	8192	8192	4096	N
4096	4096	0	4096	Y
4096	4096	4096	4096	N
12288	4096	0	4096	Y
12288	4096	0	8192	Y
12288	4096	0	12288	Y
12288	4096	0	16384	Y
12288	4096	4096	12288	Y
12288	4096	8192	12288	Y
12288	4096	4096	4096	Y
12288	4096	8192	4096	Y
12288	4096	4096	8192	Y
12288	4096	8192	8192	Y
12288	4096	12288	4096	N
12288	4096	4096	12288	Y

Sep 25 '23 17:09 vlovich

I just tried this on my old box, and I can't reproduce on either xfs nor ext4. My kernel is pretty old, I don't update in a while, that's still 5.19

Sep 26 '23 00:09 glommer

https://gist.github.com/vlovich/9ab1bbb9b7c6081365c49b2ce35bf713 works. Not sure if that maps 1:1 with what's happening in glommio, but I have to imagine it's pretty close. @glommer does the C code look correct with respect to what glommio does internally?

The relevant nuances for repro may be:

NVME is configured with natural 4096 bytes (LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)). That being said, this repros on a device I have that only exports a 512 byte size so probably not this.
Block size for partition is 4096 (across all filesystems tested except obviously tmpfs)
Using LVM on top of LUKS.

All this being said, I can't imagine how any of the details matter here. @glommer any tips on what to start debugging within glommio?

Sep 26 '23 16:09 vlovich

take a look at src/sys/uring.rs, and inspect things right before they are put into their SQEs. For example, this is how the SQE gets prepped for a read:

                            sqe.prep_read(op.fd, buf.as_bytes_mut(), pos);

I am starting to think that maybe the buffers are not aligned somehow, and that's why we're getting this error? We should align buffers, and it's hard to believe we'd get ENOTSUPP instead of some other error code for that, but maybe?

If you can - I'd try a similar thing without LUKS, and without LVM (between the two, I'd suspect LUKS more than LVM). It's the most likely thing in that list to actually be causing an issue.

Sep 26 '23 21:09 glommer

I printed all the buffers used by sqe.prep_read and the pointer is always 4096 aligned.

Sep 28 '23 19:09 vlovich

Instrumented the kernel to get more insight. The line that's returning an error is this: https://elixir.bootlin.com/linux/v6.5.5/source/block/blk-core.c#L732

~~I did disable the read/write queues as per https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ / https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance which I'm assuming explains this. Not sure why btrfs is unimpacted as I think that partition also has it disabled.~~

Also unclear why it has to be a new file followed by a write but otherwise it works. Not sure why running a very simple io_uring C program directly worked...

EDIT: Even after enabling the work queues the error persists so I must be misunderstanding the exact failure path within the kernel.

Sep 28 '23 21:09 vlovich