Using direct I/O on Linux
Understanding that redb manages its own user space page cache, I used the following small diff
```diff
diff --git a/Cargo.toml b/Cargo.toml
index 649c21b..0819edd 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -20,6 +20,7 @@ pyo3-build-config = { version = "0.20.0", optional = true }
[dependencies]
log = {version = "0.4.17", optional = true }
pyo3 = {version = "0.20.0", features=["extension-module", "abi3-py37"], optional = true }
+rustix = { version = "0.38.34", features = ["fs"] }
[target.'cfg(unix)'.dependencies]
libc = "0.2.104"
diff --git a/benches/lmdb_benchmark.rs b/benches/lmdb_benchmark.rs
index 53b79b1..1cf2f88 100644
--- a/benches/lmdb_benchmark.rs
+++ b/benches/lmdb_benchmark.rs
@@ -297,7 +297,25 @@ fn main() {
benchmark(table)
};
- let lmdb_results = {
+ let redb_directio_results = {
+ let tmpfile: NamedTempFile = NamedTempFile::new_in(&tmpdir).unwrap();
+ use std::fs::OpenOptions;
+ use std::os::unix::fs::OpenOptionsExt;
+ let file = OpenOptions::new()
+ .read(true)
+ .write(true)
+ .custom_flags(libc::O_DIRECT)
+ .open(tmpfile.path())
+ .unwrap();
+ let db = redb::Database::builder()
+ .set_cache_size(4 * 1024 * 1024 * 1024)
+ .create_file(file)
+ .unwrap();
+ let table = RedbBenchDatabase::new(&db);
+ benchmark(table)
+ };
+
+ /*let lmdb_results = {
let tmpfile: TempDir = tempfile::tempdir_in(&tmpdir).unwrap();
let env = lmdb::Environment::new().open(tmpfile.path()).unwrap();
env.set_map_size(4096 * 1024 * 1024).unwrap();
@@ -325,7 +343,7 @@ fn main() {
let db = sanakirja::Env::new(tmpfile.path(), 4096 * 1024 * 1024, 2).unwrap();
let table = SanakirjaBenchDatabase::new(&db);
benchmark(table)
- };
+ };*/
fs::remove_dir_all(&tmpdir).unwrap();
@@ -337,10 +355,11 @@ fn main() {
for results in [
redb_latency_results,
- lmdb_results,
+ redb_directio_results,
+ /*lmdb_results,
rocksdb_results,
sled_results,
- sanakirja_results,
+ sanakirja_results,*/
] {
for (i, (_benchmark, duration)) in results.iter().enumerate() {
rows[i].push(format!("{}ms", duration.as_millis()));
diff --git a/src/tree_store/page_store/file_backend/unix.rs b/src/tree_store/page_store/file_backend/unix.rs
index db24c48..b49cfb6 100644
--- a/src/tree_store/page_store/file_backend/unix.rs
+++ b/src/tree_store/page_store/file_backend/unix.rs
@@ -19,6 +19,8 @@ use std::os::wasi::{fs::FileExt, io::AsRawFd};
#[derive(Debug)]
pub struct FileBackend {
file: File,
+ mem_align: usize,
+ offset_align: usize,
}
impl FileBackend {
@@ -33,6 +35,15 @@ impl FileBackend {
/// Creates a new backend which stores data to the given file.
#[cfg(unix)] // remove this line when wasi-libc gets flock
pub fn new(file: File) -> Result<Self, DatabaseError> {
+ let stat = rustix::fs::statx(
+ &file,
+ "",
+ rustix::fs::AtFlags::EMPTY_PATH,
+ rustix::fs::StatxFlags::DIOALIGN,
+ )
+ .unwrap();
+ let mem_align = stat.stx_dio_mem_align as usize;
+ let offset_align = stat.stx_dio_offset_align as usize;
let fd = file.as_raw_fd();
let result = unsafe { libc::flock(fd, libc::LOCK_EX | libc::LOCK_NB) };
if result != 0 {
@@ -43,7 +54,11 @@ impl FileBackend {
Err(err.into())
}
} else {
- Ok(Self { file })
+ Ok(Self {
+ file,
+ mem_align,
+ offset_align,
+ })
}
}
}
@@ -54,7 +69,10 @@ impl StorageBackend for FileBackend {
}
fn read(&self, offset: u64, len: usize) -> Result<Vec<u8>, io::Error> {
+ assert!(offset % self.offset_align as u64 == 0);
+ assert!(len % self.offset_align == 0);
let mut buffer = vec![0; len];
+ assert!(buffer.as_ptr() as usize % self.mem_align == 0);
self.file.read_exact_at(&mut buffer, offset)?;
Ok(buffer)
}
@@ -83,6 +101,9 @@ impl StorageBackend for FileBackend {
}
fn write(&self, offset: u64, data: &[u8]) -> Result<(), io::Error> {
+ assert!(offset % self.offset_align as u64 == 0);
+ assert!(data.len() % self.offset_align == 0);
+ assert!(data.as_ptr() as usize % self.mem_align == 0);
self.file.write_all_at(data, offset)
}
}
diff --git a/src/tree_store/page_store/header.rs b/src/tree_store/page_store/header.rs
index cd4be9d..4809a6a 100644
--- a/src/tree_store/page_store/header.rs
+++ b/src/tree_store/page_store/header.rs
@@ -53,7 +53,7 @@ const REGION_TRACKER_PAGE_NUMBER_OFFSET: usize =
const TRANSACTION_SIZE: usize = 128;
const TRANSACTION_0_OFFSET: usize = 64;
const TRANSACTION_1_OFFSET: usize = TRANSACTION_0_OFFSET + TRANSACTION_SIZE;
-pub(super) const DB_HEADER_SIZE: usize = TRANSACTION_1_OFFSET + TRANSACTION_SIZE;
+pub(super) const DB_HEADER_SIZE: usize = 512;
// God byte flags
const PRIMARY_BIT: u8 = 1;
```
to run the lmdb_benchmark using direct I/O, which appears to work but significantly regresses performance, especially for the "batch writes" scenario that became three times slower (7s versus 24s) on my system.
While this is obviously not a supported use case, and hence this is not meant as a bug report but rather as a question, this result did surprise me. Maybe someone else has an idea why bypassing the kernel's page cache actually slows redb down?
So with the changes from #809, direct I/O still ends up being slightly slower for me:
| | buffered | direct |
|---|---|---|
| bulk load | 4369ms | 4639ms |
| individual writes | 385ms | 436ms |
| batch writes | 6668ms | 6989ms |
| len() | 0ms | 0ms |
| random reads | 1065ms | 1071ms |
| random reads | 1062ms | 1067ms |
| random range reads | 2759ms | 2765ms |
| random range reads | 2750ms | 2770ms |
| random reads (4 threads) | 287ms | 287ms |
| random reads (8 threads) | 169ms | 169ms |
| random reads (16 threads) | 148ms | 149ms |
| random reads (32 threads) | 127ms | 127ms |
| removals | 3401ms | 4981ms |
except for removals, which still takes about 1.5 times as long.
Any ideas where these remaining effects are coming from?
Hmm, I haven't played with direct i/o a whole lot, but here are a few hypotheses:
- write back buffering and coalescing. Without direct I/O all the `write()`s, which are mostly small, go into the page cache and are then flushed to disk before/at the call to `fsync()`, so they can be coalesced into larger writes. With O_DIRECT, I believe every call to `write()` goes straight to disk, so it's likely doing more write operations to disk
- it might have something to do with the user space cache invalidation. I invalidate entries on any free, so in a delete heavy workload like removals it'll free a page, evict it from the cache, and then immediately recreate it in the cache.
- maybe it has something to do with readahead? but since the read benchmarks aren't affected I think this is unlikely
With O_DIRECT, I believe every call to write() goes straight to disk, so it's likely doing more write operations to disk
Not necessarily, or rather only if O_DSYNC is passed, which makes things significantly slower again.
But I agree that the opportunities for coalescing might be reduced to whatever is queued in the I/O layer. I guess direct I/O would only really make sense when combined with asynchronous I/O, so that the program can issue all its I/O "immediately" and the block layer can do almost as much coalescing as the page cache could have done before the fsync hits.
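For illustration, here is a rough sketch of that "issue everything at once" idea using the io-uring crate (an assumption; neither the diff above nor redb itself uses io_uring). All per-transaction page writes are queued first and the ring is waited on once, so the block layer gets a chance to merge adjacent requests even with O_DIRECT. The statx-reported buffer alignment requirements still apply and are glossed over here:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

/// Queue every (offset, buffer) pair of a commit and wait for all completions
/// at once instead of issuing one blocking write per page.
fn write_batch(file: &File, writes: &[(u64, Vec<u8>)]) -> std::io::Result<()> {
    let mut ring = IoUring::new(writes.len().next_power_of_two() as u32)?;
    let fd = types::Fd(file.as_raw_fd());

    for (i, (offset, buf)) in writes.iter().enumerate() {
        // Buffers must stay alive (and, with O_DIRECT, suitably aligned)
        // until the corresponding completion has been reaped.
        let sqe = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset(*offset as _)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // A single syscall submits the whole batch; the kernel can now merge
    // adjacent requests much like the page cache would have before fsync.
    ring.submit_and_wait(writes.len())?;
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "write failed: {}", -cqe.result());
    }
    Ok(())
}
```

Whether the extra submission/completion bookkeeping actually pays off compared to buffered writes plus fsync would of course need measuring.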
it might have something to do with the user space cache invalidation. I invalidate entries on any free, so in a delete heavy workload like removals it'll free a page, evict it from the cache, and then immediately recreate it in the cache.
So this sounds like something that could be optimized? For a free'd page, its contents should not matter, should they? So it could just be declared clean and put into the read cache? Maybe after zeroing out the in-memory representation? We'd just need to be sure that when the supposedly clean page is evicted and recreated from disk, that leads to the same result.
I think this is also the main benefit of running with direct I/O: it shines a light on hidden inefficiencies which the kernel's page cache currently smooths over. But on a fully loaded production system without cache memory to spare, not depending on the page cache's help is presumably rather helpful.
So this sounds like something that could be optimized?
It's less trivial than it seems. Because pages can be allocated at different orders, the size of the cached page might change. If you want to hack it into the benchmark to see if it has an effect though, it's probably safe to just assume the page size doesn't change since I think the benchmark only writes small values.
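For illustration, here is a rough sketch of that hack against a hypothetical page cache (the `PageCache` and `PageState` types below are made up for the example and are not redb's actual `PagedCachedFile` internals), assuming, as suggested, that the page size does not change between the free and the reallocation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache entry: either a clean copy of what is on disk or a
/// dirty buffer that still needs to be written back.
enum PageState {
    Clean(Arc<[u8]>),
    Dirty(Arc<[u8]>),
}

struct PageCache {
    pages: HashMap<u64, PageState>,
    page_size: usize,
}

impl PageCache {
    /// Current behaviour (simplified): freeing a page evicts it, so a later
    /// reallocation at the same page number recreates the entry from scratch.
    fn free_evict(&mut self, page_no: u64) {
        self.pages.remove(&page_no);
    }

    /// Sketched optimization: a freed page's contents no longer matter, so
    /// keep a zeroed buffer in the cache and declare it clean instead of
    /// evicting it. This is only sound if nothing ever reads a freed page
    /// and relies on its old (or zeroed) contents after a later eviction.
    fn free_keep_zeroed(&mut self, page_no: u64) {
        let zeroed: Arc<[u8]> = vec![0u8; self.page_size].into();
        self.pages.insert(page_no, PageState::Clean(zeroed));
    }
}
```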
So this sounds like something that could be optimized? For a free'd page, its contents should not matter, should they? So it could just be declared clean and put into the read cache? Maybe after zeroing out the in-memory representation? We'd just need to be sure that when the supposedly clean page is evicted and recreated from disk, that leads to the same result.
I looked into this, and actually I don't think this would contribute that much. The code already avoids reading the free'd page from disk.
I looked into this, and actually I don't think this would contribute that much.
So if this re-creation of free'd pages is inconsequential, where is the I/O overhead in the "removals" benchmark coming from? Updating the tree metadata on disk?
I'm not sure. Have you tried profiling with and without direct I/O? The `just flamegraph` target will output a flamegraph of the lmdb_benchmark run.
I tried your diff, but it panics with alignment errors in FileBackend. If you send a draft PR that properly handles all the alignment, I'll look into this further
Sorry for not getting back to you earlier, I did not yet have time to rerun the benchmarks. Just two remarks:
- The diff posted above calls `rustix::fs::statx` to check the required alignment but indeed does not ensure that `redb` actually respects it. However, if `redb`'s page size is larger than the file system's alignment requirements, this should not be an issue. Depending on the file system you use, the only thing required is to increase `DB_HEADER_SIZE`, where for me 512 bytes was sufficient for ext4.
- Concerning flamegraphs, note that if this ends up being I/O bound instead of CPU bound, then I think we have to take care to perform off-CPU profiling so that the profile includes time spent blocked on I/O in the kernel.
For example, I am getting
```
[src/tree_store/page_store/file_backend/unix.rs:45:25] stat.stx_dio_mem_align as usize = 4
[src/tree_store/page_store/file_backend/unix.rs:46:28] stat.stx_dio_offset_align as usize = 512
```
in my home directory, which is why I chose `DB_HEADER_SIZE` as 512.
Not sure if this is already helpful, but I used Hotspot as it has checkbox-level support for off-CPU profiling and limited the benchmarks to redb doing bulk insertion and then removals, with and without direct I/O.
Looking at the flamegraph for off-CPU time (after filtering out the Ctrl+C handler thread which is blocked reading on its pipe and totally skews off-CPU time), I get the following flame graph for buffered I/O
and using direct I/O
So instead of waiting on `File::sync_data`, it waits on `File::write_all_at` called from `PagedCachedFile::flush_write_buffer`, which does make sense as data is written immediately instead of being flushed as a batch at the end. But as indicated in the beginning, I am not sure how useful this is for pinpointing an underlying root cause.
This could end up coming back to direct I/O only really making sense when combined with asynchronous I/O: with direct I/O, we write one page after the other if a transaction touched multiple pages, whereas with buffered I/O we just push those pages into the page cache, which can issue the writes in parallel/batched when `sync_data` is called.
Depending on the file system you use, the only thing required is to increase DB_HEADER_SIZE where for me 512 bytes was sufficient for ext4.
Unfortunately, I hit a panic in read() because the allocated vector had the wrong alignment.
This could end up coming back to direct I/O only really making sense when combined with asynchronous I/O: with direct I/O, we write one page after the other if a transaction touched multiple pages, whereas with buffered I/O we just push those pages into the page cache, which can issue the writes in parallel/batched when `sync_data` is called.
I see, ya it seems like async io might be required, if all the individual writes with direct io are more expensive than the OS performing them during the fsync.
The only thing I can think of is that you could try to coalesce consecutive writes into calls to write_all_vectored(). Basically in these two for-loops, if the offset is equal to the previous offset plus the previous buffer length, then the two can be performed in a single call to write_all_vectored. Arbitrarily many could be grouped together, if they're consecutive.
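To make the suggestion concrete, here is a sketch of that coalescing idea. Since `Write::write_all_vectored` is a nightly-only API, the sketch uses `pwritev(2)` through the libc crate instead; the `write_coalesced` helper and the `(offset, buffer)` representation are assumptions for the example, not how `PagedCachedFile` actually stores its write buffer:

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Write a sorted list of (offset, buffer) pairs, issuing one pwritev(2) per
/// run of buffers that are consecutive on disk instead of one write per page.
fn write_coalesced(file: &File, writes: &[(u64, Vec<u8>)]) -> io::Result<()> {
    let mut i = 0;
    while i < writes.len() {
        // Grow the group while the next buffer starts exactly where the
        // previous one ends (and stay below the IOV_MAX limit of 1024).
        let start = writes[i].0;
        let mut end = start + writes[i].1.len() as u64;
        let mut j = i + 1;
        while j < writes.len() && writes[j].0 == end && (j - i) < 1024 {
            end += writes[j].1.len() as u64;
            j += 1;
        }

        // One positioned, vectored write for the whole run. A real
        // implementation must retry on short writes; this sketch just asserts.
        let iov: Vec<libc::iovec> = writes[i..j]
            .iter()
            .map(|(_, buf)| libc::iovec {
                iov_base: buf.as_ptr() as *mut libc::c_void,
                iov_len: buf.len(),
            })
            .collect();
        let written = unsafe {
            libc::pwritev(
                file.as_raw_fd(),
                iov.as_ptr(),
                iov.len() as libc::c_int,
                start as libc::off_t,
            )
        };
        if written < 0 {
            return Err(io::Error::last_os_error());
        }
        assert_eq!(written as u64, end - start, "short write not handled here");

        i = j;
    }
    Ok(())
}
```

With direct I/O this keeps the number of write syscalls closer to what the page cache achieves by coalescing before fsync, at the cost of only merging runs that happen to be contiguous.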
Unfortunately, I hit a panic in read() because the allocated vector had the wrong alignment.
Ah ok, this would suggest you are using an older kernel version where direct I/O still requires page-aligned buffers? With recent kernels, this requirement has been dropped/reduced and the general alignment guaranteed by the heap allocator suffices. Otherwise, this gets tricky as `libc::posix_memalign` needs to be involved, i.e. safe code and the built-in primitives like `Vec` and `Arc` do not suffice.
I'm on the 6.8 kernel, but it was this assert in your diff: `assert!(buffer.as_ptr() as usize % self.mem_align == 0);`
I think the assert is correct and necessary, i.e. direct I/O usually has alignment requirements, but they were significantly reduced in "later" kernels, down from page alignment to lower values, e.g. the reported four bytes on my system (ext4 on an NVMe drive on x86-64).
It would be interesting to see which value `stx_dio_mem_align` reports on your system. (I think almost anything in the I/O stack can influence this, e.g. filesystem, block device driver, volume block size, etc.)
It required alignment of 512.
This sounds like the device block size and is certainly higher than the natural alignment (i.e. alignment sufficient for all primitive types) which the dynamic allocator provides by default.
So I think for such a setup, we would need to implement our own memory management here, which I think is out of scope for redb (and requires significant amounts of unsafe code). I think compared to that, using memory maps is the better trade-off of unsafe code versus performance.
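For reference, the aligned allocation itself does not strictly need `libc::posix_memalign`: `std::alloc` can hand out buffers satisfying a 512-byte `stx_dio_mem_align`, though it is still unsafe code and the backend's `read()` would have to return such a type instead of a plain `Vec<u8>`. A minimal sketch (the `AlignedBuf` type is hypothetical, not part of redb):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::slice;

/// Heap buffer whose start address honours the alignment reported by
/// statx(STATX_DIOALIGN), e.g. 512 bytes, as required for O_DIRECT reads.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize, align: usize) -> Self {
        let layout = Layout::from_size_align(len, align).expect("invalid layout");
        // Zero-filled so it can be passed to read_exact_at directly.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failure");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```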
I do not want to re-open the memory maps debate; I think the position that redb takes is eminently reasonable. I just think it shows that direct I/O does not really fit into that position. I also feel like we have learned everything there is to learn here for now. So maybe we just close this, having found one optimization opportunity?
Yep, thanks for contributing that optimization!
Using RWF_UNCACHED as proposed here might be an ideal solution: it avoids the basically redundant kernel caching while still using the page cache to consolidate multiple logical writes into larger block writes, has no additional requirements on the I/O requests, and could be as simple as passing an additional flag to libc's I/O functions.
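As a sketch of what that could look like mechanically: per-call RWF_* flags are passed via `pwritev2(2)`. Since RWF_UNCACHED is only a proposed flag and not yet exposed by the libc crate, the example below passes the existing `RWF_DSYNC` constant to show where it would go; the `pwrite_with_flags` helper is hypothetical.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Positioned write with per-call RWF_* flags via pwritev2(2). Once the
/// kernel and the libc crate expose RWF_UNCACHED, it would be passed here
/// exactly like RWF_DSYNC below, with no O_DIRECT-style alignment rules.
fn pwrite_with_flags(file: &File, buf: &[u8], offset: u64, flags: libc::c_int) -> io::Result<usize> {
    let iov = libc::iovec {
        iov_base: buf.as_ptr() as *mut libc::c_void,
        iov_len: buf.len(),
    };
    let n = unsafe {
        libc::pwritev2(file.as_raw_fd(), &iov, 1, offset as libc::off_t, flags)
    };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(n as usize)
    }
}

// Usage: today this gives per-call O_DSYNC semantics; RWF_UNCACHED would be
// swapped in once available.
// let written = pwrite_with_flags(&file, &page, 0, libc::RWF_DSYNC)?;
```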