libsql icon indicating copy to clipboard operation
libsql copied to clipboard

Push free page management down to the VFS layer

Open losfair opened this issue 1 year ago • 10 comments

SQLite uses a freelist to manage free pages. This is a major source of INSERT contention when using an optimistic lock-free storage layer, like mvSQLite.

Ideally we should push free page management down to the VFS layer:

// struct sqlite3_io_methods
int (*xFreePage)(sqlite3_file*, sqlite3_int64 pageNo); // free a page
int (*xAllocatePage)(sqlite3_file*, sqlite3_int64 *pageNoOut); // allocate a new page and return its id

losfair avatar Oct 04 '22 18:10 losfair

I wondered if it should be abstracted away to another virtual table, since it doesn't technically need to touch disk, like other io_methods, but managing free pages is tightly bounded to disk operations anyway, so sqlite3_io_methods indeed sounds like the best place.

psarna avatar Oct 05 '22 08:10 psarna

On the other hand, it starts to look like a layering violation - page freelist is a property of a pager; I'll experiment with the best place to put such a virtual interface a little more and come back with a definitive answer. Perhaps a pager needs a pVfs table for virtual functions related to I/O, and a separate virtual table for an allocation scheme.

psarna avatar Oct 05 '22 10:10 psarna

The actual implementation is even deeper, in BTree: https://github.com/libsql/libsql/blob/aafe1d5457936a05bca5d312e2fd25f32afeb20b/src/btree.c#L6244 https://github.com/libsql/libsql/blob/aafe1d5457936a05bca5d312e2fd25f32afeb20b/src/btree.c#L6566

@losfair to clarify, you're interested in being able to override the existing hardcoded freelist management, which uses offsets 32 and 36 of each page to maintain a freelist, and instead be able to implement your own scheme, right? It's important to note that after it's done, the database will most likely only be readable if you later run it with the same exact page allocation/deallocation implementation set up. Right now the freelist is persisted on disk, so that the database is able to check which pages are free during bootstrap.

psarna avatar Oct 05 '22 10:10 psarna

@psarna Yes I would like to be able to override the builtin freelist-based free page management - the now-unused word at offset 32/36 can be filled with zero then?

It's important to note that after it's done, the database will most likely only be readable if you later run it with the same exact page allocation/deallocation implementation set up.

Wouldn't this only break writes but not reads under the default configuration?

losfair avatar Oct 05 '22 11:10 losfair

Alright, then I'll take a stab at providing a way for virtualizing the freelist management and send it for peer review as a pull request once done.

It's important to note that after it's done, the database will most likely only be readable if you later run it with the same exact page allocation/deallocation implementation set up.

Wouldn't this only break writes but not reads under the default configuration?

Well, depends on what the implementation does - in general case, whatever's stored at offsets 32 and 36 can only be properly interpreted by the same (or backward compatible) implementation. But I think it's safe to assume that users intending to override freelist management never need to suddenly change its implementation on an existing file.

By the way, do you have a specific implementation in mind, and can you share any details on it? With a custom implementation, the information about which page is free needs to be persisted somewhere anyway - either at one of these offsets (32,36), or in some external persistent storage, otherwise the database wouldn't be able to figure out which pages are free when bootstrapping.

psarna avatar Oct 05 '22 11:10 psarna

self-note: the implementation would also need to be very careful to override various sanity checks and assumptions, like this one: https://github.com/libsql/libsql/blob/aafe1d5457936a05bca5d312e2fd25f32afeb20b/src/btree.c#L10731

after a custom implementation is allowed, any assumption about data at offset 32 or 36 may not hold true anymore, if a non-default page allocation policy is used.

psarna avatar Oct 05 '22 11:10 psarna

By the way, do you have a specific implementation in mind, and can you share any details on it?

The implementation is mvSQLite. With this change page allocation can be further pushed down to FoundationDB:

  • IDs of freed pages are inserted to a "free set" implemented as a KV prefix subspace.
  • IDs of newly allocated pages are either selected randomly from the free set, or assigned a previously unused random ID (new_id > current_max_page_id).

Offset 32/36 do not need to be reused - the free set is kept internally as database metadata.

losfair avatar Oct 05 '22 15:10 losfair

Got it, makes perfect sense. I'll continue working on it and ping you once I have something workable.

psarna avatar Oct 05 '22 16:10 psarna

@losfair just to make sure to bring this to your attention, there's a PR open implementing a first RFC for this, would love to get your eyes on it!

glommer avatar Oct 25 '22 17:10 glommer

Thanks for implementing this so quickly! Will try it next week.

losfair avatar Oct 26 '22 14:10 losfair