heed icon indicating copy to clipboard operation
heed copied to clipboard

Implementing parallel read iterators

Open xav-db opened this issue 8 months ago • 4 comments

I would really like a way to be able to iterate over a heed RoIter or RoPrefix in parallel. I think the way is to make an iterator of iterators where each iterator in the main iterator is consumed on its own thread. It would be good to see an idiomatic way to approach this if nothing exists in the library to help with this already!

xav-db avatar May 01 '25 17:05 xav-db

Hey @xav-db 👋

Thank you for your interest in heed. I have a bunch of ideas to make it possible to iterate and share the work associated with each database entry over multiple threads. I'll only talk about the most correct one.

There is no direct way to know the key repartition in an LMDB database. Therefore, my proposal is to manage one (1) every n keys where n is the number of working threads.

The idea is to create one iterator (cursor in LMDB's term) on each thread, from the same transaction to keep the same view.

Unfortunately, we cannot borrow an immutable RoTxn over multiple threads. So we have to define another strategy: We create the different iterators on the main thread, store them in a Vec and share each one of them to each thread.

Each iterator is wrapped in the Enumerate iterator combinator. Each thread will then yield only one out of n entries. The yielded entries are, for each thread x, the ones returning true to the following formula: i % n == x where i is the increment returned by the enumerator, n is the total number or working threads and x is the thread ID on which we are.

This is a well known technique to balance workload over multiple threads. It's also the technique that balances the best when the key repartition is unknown.

Another similar technique was to compute the hash of keys and use the same formula but it's more costly (hash computation) and balances less evenly the workload.

I'll try to come with some code at some point (I'm in the train on mobile phone) as I still have some doubts around lifetimes. I'll not try to work on any rayon iterator adaptors.

Kerollmops avatar May 07 '25 12:05 Kerollmops

So, I have made a bit of progress on this subject. I was right, and there were some Send + Sync restrictions here and there. The example below creates the iterators on the main thread and moves them to the different threads. It is safe to move cursors between threads, but moving transactions requires using the NO_TLS option and will be more complex due to lifetimes.

The following example works perfectly and balances the entries evenly across threads. However, based on this branch, it modifies heed to make the cursor's Send.

fn main() -> heed::Result<()> {
    let env;

    let rtxn = env.read_txn()?;
    let n = 9; // number of threads
    let iters: Vec<_> = (0..n)
        .map(|_| db.iter(&rtxn).map(|i| i.enumerate()))
        .collect::<heed::Result<_>>()?;

    std::thread::scope(|s| {
        let threads: Vec<_> = iters
            .into_iter()
            .enumerate()
            .map(|(x, iter)| s.spawn(move || iterate(n, x, iter)))
            .collect();
        threads.into_iter().try_for_each(|t| t.join().unwrap())
    })?;

    Ok(())
}

fn iterate<'t>(
    n: usize, // number of threads
    x: usize, // thread ID
    iter: impl Iterator<Item = (usize, heed::Result<(&'t str, ())>)>,
) -> heed::Result<()> {
    // i % n == x where
    //  i is the increment returned by the enumerator
    //  n is the total number or working threads and
    //  x is the thread ID on which we are.
    let mut count = 0;
    for (i, result) in iter {
        if i % n == x {
            let (key, ()) = result?;
            // eprintln!("thread {x} is looking at {key}");
            count += 1;
        }
    }

    eprintln!("thread {x} has seen {count} keys");

    Ok(())
}

Kerollmops avatar May 07 '25 20:05 Kerollmops

looks great. have implemented it and looks like its going to work well. just need the send on the cursors to be implemented

xav-db avatar May 29 '25 06:05 xav-db

Hey @xav-db 👋

That's very cool to hear.

just need the send on the cursors to be implemented

Have you tried with this branch to see how it goes if we actually really implement Send on the cursors?

Kerollmops avatar Jul 16 '25 12:07 Kerollmops