
Concurrent Read-Write Locks on Slices

Open LeoPloutno opened this issue 10 months ago • 9 comments

Proposal

Add an API for concurrent mutable and immutable access of disjoint elements/subslices of a slice.

Problem statement

There is currently no standard, safe way to mutably and concurrently access disjoint elements of a slice in a non-blocking way if concurrent immutable access to the whole slice is also needed.

Motivating examples or use cases

  • Say one wishes to program a concurrent image filter. That usually involves convolutions, i.e., multiplying and summing over a window of the slice (the image data) and storing the result back into that same slice. As it stands now, the only (safe) way to do such a thing is to wrap the slice in a Vec<T> (or Box<[T]>) and put the owning object in a RwLock. This way, though, the stores are serialized despite being accesses to non-overlapping sections of memory (see the sketch after this list).
  • In a molecular dynamics simulation, the state of all the particles at a particular step depends on the state of the whole system at the previous step. One would want a way to compute the updated state concurrently by reading the old slice, applying the propagation algorithm, and then storing the result back into that slice to update the system.
  • Numerical differentiation, Fourier transforms and statistical analyses also require convolutions, which could be parallelized (see the first use case for details).
  • Gradient descent in higher dimensions parallelizes well - each thread computes its partial derivative from the whole slice, determines the optimal step, and stores the updated state back as part of the iterative process.
  • Gamedev (?): when a number of entities interact and mutate each other's state, concurrency could be applied to speed things up.
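
For illustration, here is a minimal sketch of the status quo described in the first bullet; the function and the trivial "kernel" are made up. The whole image buffer sits behind a single RwLock, so every per-pixel store takes the exclusive lock and the writes are serialized even though each thread only ever touches its own row.

use std::sync::RwLock;
use std::thread;

fn filter_rows_in_place(image: &RwLock<Box<[f32]>>, width: usize) {
    let rows = image.read().unwrap().len() / width;
    thread::scope(|s| {
        for row in 0..rows {
            s.spawn(move || {
                for col in 0..width {
                    // Read phase: shared lock over the whole slice.
                    let value = {
                        let img = image.read().unwrap();
                        img[row * width + col] * 0.5 // stand-in for a real convolution
                    };
                    // Write phase: exclusive lock, blocking every other writer
                    // even though each thread stores to a disjoint row.
                    image.write().unwrap()[row * width + col] = value;
                }
            });
        }
    });
}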

Solution sketch

The general idea is to add a type resembling RwLock that could provide either a guard for reading the whole slice, a guard for writing to the element assigned to the lock object, or a way to read the assigned element (which, depending on the API, might not need any guards whatsoever!). The same principle could also apply to non-overlapping subslices rather than individual elements.

The locks could be the owners of the underlying slice, which would require them to have atomic counters. Alternatively, they could be derived from another owning object, thereby allowing their use only in scoped threads.

A sample of the proposed API:

use <path_to_feature>::{SliceRwLock, SliceRwLockReadSliceGuard, SliceRwLockWriteGuard, Iter};

fn main() {
    // allocate some data
    let s = vec![1, 2, 3, 4].into_boxed_slice();

    // create an iterator that yields the locks for all elements
    let iter: Iter<usize> = s.into();
    std::thread::scope(|scope| {  // not necessary, simply convenient
        for lock in iter {
            scope.spawn(move || {
                let r_guard: SliceRwLockReadSliceGuard<_> = lock.read_slice();
                let mut sum: usize = r_guard.iter().sum();
                drop(r_guard);  // no need for the guard anymore - decrement the counter
                let elem: &usize = lock.read();  // writers can only access their unique element - no need for a guard
                sum += *elem;
                let mut w_guard: SliceRwLockWriteGuard<_> = lock.write();
                *w_guard = sum;
            });
        }
    });
}

Roughly the same API could be used for non-overlapping subslices of runtime-determined length or chunks of const size - the write guard, when dereferenced, would yield a &mut [T] or a &mut [T; N], respectively, and the read function would return the same, only immutable (see the hypothetical sketch below).
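
A hypothetical sketch of the chunked variant, reusing the made-up naming style from the example above (none of these types exist; the sketch assumes chunks of const size 4):

let data = vec![0u32; 1024].into_boxed_slice();
// hypothetical: an iterator of locks over disjoint [u32; 4] chunks
let chunk_locks: ChunkIter<u32, 4> = data.into();
for lock in chunk_locks {
    let whole = lock.read_slice();       // guard dereferencing to &[u32] (the full slice)
    let total: u32 = whole.iter().sum();
    drop(whole);                         // release the shared guard before writing, as above
    let mine: &[u32; 4] = lock.read();   // this lock's own chunk; no guard needed
    let carry = mine[3];
    let mut w = lock.write();            // guard dereferencing to &mut [u32; 4]
    w[0] = total + carry;
}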

Alternatives

  • An alternative, more general API would allow the creation of four types of guards: read the whole slice, read the assigned element, write to the whole slice, write to the assigned element. The mere existence of a guard that allows mutable access to the whole span automatically forces us to use a guard for immutable access to a single element as well, since a writer from another thread could be accessing it too. Without such a whole-slice write guard, any thread is allowed to read the contents of its assigned element via a plain reference, as long as the write() function takes &mut self.
  • Extend the functionality of RwLock<[T]> to accommodate this (impractical, since RwLock is not reference-counted and the APIs are not compatible).
  • Use RwLock<Container<T>> and write to elements sequentially (linear time complexity vs. constant time complexity if truly parallelized).
  • Use lots of unsafe code to get the described behavior (self-explanatory).

This feature (like nearly anything else) could be done as a crate on crates.io, but I think it could neatly complement and extend the already-existing functionality of RwLock, allowing more user-friendly slice concurrency and making that API more general and encompassing. Also, it would really benefit from core::intrinsics, unstable APIs and features (like allocator_api and ptr_metadata), and compiler-level optimizations which are available primarily in the standard library.

I also think this API (or any other one that offers the proposed functionality) should be a part of the standard library for the same reason Arc, RwLock and slice::chunks(_exact) are:

  • RwLock is essentially a special case of what I am proposing - a type that allows mutable and immutable access to the same data across threads. It can be implemented using a simple (Mutex, Condvar) pair - the problem is that unsafe code is required to bypass the borrow checker, so it is handed to the programmer via the standard library. This case is no different: according to the concurrency model Rust uses, simultaneous access to disjoint parts of a slice is allowed, yet there is currently no safe way to do it.
  • As the point above says, accessing non-overlapping chunks of a slice at the same time is permitted - hence, slice::chunks is a thing. It can also be achieved using trivial pointer arithmetic and some unsafe code (see the sketch below), yet it is in the standard library because it is a good, safe, commonly used abstraction.
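
To illustrate the "trivial pointer arithmetic and some unsafe code" point, here is a minimal sketch of splitting a slice into two disjoint mutable halves by hand - the pattern that the mutable counterparts split_at_mut and chunks_mut wrap up safely (the helper name is made up):

use std::slice;

fn split_halves_mut<T>(s: &mut [T]) -> (&mut [T], &mut [T]) {
    let len = s.len();
    let mid = len / 2;
    let ptr = s.as_mut_ptr();
    // SAFETY: the ranges [0, mid) and [mid, len) do not overlap and both stay
    // within the original borrow, so no aliasing &mut references are created.
    unsafe {
        (
            slice::from_raw_parts_mut(ptr, mid),
            slice::from_raw_parts_mut(ptr.add(mid), len - mid),
        )
    }
}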

Links and related work

https://users.rust-lang.org/t/writing-disjoint-slices-and-reading-whole-slice-in-multithreading/118076

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

LeoPloutno avatar Feb 18 '25 15:02 LeoPloutno

Might want to edit the title into an appropriate summary of your request

pitaj avatar Feb 18 '25 16:02 pitaj

In your proposal you used a thread scope. But it should already be possible to achieve this outcome with thread scopes and simple mutable references. This uses the rayon threadpool scope, but something similar should be possible with std:

use rayon::{Scope, ThreadPool};

// `quicksort` and `partition` are helpers defined elsewhere.
fn quicksort_job_scoped<'s, 'env: 's, T: Ord + Send>(s: &Scope<'s>, slice: &'env mut [T]) {
    if slice.len() <= 0x8000 {
        quicksort(slice);
    } else {
        let (left, right) = partition(slice);
        s.spawn(|s| quicksort_job_scoped(s, left));
        s.spawn(|s| quicksort_job_scoped(s, right));
    }
}

fn threaded_quicksort<T: Ord + Send>(slice: &mut [T], tp: &ThreadPool) {
    tp.scope(|s| {
        s.spawn(|s| {
            quicksort_job_scoped(s, slice);
        });
    });
}
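
For comparison, a minimal std-only sketch of the same pattern, handing disjoint halves to scoped threads via split_at_mut (the helper name here is made up):

use std::thread;

fn process_halves<T: Send>(slice: &mut [T], f: impl Fn(&mut T) + Sync) {
    let (left, right) = slice.split_at_mut(slice.len() / 2);
    thread::scope(|s| {
        // Each scoped thread gets exclusive access to its own half.
        s.spawn(|| left.iter_mut().for_each(&f));
        s.spawn(|| right.iter_mut().for_each(&f));
    });
}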

ajwock avatar Feb 18 '25 16:02 ajwock

The code snippet in the proposal is merely an example of the usage of the API, not a concrete solution to a concrete problem. It would work the same if I didn't use a scope and instead joined the threads at the end of the program. I just wanted to keep it concise, so I used a scope.

What does your code demonstrate? As I understand it, you didn't mutably access an element from another thread, just spawned another one.

Also, where are the partition and quicksort functions from?

LeoPloutno avatar Feb 18 '25 17:02 LeoPloutno

Apologies, that snippet provides less context than I intended. A few months back I was trying to figure out how to get around this sort of problem with only safe rust with quicksort as my toy problem to do it with. This repo demonstrates a rayon implementation and a std implementation (not of production quality; it was just an exercise for myself): https://github.com/ajwock/gen_projects/tree/main

Indeed, this would only work on scoped threads. But when you're working with a threadpool as in rayon, you don't need to spawn new threads for each task; you just send jobs off to the pool.

I acknowledge that there are limitations to this and that the scope is necessarily part of how this works. However, lifetimes are inherently an issue when you're sharing an object across threads, and the object must live at least as long as any thread. Scopes are a solution to this. What you might want instead is some sort of slice-specialization of a shared pointer such that multiple threads have disjoint ownership of parts of the allocation, but that the whole allocation remains valid if any thread holds access to any part of it.
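
A rough outline of that shape (all names are illustrative; the accessors, interior mutability, and the argument that the ranges are disjoint are deliberately elided):

use std::ops::Range;
use std::sync::Arc;

struct DisjointHandle<T> {
    buf: Arc<[T]>,        // shared ownership keeps the whole allocation alive
    range: Range<usize>,  // the part this handle is (conceptually) allowed to mutate
}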

I think a non-std library would be more appropriate in that case.

ajwock avatar Feb 18 '25 20:02 ajwock

I don't think we are talking about the same kind of problem... In your code you had to rely on scoped threads because you were sending references between them using channels, and for the references to be valid you had to ensure that the lifetime of the workers would not exceed that of the slice. What this ACP proposes is a more general API that would allow cheap, concurrent read and write access to non-overlapping parts of the same allocation. While the implementation could use channels like yours does, that is not the optimal solution for this kind of problem. For instance, if every thread in your program needed to read every subslice (partition, in this case), you'd have every thread send its subslice to every other thread, giving linear time complexity if the threads end up on different cores and quadratic time complexity if the processor executed the code sequentially! I am offering a solution with O(1) or O(n) complexity, respectively.

The main point is not the necessity of scopes or lack thereof - the API is quite similar in both cases, with different tradeoffs:

  • If we want the slice to live as long as any of the locks is alive, a reference counter is needed (like Arc). That would allow more flexibility at a small runtime cost.
  • If the opposite is desirable - the threads must not outlive the slice - then scopes are required. That lifts some implementation restrictions and eliminates the need for atomic counters, at the cost of excluding the many use cases where scoped threads do not fit. (Hypothetical shapes of both variants are sketched below.)
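
Hypothetical shapes of the two variants, with the lock state and interior mutability elided; every name is made up for illustration:

use std::sync::Arc;

// Owning, reference-counted variant: usable with non-scoped threads,
// since the Arc keeps the slice alive while any lock exists.
struct OwnedElemLock<T> {
    data: Arc<[T]>,
    index: usize, // the element this lock may write to
}

// Borrowing variant: tied to the lifetime of the original slice,
// so it only works with scoped threads.
struct BorrowedElemLock<'a, T> {
    data: &'a [T],
    index: usize,
}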

LeoPloutno avatar Feb 19 '25 03:02 LeoPloutno

I think a concrete implementation on crates.io would be helpful because the proposal as written touches several corners in a relatively big design space that each require different APIs and implementation strategies. Even focusing on one particular use case, it’s not obvious to me what “the right solution” looks like. For example, fine-grained per-element locking needs a huge amount of lock state when implemented in the straightforward way (sharding reduces that but can lead to performance cliffs and occasional deadlocks depending on the API), and has significant overhead on every write, so I don’t know if anyone would actually want to use that for an image filter.

hanna-kruppe avatar Feb 19 '25 14:02 hanna-kruppe

Will do; however, it will take me quite some time. Will the ACP still be open by then?

LeoPloutno avatar Feb 19 '25 14:02 LeoPloutno

I am not a team member, so I can't say anything about the process. Nor should my opinion be taken as gospel about whether, or under which conditions, the proposal may be accepted. I'm just saying that it's hard to say yes (or no) to anything without knowing what the API(s) will look like and whether they can be implemented satisfactorily.

hanna-kruppe avatar Feb 19 '25 15:02 hanna-kruppe

(A concrete API proposal is a great first step because domain experts can often get a good idea of implementability from that alone. But “here’s an implementation and here’s a program using it to great effect” can be useful if there’s doubt about the usefulness of the proposed API.)

hanna-kruppe avatar Feb 19 '25 15:02 hanna-kruppe