scryer-prolog rebis-dev: add opt-in concurrency for OffsetTableImpl

The global tables of offset_table.rs are removed in favour of a sum type, owned by Arena, that can be toggled between Serial and Concurrent variants. Offset table lookups must now be performed explicitly instead of implicitly, which the read_heap_cell! macro no longer does. The machine can migrate the table data from Serial to Concurrent when spawning a new thread from the parent machine thread for the first time and back from Concurrent to Serial again once all child machine threads have been joined / offset table access can safely happen without synchronization once more. This addresses the performance issue raised in #2908 in single-threaded Scryer.

I've created this PR instead of merging the commit directly to rebis-dev because I'm not sure I've implemented it quite right. Previously the offset table had type Arcu<RawBlock<OffsetTableImpl<T>>, GlobalEpochCounter> instead of Arc<RawBlock<T>, GlobalEpochCounter>, for instance. I don't know why OffsetTableImpl<T> appeared recursively in that type but I'm sure there was a reason.

I'd appreciate feedback from @Skgland particularly, as the author of the arcu crate, and anyone else who cares to weigh in on this or any other potential problems in this PR. Thanks.

May 23 '25 07:05 mthom

I'd appreciate feedback from @Skgland particularly, as the author of the arcu crate, and anyone else who cares to weigh in on this or any other potential problems in this PR. Thanks.

I won't be able to properly read and as such give feedback to this until after the 6th of June.

May 23 '25 07:05 Skgland

@adri326 and @bakaq: since you have also shown interest in architectural questions in multi-threaded context, I would greatly appreciate if you would also like to have a look at this, if you have time!

May 23 '25 10:05 triska

Looks mostly good to me.

The ConcurrentOffsetTable impl looks a bit fishy to me. I am not sure what prevents build_with and lookup_mut from racing.

Consider the following execution:

Thread 1 calls lookup_mut to get a mutable reference to an entry and gets preempted.
Thread 2 calls build_with and causes a re-allocation of the block (copying the current/old value of the entry that Thread 1 has a mutable reference to)
Thread 1 continues and makes changes via the reference (pointing in the old block).

This has two problems:

unsoundness: as T2 copies/reads the value while T1 has an exclusive/mutable reference.
lost update: the update made by T1 is lost as T2 makes a copy between the point when T1 acquires the reference and when T1 update the value

Or simpler whats stopping multiple threads from calling lookup_mut for the same entry.

Jun 10 '25 19:06 Skgland

Ah, I think I overlooked build_with being unsafe and lookup_mut doesn't return a &mut T but (a smart pointer equivalent to) &UnsafeCell<T>.

I will have to have another, more thorough, look at this.

Jun 11 '25 04:06 Skgland

If I understand this correctly then while single threaded one would have a OffsetTableImpl containing a InnerOffsetTableImpl::Serial and when a second thread is spawned that would be turned into a InnerOffsetTableImpl::Concurrent and each thread would have it's own clone of the inner Arc to share the table.

As multi-threading isn't implemented yet this is a bit speculative, but the only way I could see ConcurrentOffsetTable::lookup_mut be used soundly and correctly is if for the duration of the returned value being alive the ConcurrentOffsetTable.upgrade lock was being held. For example by changing ConcurrentOffsetTable::lookup_mut to return a smartpointer containing a mutex guard. That still leaves the problem of lookup and lookup_mut allowing a shard and mutable reference at the same time.

Otherwise I think my points from https://github.com/mthom/scryer-prolog/pull/2967#issuecomment-2960371209 still stand, though mutable reference is only created via DerefMut on TablePtrMut and not directly in lookup_mut.

This appears to be a pre-existing problem, as previously lookup was basically the new lookup_mut and TablePtr was TablePtrMut with all of these problems.

In summary currently nothing appears to prevent:

a TablePtr and a TablePtrMut (or two TablePtrMut) from being given out concurrently for the same offset
the ConcurrentOffsetTables block from being re-allocated while a TablePtrMut has been given out

Jun 13 '25 18:06 Skgland

@Skgland The latest commit is an improvement but there's nothing preventing two threads from taking out pointers to the same offset at once. I'm still unsure of how to handle that. lookup_mut could claim the write part of the lock but that would lock the entire table down, not just the word referred by the offset.

Jun 21 '25 06:06 mthom

What about changing the Mutex to a RwLock, build_with would take a write lock while lookup_mut would take a read lock. This would allow concurrent lookup_mut while keeping the exclusiveness for build_with. Though this would not solve the problem of allowing multiple concurrent lookup_{,mut} for the same index.

Including a lock guard might require switching from a std Mutex/RwLock to e.g. tokio as those have owned guards.

Edit: Should have read the commit before responding to the comment. I will have to have a look in the evening.

Jun 21 '25 07:06 Skgland

A simple and correct solution seems a very good choice, even if it yields suboptimal performance or even a slowdown at the moment.

Jun 21 '25 07:06 triska

@Skgland Thanks. I hope I haven't disrupted your reading with my latest commit (made just now) but I think it's complete now.

Jun 22 '25 07:06 mthom

I noticed a lot of table.with_entry(offset, |v| *v) calls maybe a function

fn get_entry(&self, offset: usize) -> T where T: Copy {
   self.with_entry(|v| *v)
}

would make this more ergonomic/less repetitive?

Also it looks like the only OffsetTable that uses with_entry_mut is the CodeIndexTable. Maybe it would make sense to have a variant of OffsetTable without with_entry_mut? Without with_entry_mut the initial (much simpler) implementation would have been fine.

Jun 22 '25 09:06 Skgland

I tried rewriting the commit previous to the latest one by changing the type of offset_locks to Arc<Vec<Arc<RwLock>>> using parking_lot's ArcRwLockReadGuard to get around the lifetime issues. Benchmarking ConcurrentTableImpl on test is much, much slower than the current commit. get_entry is of course a good idea, but it's too late for me now. Will do it tomorrow.

Jun 24 '25 07:06 mthom

I will merge this soon if there are no further comments.

Jun 26 '25 05:06 mthom