RW locks in dicts
Introducing RW locks for unicode-keys-only dictionaries - they should be safe.
These changes bring about a 20% decrease in CPU time consumed by x.py.
This PR is related to #30
Thanks for doing this experiment!
I'm not convinced that this will make any difference in real-world Python programs. As we all know, x.py is a bad benchmark; all it does is
- look up fib in the module dict
- execute a function
- run a little bytecode
- do a little math
Are you willing to maintain this as a separate PR for a while, until the gilectomy branch can run a wider range of programs, and we can get a sense for how important this is in the general case?
Also, what will happen in your read/write lock if you read-lock the dictionary first then attempt a write-lock?
I'm not convinced that this will make any difference in real-world Python programs.
I've implemented quite basic locking here, but I believe it should make a difference for any program that accesses essentially any unicode-keyed dictionary - and those include builtins, globals, **kw-style function arguments, and even object attributes!
Also, what will happen in your read/write lock if you read-lock the dictionary first then attempt a write-lock?
The implementation idea itself is simple: I've added a "reader count" variable protected by a separate lock. When the reader count goes from 0 to 1 the write lock is grabbed, and when it goes back to 0 the write lock is released. So while some thread is executing a read-only operation the dictionary is still write-locked, and any other thread that wants to change the dictionary has to wait on that write lock. The effect I was aiming for is removing contention on the write lock for threads doing read-only work.
So that's the reason why I did not change the "locking protocol" (the tp_lock/tp_unlock attributes) - they are left as general, write-level locks.
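For readers following along, here is a minimal sketch of the scheme being described, assuming pthreads (the field and function names are illustrative, not the actual patch):

```c
#include <pthread.h>
#include <semaphore.h>

/* A semaphore is used for the writer side because in the
   "first reader locks, last reader unlocks" pattern the release can
   happen on a different thread than the acquire, which is undefined
   for an ordinary pthread mutex. */
typedef struct {
    pthread_mutex_t reader_mutex;  /* protects reader_count              */
    sem_t           write_lock;    /* binary semaphore, initialized to 1 */
    int             reader_count;
} dict_rwlock;

static void dict_read_lock(dict_rwlock *l)
{
    pthread_mutex_lock(&l->reader_mutex);
    if (++l->reader_count == 1)
        sem_wait(&l->write_lock);      /* first reader blocks writers */
    pthread_mutex_unlock(&l->reader_mutex);
}

static void dict_read_unlock(dict_rwlock *l)
{
    pthread_mutex_lock(&l->reader_mutex);
    if (--l->reader_count == 0)
        sem_post(&l->write_lock);      /* last reader lets writers in */
    pthread_mutex_unlock(&l->reader_mutex);
}

/* Writers keep using the ordinary protocol: sem_wait(&l->write_lock)
   to lock, sem_post(&l->write_lock) to unlock. */
```

Note that while any reader is active the writer-side lock stays held on behalf of all readers, which is exactly the "dictionary is still write-locked" behaviour described above.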
Are you willing to maintain this as a separate PR for a while, until the gilectomy branch can run a wider range of programs, and we can get a sense for how important this is in the general case?
I think I will be able to keep this up to date with other development - I don't think dictobject will change a lot, will it? :)
I've implemented quite a basic locking here, but I believe it should make a difference for any program that accesses essentially any unicode-keyed dictionaries - those include builtins, globals, **kw-style function arguments and even object attributes!
Of course it does! But is that difference positive or negative? ;-)
In your version, dicts are now 8 bytes bigger (reader lock and reader count), and you've added some extra code including if statements. This isn't much overhead, but per-dict and per-dict-operation it adds up in a Python interpreter.
My gut instinct is that contention is low for read operations on nearly all dicts in the general case. So my guess is this isn't worth the tradeoff. But that's why I want to see this tested against more than x.py.
(True story: the original lock in dictobject.c was a read-write lock. But I changed it to a simpler lock, in part because of this gut instinct above.)
As far as my question goes, I meant: what if thread A grabs a read lock on dict D, then while still holding that read lock, attempts to grab a write lock on dict D? Because I bet I could make that happen from user code. And remember my mantra: "any time there is behavior in Python, somebody depends on it."
As far as my question goes, I meant: what if thread A grabs a read lock on dict D, then while still holding that read lock, attempts to grab a write lock on dict D?
That should not be possible, from what I gathered reading the source... That's the exact reason I restricted the RW lock to unicode-keyed dicts only: so I can be sure that a lookup in such a dictionary does not execute some random bytecode, only the unicode comparison function, which essentially compares memory byte by byte. If you can make the situation above happen then this might hang :) But I don't see how you could make it happen...
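If that situation could somehow be triggered, then in terms of the reader-count sketch above the hang would look like this (a hypothetical interleaving, all in one thread):

```c
/* Reusing dict_rwlock from the sketch above. */
static dict_rwlock lk;             /* write_lock initialized to 1 */

static void upgrade_deadlock(void)
{
    dict_read_lock(&lk);           /* reader_count 0 -> 1: this thread
                                      acquired write_lock on behalf of
                                      all readers                        */
    /* ... suppose a key comparison runs code that mutates the dict ... */
    sem_wait(&lk.write_lock);      /* waits for reader_count to reach 0,
                                      but this thread IS that reader:
                                      it hangs forever                   */
}
```

The reader count carries no notion of ownership, so a read-to-write upgrade on the same thread self-deadlocks; restricting the RW lock to unicode-keyed dicts is what is supposed to keep that code path unreachable.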
My gut instinct is that contention is low for read operations on nearly all dicts in the general case. So my guess is this isn't worth the tradeoff. But that's why I want to see this tested against more than x.py.
I fully agree that we'd need some more benchmarks, as x.py is not very representative... so I'll try to keep up with the gilectomy branch up to the point where we can run something wider, like the Grand Unified Benchmark or something.
I've been contemplating this patch for a while now. Today I did another thorough review. I think the patch isn't safe--and I'm starting to worry that the current dict implementation isn't safe, either. The problem occurs when a dict transitions from one "kind" to another--that is, from one lookup function to another.
Let's say we have a dict with only Unicode strings in it. For the sake of argument we'll declare that the current function is lookdict_unicode. At the moment there are no threads interacting with this dict.
- Thread 1 decides to do a SetItem on the dict, setting a non-Unicode key. It locks the dict (ma_lock).
- While the dict is still locked, Thread 2 decides to do a GetItem on the dict. It calls into the current lookup function, lookdict_unicode. This calls dict_lock_readonly. In your current implementation, this will lock ma_readerlock, then wait on ma_lock.
- Thread 1 then finishes its insert and releases ma_lock.
- Thread 2 now acquires ma_lock. dict_lock_readonly finishes, and lookdict_unicode proceeds to examine the dict, even though the dict is no longer a Unicode-only dict. If it encounters the non-Unicode key as it probes the dict, it will very likely crash.
This also made me realize that ma_keys->dk_lookup is mutable state, and might need some synchronization around it.
Indeed... Though this particular unsafety is easy to fix, I think - we just need to check again for a lookup-function change after we have taken ma_lock. I guess we'd only need to check it in the Unicode version of the lookup, as I think a dict typically doesn't transition from non-string to string lookup except on copying...
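A sketch of that re-check; the control flow and the dict_lookup_general fallback are hypothetical, with names loosely following dictobject.c:

```c
/* Hypothetical shape of the fix: after dict_lock_readonly() returns,
   verify the dict is still unicode-keyed before probing it. */
static PyDictKeyEntry *
lookdict_unicode_checked(PyDictObject *mp, PyObject *key, Py_hash_t hash)
{
    dict_lock_readonly(mp);      /* may block while a writer holds ma_lock */
    if (mp->ma_keys->dk_lookup != lookdict_unicode) {
        /* The dict stopped being unicode-only while we waited: drop the
           read lock and retry under the general (write-level) protocol. */
        dict_unlock_readonly(mp);
        return dict_lookup_general(mp, key, hash);   /* hypothetical */
    }
    /* ... probe loop elided: every key is still a unicode object ... */
}
```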
This seems like a good idea in principle; have you had any luck making it thread-safe?
How about a more radical approach, where the object is not locked at all for read-only access?
Objects are only locked if a thread attempts to write to the object. This would speed things up far more than the reported 20% and ensure correct behaviour.
This would give two states:
- Read-only, non-locking, as no writes were attempted (default, very fast)
- Write/Read on an object (slower due to locking)

This means that lookups can be made very much faster, and if writes can be buffered we can get very efficient run-time performance.
I was wondering if the locking could be done on groups of elements to improve efficiency? Probably the only way to do that would be to divide a list into smaller lists, with the associated storage overhead that incurs. I am assuming, however, that with a long enough list this could scale much better than globally locking the whole thing. It also means that if a smaller section of the list is changed, not every part would need to be reallocated. So that would greatly improve list write speed, lookup speed, and locking performance with more cores and longer lists, at the cost of memory overhead. Assuming the number of blocks equals the number of CPUs, this overhead could be fairly minimal and still give a significant boost to performance.
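Roughly what that striping could look like (a sketch, not a worked-out design; NSEGS and the block granularity are arbitrary here):

```c
#include <Python.h>
#include <pthread.h>

#define NSEGS 4   /* e.g. one segment per core; tuning is an open question */

/* Hypothetical striped list: one mutex per contiguous block of slots. */
typedef struct {
    PyObject      **items;
    Py_ssize_t      size;
    pthread_mutex_t seg_lock[NSEGS];
} striped_list;

static pthread_mutex_t *
seg_for_index(striped_list *l, Py_ssize_t i)
{
    Py_ssize_t block = (l->size + NSEGS - 1) / NSEGS;  /* slots per segment */
    return &l->seg_lock[i / block];
}

static void
striped_setitem(striped_list *l, Py_ssize_t i, PyObject *v)
{
    pthread_mutex_t *m = seg_for_index(l, i);
    pthread_mutex_lock(m);      /* other segments stay accessible in parallel */
    l->items[i] = v;            /* refcounting elided for brevity */
    pthread_mutex_unlock(m);
}
```

The catch is that anything which shifts elements across segments (insert, delete, resize) would need to take all the segment locks, so the win is mostly for in-place reads and writes.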
You cannot omit locking completely; you still need some synchronization mechanism. To give you a hint, ask yourself: how would your code behave if there are two threads - one reading and one writing - and the reading thread started its read first, and then the writing thread came and changed something? If you don't do any synchronization, your reading thread might crash, read garbage, or work - depending on how lucky you are. Not a good symptom for any program, but especially for core features of a runtime, eh?..
First situation: two concurrent threads access an element almost simultaneously; one writes to an atomic bool to lock the data, then proceeds to write to the data. The reading thread does an atomic load of the bool, checks whether it is locked, and waits until the data is unlocked to read.
Second situation: two concurrent threads access an element almost simultaneously; one gets halfway through reading when the other thread takes the lock (a torn read). The lock is checked again by the reading thread at the end of its read (cheap) and is found to have changed. The read value is discarded and the reader waits on the writing lock before attempting to read again. This repeats until a valid read occurs (defined by no change in the lock being detected).
We can check for no change of state by incrementing an integer when locking and unlocking (even numbers mean unlocked, odd numbers mean a write is in progress), so a thread checks that the number is unchanged before accepting a read value.
Third situation: two concurrent threads access an element and do not change its state; no locks are used.
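That is essentially a seqlock; a minimal C11 sketch of the retry-on-change read (assuming a single writer, or an external lock serializing writers):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint seq;   /* even = idle, odd = write in progress   */
    int         value; /* plain data only -- no pointers         */
} seqcell;

static void seq_write(seqcell *c, int v)
{
    atomic_fetch_add(&c->seq, 1);    /* now odd: readers will retry */
    c->value = v;
    atomic_fetch_add(&c->seq, 1);    /* even again: reads are valid */
}

static int seq_read(seqcell *c)
{
    unsigned s1, s2;
    int v;
    do {
        s1 = atomic_load(&c->seq);
        v  = c->value;               /* may be torn; detected below */
        s2 = atomic_load(&c->seq);
    } while ((s1 & 1) || s1 != s2);  /* retry if a write was in flight */
    return v;
}
```

Strictly speaking, the plain read of value during a concurrent write is a data race in C11 terms; real implementations add fences or per-word atomic accesses around the data. And as comes up below, this only works for plain data, not pointers.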
A very interesting article about how to implement a faster solution to the RW-lock problem we're having. https://locklessinc.com/articles/locks/
Note especially the last solution, at the bottom of the page.
Public-domain code (as in, no explicit licence), linked from the comments on the article above: https://github.com/wiredtiger/wiredtiger/blob/develop/src/support/mtx_rw.c
A crude prototype in C implementing a "dumb spinlock" with "back-off" for threads querying the lock: the wait starts at 1 CPU cycle and then, while the data stays locked, grows in 75-cycle increments. Repeated lock queries eat memory bandwidth, which strangles multi-core chips, so backing off cuts that traffic. 75 cycles seems reasonable given the cost of these queries and the latencies involved in updating the atomic variables which gate access to the data.
An order-of-magnitude justification for ~75 cycles was found here: http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
This was done on a 4-core i5 4790K, a chip without HT, using pthreads for threading. Surprisingly, the single-threaded case with the lock outperformed the case without it.
In case you want to try it, here's the prototype: https://gist.github.com/joshring/b321fa162d19f8006c0c5d5395965f0b
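For reference, a sketch of the backoff idea described above (x86-specific via _mm_pause; this is not the linked gist):

```c
#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause() */

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

static void backoff_lock(void)
{
    unsigned delay = 1;                       /* start at ~1 cycle           */
    while (atomic_flag_test_and_set_explicit(&lock_flag,
                                              memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            _mm_pause();                      /* cheap wait, low bus traffic */
        delay += 75;                          /* back off in ~75-cycle steps */
    }
}

static void backoff_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```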
I have created a prototype gist with "lock-free" reading, which checks that the state did not change during the read and re-reads if it did. Writes are locking.
Reading performance gets higher the more cores you throw at it; writing shows the opposite trend.
https://gist.github.com/joshring/5c9ebe0654085b68dcc1fd03bc1c1c2e
To me, what you're describing looks like a simple versioning scheme. Keeping a forever-atomically-increasing version field seems better to me (no proofs, just a gut feeling).
Feel free to hack on my approach - I think what I'm missing in this PR right now is guarding against lookup field change as noted by @larryhastings (see above), otherwise it should be sane enough.
Atomically increasing is fine, but the performance of atomic operations is really horrible, so if we can avoid them in a "versioning scheme" it drastically improves performance. Even in C, factors of >50x are typical for reading; I have seen as much as 140x.
To me, what you're describing looks like a simple versioning scheme. Keeping a forever-atomically-increasing version field seems better to me (no proofs, just a gut feeling).
Challenge accepted: it seems I was using a "Seqlock"-like approach: https://en.wikipedia.org/wiki/Seqlock
This is a worst-case benchmark of a highly contended lock which we wish to read and write from multiple threads. The "Seqlock" has very cheap reads, so it scales very well with threads, even with writes.
The standard OpenMP lock was used in two configurations: with the read inside the writing lock, and with the read under a separate lock. In both cases the seqlock, with its read outside of any lock, was able to outperform the OpenMP lock.
The code has now been significantly cleaned up and improved: Code: https://gist.github.com/joshring/b1b42841f14abe76999f448348256638
From the Wikipedia page on seqlocks:
"One subtle issue of using seqlocks for a time counter is that it is impossible to step through it with a debugger. The retry logic will trigger all the time because the debugger is slow enough to make the read race occur always."
Uh, I don't want to make it impossible to debug race conditions in Python.
"It should also be noted that the technique will not work for data that contains pointers, [...]"
Dict objects contain pointers.
Debugging appears to work just fine in this implementation.
EDIT: I only meant it was similar to the Seqlock, but it appears different internally.
Assuming the pointers are updated atomically, it should be OK to do this with any lock we choose.
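A sketch of what "updated atomically" means here in C11 terms (the slot is a hypothetical stand-in for a dict entry):

```c
#include <Python.h>
#include <stdatomic.h>

static _Atomic(PyObject *) slot;   /* hypothetical dict-entry slot */

static void publish(PyObject *new_obj)
{
    /* release: the object's contents become visible before the pointer */
    atomic_store_explicit(&slot, new_obj, memory_order_release);
}

static PyObject *snapshot(void)
{
    /* acquire: the reader never observes a torn or half-built pointer */
    return atomic_load_explicit(&slot, memory_order_acquire);
}
```

That rules out torn pointers, but not use-after-free: the object an old pointer referred to can still be freed while a reader holds it, which is essentially the lifetime problem raised next.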
In the general locking case this could still be a problem? Examine the following:
- Lock set, Thread 1 reads pointerA, lock unset
- Lock set, Thread 2 writes pointerA (Thread 1's copy is atomically updated), lock unset

The pointer Thread 1 has read is invalid while the update is occurring, potentially invalidating the pointer at least temporarily. I think the problem of pointers is a very general one for multi-threaded locking code. What would you suggest to avoid this issue with pointers in a multi-threaded environment? You could serialise all access to dicts, but that would be significantly slower than single-core execution.
I am not sure about the debugging comment, since it appears to be specific to the Linux kernel's "time counter" implementation; I will do some experimentation. Race conditions are known to be difficult to detect generally, regardless of locking technique, so there is no "silver bullet" here.