atomics
Performance rather slow
First of all many thanks for your great library. It's the only one I've found that can do atomic operations on a shared memory in Python. I'm using it for locking in my dictionary implementation that uses shared memory.
Though, I was wondering what kind of performance can be expected? In the GitHub README you have an incorrect example with a counter going up to 10 million and a correct one that uses the atomics library (but its counter only goes up to 10k).
The incorrect one takes 0.34 s on my computer, while the correct one using the atomics library takes 41.38 s if I count to the same target of 10 million.
If I fix the incorrect one in the classical way using threading.RLock(), it takes 3.68 s, which is still a factor of 10 faster.
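For reference, the RLock-based variant can be reproduced with a sketch like this (the thread count and iteration total below are assumptions, scaled down from the 10 million in the original benchmark so it runs quickly):

```python
import threading
import time

TOTAL = 200_000      # assumed; the original benchmark counted to 10 million
N_THREADS = 4        # assumed thread count

counter = 0
lock = threading.RLock()

def work(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:   # classical locking around the increment
            counter += 1

threads = [threading.Thread(target=work, args=(TOTAL // N_THREADS,))
           for _ in range(N_THREADS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(counter, elapsed)
```

Because every increment happens under the lock, the final count is always exact, unlike the "incorrect" README example.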
In my dictionary implementation, I'm using AtomicBytesView.exchange() to set my shared lock, because doing the comparison myself seemed faster than using cmpxchg_weak(). I've also found that load() seems rather slow.
Do you have any idea how to get performance similar to RLock() when using atomics to set a lock flag in shared memory? This is basically setting a byte to 1 or 0.
Hi,
I'm honestly happy that anyone's using this library; I started it as a personal project for my own stuff, and figured it would be too specific a niche for most people.
You're right that the current implementation is very slow. That's entirely because it's implemented in pure Python with ctypes calls to the C library. While developing this library, I wrote another version using CPython's C API (and maybe one using Cython too?), which was significantly faster. To give an example, .fetch_add(n), when written in C, was twice as fast as Python's int type doing += n.
I decided to go with the pure Python implementation because it was much more portable. The C library patomic will compile with any ANSI C compiler on any architecture, and atomics only requires Python 3.6 or higher.
However, since you've noticed that it's very slow, I'm happy to bring out my C implementation. The only downside is that it might take me around a month, since I'm quite busy at the moment. Hopefully you're not in a huge rush (it's a little unfortunate that there's nothing you can do on your end to speed it up).
Hi, thanks for your quick response.
I'm not in a rush, so no worries. I've also explored Cython while developing my shared dict, and I guess the ideal situation would be a pure Python implementation that optionally provides a (Cython?) C extension to improve performance where available. Though, I understand this means extra effort in maintaining two implementations, testing them, etc.
Again, many thanks and looking forward to the C implementation. Happy to help testing or whatever is necessary :)
@ronny-rentner did you have any concerns about the API? I'm rewriting it over the next week, so if you felt that some part of it could be improved, now would be the best time to have a breaking change (not that an improvement needs to be breaking).
Hey,
if you check out https://github.com/ronny-rentner/UltraDict/blob/main/UltraDict.py#L64 you can see how I use your library for my SharedLock. My main use case for atomics is to have a fast inter process lock in shared memory. For shared locks, I need an atomic "test and set" operation.
In the beginning I was using it as a context manager, but then realized that has a heavy performance impact. So I now call the __enter__() method manually to obtain the context (called self.lock_atomic in the SharedLock) and also close the context manually on exit.
The two methods that use the context are test_and_inc() and test_and_dec(), even though they actually work the other way round: they first inc and then test.
atomics says it's lock-free, but I have no clue how to make my test_and_inc() method block until it's possible to inc from zero to one, so I run a busy-wait loop. I thought maybe I could use cmpxchg_weak, but in your example it also looks like it would busy wait:
```python
import atomics

def atomic_mul(a: atomics.INTEGRAL, operand: int):
    res = atomics.CmpxchgResult(success=False, expected=a.load())
    while not res:
        desired = res.expected * operand
        res = a.cmpxchg_weak(expected=res.expected, desired=desired)
```
What I am doing is this:
```python
def test_and_inc(self):
    old = self.lock_atomic.exchange(b'\x01')
    if old != b'\x00':
        # Oops, someone else was faster than us
        return False
    return True
```
Let me know if this helps in improving the API.
One big issue in Python is that you cannot wait for nanoseconds. The smallest interval for time.sleep() depends on the specific OS and kernel, but it can be as much as 10 ms. Only in the next Python version will there be an improvement.
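The granularity point is easy to check with a small measurement (the 1 ms request below is just an illustrative value):

```python
import time

# Ask for a 1 ms sleep and measure how long it actually takes.
requested = 0.001
start = time.perf_counter()
time.sleep(requested)
elapsed = time.perf_counter() - start
# time.sleep() never returns early, and on some OS/kernel combinations
# the overshoot can be several milliseconds.
print(f"requested {requested * 1000:.1f} ms, slept {elapsed * 1000:.3f} ms")
```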
It sounds like you want a lock; the issue being that atomic types (at least in this library) are intended to be lock-free, so they will always busy wait.
I think an ideal solution for that would combine atomics, multiprocessing.shared_memory, and a NamedSemaphore. You would create the shared memory and put an atomic object at the start of it. You would then put the handle of a NamedSemaphore (e.g. HANDLE on Windows, int on Unix) as the next object in your shared memory, using the atomic object to avoid any races while doing this. You could then use the semaphore as normal (the setup is a little more complicated, but that's the gist of it).
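A rough sketch of that memory layout (the offsets, sizes, and placeholder handle value are assumptions, and a plain byte write stands in for the atomic flag):

```python
import struct
from multiprocessing import shared_memory

FLAG_OFFSET = 0    # assumed: 1-byte "handle is published" flag at the start
HANDLE_OFFSET = 8  # assumed: 8-byte semaphore handle stored after it

shm = shared_memory.SharedMemory(create=True, size=16)
try:
    # The creating process publishes the semaphore handle...
    struct.pack_into("q", shm.buf, HANDLE_OFFSET, 42)  # 42 is a placeholder
    # ...then sets the flag; in real code this write would use the
    # atomic object so readers never observe a half-written handle.
    shm.buf[FLAG_OFFSET] = 1
    # Another process, after seeing the flag, reads the handle back.
    (handle,) = struct.unpack_from("q", shm.buf, HANDLE_OFFSET)
    print(handle)
finally:
    shm.close()
    shm.unlink()
```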
Unfortunately, I can't find a portable implementation of NamedSemaphore in Python (which is surprising). An implementation would need a .handle(), .native_handle(), or .name() method so that you could store that in shared memory. Since such an implementation doesn't exist, I'd be happy to write one after I'm finished porting this library (or you could write one if you want).
Other than that, did you have any issues with the existing API that could be improved? Or should I keep it the same?
Thanks, you can keep it the same; maybe add a nicer name for __enter__(), but that's just a detail.
Hey, any update on this? :)
I am interested too 😊