Add support for dropping the GIL while performing COBS operations
In my application, I sometimes decode very large (multi-megabyte) COBS packets while also handling soft-realtime IO operations where a missed deadline means an application-level failure. I would be able to eliminate these failures if I were able to offload COBS operations to another thread, but currently the C extension holds the GIL while decoding. If it released the GIL, I would get much lower latency in my application.
I would consider doing this.
I'm not familiar with the topic, so I'm unsure as to what the design considerations are.
I had mildly considered this in the past; however, I thought releasing the GIL was mostly advised for scenarios in which the processor is waiting on I/O, and less likely to be useful when it is doing something computationally intensive. I didn't think it likely that anyone would use COBS on packets so large that releasing the GIL would be beneficial.
But it sounds as though yours is a case that breaks all my previous assumptions.
I assume for many/most users who operate on smaller packets, releasing and reacquiring the GIL could impose a performance penalty.
If the C extension were to release the GIL, what would ensure the source byte-string or other bytes-object would not be modified by any other Python thread during the COBS encode/decode operation?
I had mildly considered this in the past; however, I thought releasing the GIL was mostly advised for scenarios in which the processor is waiting on I/O, and less likely to be useful when it is doing something computationally intensive.
In both of these cases you are officially encouraged to drop the GIL (numpy does this, for example).
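For reference, the pattern a C extension uses for this is the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pair around a section that makes no Python API calls. A minimal sketch (the function and names are purely illustrative, not anything from the cobs extension):

```c
/* Minimal sketch of the numpy-style pattern: release the GIL around a
 * pure-C, CPU-bound section.  example_checksum() and its helper are
 * illustrative only, not part of the cobs extension. */
#include <Python.h>

static unsigned long
checksum(const unsigned char *buf, Py_ssize_t len)
{
    /* Pure C work; no Python API calls are allowed while the GIL is released. */
    unsigned long sum = 0;
    for (Py_ssize_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

static PyObject *
example_checksum(PyObject *self, PyObject *arg)
{
    Py_buffer view;
    unsigned long sum;

    if (PyObject_GetBuffer(arg, &view, PyBUF_SIMPLE) < 0)
        return NULL;

    Py_BEGIN_ALLOW_THREADS              /* other Python threads may run now */
    sum = checksum(view.buf, view.len);
    Py_END_ALLOW_THREADS                /* reacquire before touching Python objects */

    PyBuffer_Release(&view);
    return PyLong_FromUnsignedLong(sum);
}
```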
I didn't think it likely that anyone would use COBS on packets so large that releasing the GIL would be beneficial.
In my application, I plan to use COBS as the primary framing mechanism for all non-trivial communication with the hardware (for which I also developed a high-performance COBS encoder/decoder). It is extremely important to me that it is as high-throughput and low-latency as possible. The packet sizes are essentially unbounded (well, bounded by RAM); I can envision receiving a 100 MB packet in some circumstances. Of course, right now this would be pretty bad latency-wise since decoding isn't incremental, but we can cross that bridge when we have to.
I assume for many/most users who operate on smaller packets, releasing and reacquiring the GIL could impose a performance penalty.
In most modern implementations, taking an uncontended lock is essentially free unless you're doing it in a busy loop (which isn't the case if we're executing Python bytecode already), so this shouldn't affect single-threaded applications at all. I don't think there's going to be much of a hit to multithreaded applications either, but that is something I would want to measure before being confident about it.
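And if measurement did show a hit for tiny packets, the release could simply be gated on input size. A rough sketch, where the cutoff value and `cobs_encode_raw()` are placeholders I'm making up, not names or numbers from the real extension:

```c
/* Sketch only: gate the GIL release on input size.  GIL_RELEASE_THRESHOLD is
 * an arbitrary placeholder that would need measuring, and cobs_encode_raw()
 * stands in for a pure-C encode routine. */
#include <Python.h>

#define GIL_RELEASE_THRESHOLD 4096      /* bytes; made-up cutoff */

Py_ssize_t cobs_encode_raw(unsigned char *dst, const unsigned char *src, Py_ssize_t len);

/* Caller must hold the GIL on entry. */
static Py_ssize_t
encode_maybe_nogil(unsigned char *dst, const unsigned char *src, Py_ssize_t len)
{
    Py_ssize_t out_len;

    if (len >= GIL_RELEASE_THRESHOLD) {
        Py_BEGIN_ALLOW_THREADS
        out_len = cobs_encode_raw(dst, src, len);
        Py_END_ALLOW_THREADS
    }
    else {
        /* small packet: the release/reacquire isn't worth it, keep the GIL */
        out_len = cobs_encode_raw(dst, src, len);
    }
    return out_len;
}
```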
If the C extension were to release the GIL, what would ensure the source byte-string or other bytes-object would not be modified by any other Python thread during the COBS encode/decode operation?
I would do something like...

- allocate a new bytearray, large enough to hold both source and destination; you are now its sole owner, so you can freely modify its data without risk of races
- memcpy the data into the new object
- release GIL
- encode/decode in-place
- acquire GIL
- return the new object
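A rough sketch of that sequence, for a decode call. All the names here are made up for illustration (this is not the extension's actual code), and the in-place decoder is just the textbook COBS loop with no input validation:

```c
/* Rough sketch of the steps above.  decode_nogil() and cobs_decode_in_place()
 * are illustrative names only. */
#include <Python.h>
#include <string.h>

/* Textbook COBS decode done in place (output never exceeds input length).
 * No input validation, unlike a real implementation. */
static Py_ssize_t
cobs_decode_in_place(unsigned char *buf, Py_ssize_t len)
{
    Py_ssize_t r = 0, w = 0;
    while (r < len) {
        unsigned char code = buf[r++];
        for (unsigned char i = 1; i < code && r < len; i++)
            buf[w++] = buf[r++];        /* safe: w always trails r */
        if (code != 0xFF && r < len)
            buf[w++] = 0;               /* implied zero between groups */
    }
    return w;
}

static PyObject *
decode_nogil(PyObject *self, PyObject *arg)
{
    Py_buffer src;
    PyObject *out;
    Py_ssize_t decoded_len;

    if (PyObject_GetBuffer(arg, &src, PyBUF_SIMPLE) < 0)
        return NULL;

    /* 1. allocate a fresh bytearray we are the sole owner of
     *    (decoding shrinks the data, so the source size is enough) */
    out = PyByteArray_FromStringAndSize(NULL, src.len);
    if (out == NULL) {
        PyBuffer_Release(&src);
        return NULL;
    }

    /* 2. copy the caller's data into it while we still hold the GIL */
    memcpy(PyByteArray_AS_STRING(out), src.buf, src.len);
    PyBuffer_Release(&src);

    /* 3-5. drop the GIL, transform the private copy in place, reacquire */
    Py_BEGIN_ALLOW_THREADS
    decoded_len = cobs_decode_in_place(
        (unsigned char *)PyByteArray_AS_STRING(out),
        PyByteArray_GET_SIZE(out));
    Py_END_ALLOW_THREADS

    /* 6. shrink to the decoded length and return the new object */
    if (PyByteArray_Resize(out, decoded_len) < 0) {
        Py_DECREF(out);
        return NULL;
    }
    return out;
}
```

The memcpy happens while the GIL is still held, so no other Python thread can mutate the source mid-copy; after that, the C code only ever touches its own private bytearray.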
- memcpy the data into the new object
Wouldn't the extra memcpy make the algorithm considerably slower?
It is extremely important to me that it is as high-throughput and low-latency as possible.
I wrote a Rust crate for cobs. :)
A memcpy can usually be done at close to memory bandwidth, and it is vectorized. Since your algorithm is scalar, there is no way it will run at a comparable speed, so the COBS pass, not the copy, will take up the bulk of the runtime. This should not be difficult to measure. (Also, unless I'm missing something, you'd be making a copy either way?)
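If you want to sanity-check that, here's a throwaway standalone benchmark (nothing to do with the extension itself); the multiply-accumulate loop has a loop-carried dependency as a crude stand-in for a serial, byte-at-a-time pass:

```c
/* Toy benchmark: memcpy vs. a serial byte-at-a-time pass over the same
 * buffer.  A real measurement would average several runs and control for
 * caches; this is only a rough sanity check. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = (size_t)100 * 1024 * 1024;   /* ~100 MB, like the large packets above */
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (!src || !dst)
        return 1;
    memset(src, 0x5A, n);

    double t0 = now_sec();
    memcpy(dst, src, n);
    double t1 = now_sec();

    unsigned char acc = 1;
    for (size_t i = 0; i < n; i++)
        acc = (unsigned char)(acc * 31 + src[i]);   /* serial dependency chain */
    double t2 = now_sec();

    printf("memcpy: %.3f s   scalar pass: %.3f s   (acc=%u)\n",
           t1 - t0, t2 - t1, (unsigned)acc);
    free(src);
    free(dst);
    return 0;
}
```

On typical hardware I'd expect the copy to come out far cheaper than the serial pass, but compilers and memory systems vary, so it's worth running rather than assuming.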
Re Rust: I've considered using https://crates.io/crates/cobs, but it raises questions about deployment (e.g. I have a Pyodide-based version) that I'm not ready to deal with yet.