
`concurrent.interpreters.Queue` performance is bound to the polling delay

x42005e1f opened this issue 3 weeks ago · 7 comments

Bug report

Bug description:

Currently, all blocking operations in `concurrent.interpreters.Queue` work via polling. In general this approach is satisfactory for an initial implementation (unfortunately, polling is a cancer of the asynchronous world), but it suffers from one very unpleasant performance issue: the maximum number of operations per second is effectively bounded by the chosen polling delay:

>>> # `concurrent.interpreters.Queue`: a simple echo benchmark
>>> from concurrent import interpreters
>>> from timeit import timeit
>>> iterations = 100
>>> in_q = interpreters.create_queue()
>>> out_q = interpreters.create_queue()
>>> out_q.put(None)
>>> def echo(in_q, out_q):
...     while True:
...         out_q.put(in_q.get())
>>> interpreters.create().call_in_thread(echo, out_q, in_q)
>>> iterations / timeit(lambda: out_q.put(in_q.get()), number=iterations)
61.944790966792546  # OPS, >10 milliseconds per operation (`_delay=10 / 1000`)
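
To illustrate where the bound comes from, here is a hypothetical sketch of a polling-based blocking get (the function name and loop structure are illustrative, not CPython's actual implementation): whenever the item arrives while the waiter is asleep, the waiter still pays the remainder of the sleep, so throughput is capped near 1 / delay operations per second.

```python
import queue
import threading
import time

def polling_get(q, delay=0.010):
    # Hypothetical polling loop: check, and if empty, sleep for `delay`.
    # Each miss adds up to `delay` of latency to the operation.
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            time.sleep(delay)

q = queue.SimpleQueue()

# Item already present: returns without sleeping.
q.put("ready")
start = time.monotonic()
polling_get(q)
fast = time.monotonic() - start

# Item arrives 1 ms into the wait: the caller still sleeps the full
# 10 ms, which is what caps the echo benchmark near 1 / delay ops/s.
threading.Timer(0.001, q.put, args=["late"]).start()
start = time.monotonic()
polling_get(q)
slow = time.monotonic() - start
```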

The delay is set by the undocumented `_delay` parameter and is essentially a compromise between processor load and response latency (which I also described in a comment on one PR). Neither `queue` queues nor `multiprocessing` queues use polling, and they therefore have much higher performance:

>>> # `queue.SimpleQueue`: a simple echo benchmark
>>> from queue import SimpleQueue
>>> from threading import Thread
>>> from timeit import timeit
>>> iterations = 100_000
>>> in_q = SimpleQueue()
>>> out_q = SimpleQueue()
>>> out_q.put(None)
>>> def echo(in_q, out_q):
...     while True:
...         out_q.put(in_q.get())
>>> Thread(target=echo, args=[out_q, in_q]).start()
>>> iterations / timeit(lambda: out_q.put(in_q.get()), number=iterations)
154248.64387142067  # OPS, <10 microseconds per operation
>>> # `queue.Queue`: a simple echo benchmark
>>> from queue import Queue
>>> from threading import Thread
>>> from timeit import timeit
>>> iterations = 100_000
>>> in_q = Queue()
>>> out_q = Queue()
>>> out_q.put(None)
>>> def echo(in_q, out_q):
...     while True:
...         out_q.put(in_q.get())
>>> Thread(target=echo, args=[out_q, in_q]).start()
>>> iterations / timeit(lambda: out_q.put(in_q.get()), number=iterations)
37413.69358798779  # OPS, <50 microseconds per operation
>>> # `multiprocessing.Queue`: a simple echo benchmark
>>> from multiprocessing import Process, Queue, set_start_method
>>> from timeit import timeit
>>> set_start_method("fork")
>>> iterations = 10_000
>>> in_q = Queue()
>>> out_q = Queue()
>>> out_q.put(None)
>>> def echo(in_q, out_q):
...     while True:
...         out_q.put(in_q.get())
>>> Process(target=echo, args=[out_q, in_q]).start()
>>> iterations / timeit(lambda: out_q.put(in_q.get()), number=iterations)
13411.797245613465  # OPS, <100 microseconds per operation

I marked this as a bug, since such low performance can hardly be considered expected behavior for a lightweight alternative to multiprocessing. It would be much better to implement truly blocking behavior.
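
For comparison, the non-polling approach can be sketched with a condition variable, in the spirit of `queue.Queue` (this is a minimal illustration, not CPython's implementation): a blocked `get()` sleeps on the condition and is woken directly by `put()`, so wakeup latency is independent of any polling delay.

```python
import threading
import time
from collections import deque

class BlockingQueue:
    # Minimal sketch of truly blocking behavior: no sleep/poll loop,
    # waiters are woken directly by producers.
    def __init__(self):
        self._items = deque()
        self._not_empty = threading.Condition()

    def put(self, item):
        with self._not_empty:
            self._items.append(item)
            self._not_empty.notify()  # wake one blocked get()

    def get(self):
        with self._not_empty:
            while not self._items:       # guard against spurious wakeups
                self._not_empty.wait()   # blocks without polling
            return self._items.popleft()

q = BlockingQueue()
threading.Timer(0.001, q.put, args=["hello"]).start()
start = time.monotonic()
item = q.get()       # wakes as soon as put() runs, not after a 10 ms poll
elapsed = time.monotonic() - start
```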

CPython versions tested on:

3.14

Operating systems tested on:

Linux

x42005e1f · Dec 20 '25 08:12