streamz Speed tests with a Stream?

I am trying to measure how much slower an operation becomes when a Stream is introduced into the code. Both examples below are standalone scripts, run from the command line.

A 'plain' Python example:

from timeit import default_timer as timer
LIMIT = 100000

def sequencer(limit=LIMIT):
    num = 0
    while num < limit:
        yield num
        num += 1

L = []
start = timer()
for i in sequencer():
    L.append(i)

end = timer()
print(end - start)

This takes about 17 milliseconds to run.

An attempt to replicate the above with streamz:

from timeit import default_timer as timer
from streamz import Stream
LIMIT = 100000

def sequencer(limit=LIMIT):
    num = 0
    while num < limit:
        yield num
        num += 1

source = Stream.from_iterable(sequencer())
L = source.sink_to_list()

start = timer()
source.start()
while True:
    if len(L) >= LIMIT:
        break

end = timer()
print(end - start)

This takes about 1200 milliseconds to run. Is this the expected slowdown (roughly two orders of magnitude)?

I am not sure, though, whether the streamz code is correct. It's based on the examples in the docs, but those all seem geared towards use in the shell rather than in standalone scripts. If you omit the while True: section, no data ends up in the output list.

May 24 '21 09:05 gamesbook

The overhead of streamz is essentially from running async coroutines. Each task adds a small overhead to the call, so this will show up as significant in cases where the function itself is extremely fast - like this case. In all normal operation, ~10us of overhead per task would be totally negligible. You could use an event loop other than the standard asyncio one to mitigate the issue, if it actually is significant for you.
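As a rough check, the per-item overhead can be worked out from the two timings reported above. This is only a sketch using the numbers quoted in this thread (17 ms and 1200 ms for 100000 items), not a new measurement; the function name is made up for illustration.

```python
# Timings reported earlier in this thread (assumed, not re-measured here).
PLAIN_SECONDS = 0.017    # plain generator loop
STREAMZ_SECONDS = 1.2    # streamz from_iterable + sink_to_list
ITEMS = 100_000


def per_item_overhead_us(plain_s, stream_s, items):
    """Extra time per item, in microseconds, added by the stream."""
    return (stream_s - plain_s) / items * 1e6


overhead = per_item_overhead_us(PLAIN_SECONDS, STREAMZ_SECONDS, ITEMS)
slowdown = STREAMZ_SECONDS / PLAIN_SECONDS
print(f"{overhead:.1f} us per item, {slowdown:.0f}x slower overall")
```

That comes out at roughly 12 microseconds per item, which is in line with the ~10us per task mentioned above.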

May 24 '21 15:05 martindurant

@martindurant I agree this is unlikely to be at all significant in Real World cases. I was just trying to compare streamz to an in-house streaming framework and the "do almost nothing" case was the most straightforward. It's useful to know that the async coroutines are the contributors.

(You did not comment on the use of the while loop, so I assume this is the way to go?)

May 24 '21 18:05 gamesbook

You did not comment on the use of the while loop, so I assume this is the way to go?

I don't have anything against it in principle, except that repeatedly checking the list size may itself be slowing down execution (because of Python's GIL). Perhaps add a short sleep? That will add a small amount to the measured time, but that cost is spread across the many iterations of the loop.

May 24 '21 18:05 martindurant

I will try that.

I was just wondering if there was any other way to 'wait' for the stream to finish processing. In the console you don't need this and all the data from the source will end up in the sink; but when you run it as a script (without the while), the sink is just empty.

May 25 '21 05:05 gamesbook

I probably should have ended here by saying that the tests commonly use a wait_for function, which is essentially a sleep-check loop. In any case, I think this conversation came to a reasonable end? I wonder how well streamz stood against your in-house framework.
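For illustration, a sleep-check loop of that kind might look like the sketch below. It is not the helper from the streamz tests: the name wait_until, its parameters, and the timeout behaviour are all hypothetical, and a plain background thread stands in for the stream so the example runs without streamz installed.

```python
import threading
import time


def wait_until(predicate, timeout=5.0, period=0.001):
    """Poll predicate() until it is true, sleeping between checks.

    Hypothetical helper: returns True if the condition was met,
    False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            return False
        time.sleep(period)  # release the GIL so the producer can run
    return True


# Stand-in for a stream filling its sink from another thread.
results = []
producer = threading.Thread(target=lambda: results.extend(range(1000)))
producer.start()

finished = wait_until(lambda: len(results) >= 1000)
producer.join()
print(finished, len(results))
```

In the streamz script above, the same idea would presumably replace the while True: block with a call along the lines of wait_until(lambda: len(L) >= LIMIT) after source.start().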

Oct 08 '21 12:10 martindurant