smart_open
smart_open copied to clipboard
Fix quadratic time ByteBuffer operations
bytes is immutable, so repeated appends is quadratic-time.
bytearray is the mutable version of bytes.
Cool! Are you able to run any benchmarks?
$ python benchmark/bytebuffer_bench.py
0.0012826919555664062
$ git stash; git checkout HEAD^; git stash pop
$ python benchmark/bytebuffer_bench.py
0.13568520545959473
Benchmark is
import time
from smart_open.bytebuffer import ByteBuffer
buffer = ByteBuffer()
start = time.time()
for _ in range(1000):
assert buffer.fill([b"X" * 1000]) == 1000
print(time.time() - start)
In terms of production code, I found this is a significant speedup for large unpickles.
A benchmark on some real smart_open use-case (closer to what users see / care about) would be great. I am myself curious of the difference.
$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 0.012549877166748047
File read benchmark 1.083221435546875
$ git revert 8cc7654580b207dc7a684fab856ef1a708b106b1
[develop a5fc7a7] Revert "Fix quadratic time ByteBuffer operations"
$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 21.479155778884888
File read benchmark 25.684982776641846
BENCHMARK_FILE is any large S3 file you have a fast connection to.
Merged! Thank you.
Would it be possible to get a release containing this fix so it's easier for us to upgrade? Thanks in advance!
OK, done.
https://pypi.org/project/smart-open/6.2.0/
Thank you!