smart_open Fix quadratic time ByteBuffer operations

Fix quadratic time ByteBuffer operations

Open Joshua-Landau-Anthropic opened this issue 3 years ago • 4 comments

trafficstars

bytes is immutable, so repeated appends is quadratic-time.

bytearray is the mutable version of bytes.

Jul 29 '22 10:07 Joshua-Landau-Anthropic

Cool! Are you able to run any benchmarks?

Jul 30 '22 12:07 mpenkov

$ python benchmark/bytebuffer_bench.py
0.0012826919555664062
$ git stash; git checkout HEAD^; git stash pop
$ python benchmark/bytebuffer_bench.py
0.13568520545959473

Benchmark is

import time
from smart_open.bytebuffer import ByteBuffer

buffer = ByteBuffer()

start = time.time()
for _ in range(1000):
    assert buffer.fill([b"X" * 1000]) == 1000
print(time.time() - start)

In terms of production code, I found this is a significant speedup for large unpickles.

Jul 31 '22 04:07 Joshua-Landau-Anthropic

A benchmark on some real smart_open use-case (closer to what users see / care about) would be great. I am myself curious of the difference.

Jul 31 '22 06:07 piskvorky

$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 0.012549877166748047
File read benchmark 1.083221435546875

$ git revert 8cc7654580b207dc7a684fab856ef1a708b106b1
[develop a5fc7a7] Revert "Fix quadratic time ByteBuffer operations"

$ python benchmark/bytebuffer_bench.py "$BENCHMARK_FILE"
Raw ByteBuffer benchmark: 21.479155778884888
File read benchmark 25.684982776641846

BENCHMARK_FILE is any large S3 file you have a fast connection to.

Aug 01 '22 07:08 Joshua-Landau-Anthropic

Merged! Thank you.

Aug 21 '22 13:08 mpenkov

Would it be possible to get a release containing this fix so it's easier for us to upgrade? Thanks in advance!

Sep 12 '22 22:09 nelhage

OK, done.

https://pypi.org/project/smart-open/6.2.0/

Sep 14 '22 02:09 mpenkov

Thank you!

Sep 14 '22 17:09 nelhage

smart_open smart_open copied to clipboard

Fix quadratic time ByteBuffer operations

smart_open
smart_open copied to clipboard