
gh-88500: Reduce memory use of `urllib.unquote`

gpshead opened this issue 3 years ago • 2 comments

`urllib.unquote_to_bytes` and `urllib.unquote` could both generate O(len(string)) intermediate bytes or str objects while computing the unquoted result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to use an expanding bytearray and a generator internally, instead of precomputed `split()`-style operations.
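To illustrate the approach (a minimal sketch of the technique, not CPython's actual code; the names `_unquote_chunks` and `unquote_to_bytes_sketch` are hypothetical): a generator lazily yields literal runs and decoded escape bytes one chunk at a time, and the caller accumulates them into a single expanding bytearray, so no list of split fragments is ever materialized.

```python
_HEXDIG = frozenset(b'0123456789abcdefABCDEF')

def _unquote_chunks(data: bytes):
    """Yield alternating literal runs and percent-decoded single bytes."""
    i, n = 0, len(data)
    while i < n:
        pct = data.find(b'%', i)
        if pct < 0:
            yield data[i:]                      # trailing literal run
            return
        if pct > i:
            yield data[i:pct]                   # literal run before the '%'
        hexpart = data[pct + 1:pct + 3]
        if len(hexpart) == 2 and hexpart[0] in _HEXDIG and hexpart[1] in _HEXDIG:
            yield bytes([int(hexpart, 16)])     # one decoded escape byte
            i = pct + 3
        else:
            yield b'%'                          # malformed escape: keep '%' verbatim
            i = pct + 1

def unquote_to_bytes_sketch(data: bytes) -> bytes:
    out = bytearray()                           # single expanding buffer
    for chunk in _unquote_chunks(data):
        out += chunk
    return bytes(out)
```

The generator keeps only the current chunk alive at any moment, so peak memory is bounded by the output buffer plus one chunk rather than by a full list of split pieces.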

  • Issue: gh-88500

Closes #88500.

gpshead avatar Sep 12 '22 08:09 gpshead

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or contain little Unicode or % escaping. But the functions are already quite fast, so this is not a big deal. The slowdown scales linearly with input size, as expected.
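A harness in the style described above (an assumed reconstruction, not the author's exact benchmark) times both functions on the antagonistic input from the comment:

```python
# Microbenchmark sketch: time unquote() / unquote_to_bytes() on an input
# that mixes non-ASCII text with valid and malformed percent escapes.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000   # the antagonistic input from the comment

for func in (unquote, unquote_to_bytes):
    # Take the best of several repeats to reduce timing noise.
    seconds = min(timeit.repeat(lambda: func(mess), number=200, repeat=5))
    print(f"{func.__name__}: {seconds / 200 * 1e6:.1f} us per call")
```

Running the same harness against both the old and new implementations would show the linear-in-input-size slowdown mentioned above.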

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
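For an in-process alternative to `/usr/bin/time -v` (an assumed measurement approach, not the one used above), the `resource` module can report the peak resident set size around the same workload; note `resource` is Unix-only and `ru_maxrss` units differ by platform (kilobytes on Linux, bytes on macOS):

```python
# Observe peak-RSS growth while unquoting the large input from the
# timeit command above. Unix-only; ru_maxrss units vary by platform.
import resource
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000   # same input as the timeit run

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
result = unquote_to_bytes(v)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"output: {len(result)} bytes, peak RSS grew by ~{after - before}")
```

Each repetition of the input decodes to 6 bytes (two 2-byte UTF-8 characters plus two escaped single bytes), so the output here is 3,000,000 bytes.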

gpshead avatar Sep 12 '22 23:09 gpshead

Any thoughts from reviewers?

gpshead avatar Oct 01 '22 18:10 gpshead