
gh-88500: Reduce memory use of `urllib.unquote`

gpshead opened this issue 3 years ago • 2 comments

`urllib.unquote_to_bytes` and `urllib.unquote` could both generate O(len(string)) intermediate bytes or str objects while computing the unquoted result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to use an expanding bytearray and a generator internally, instead of precomputed `split()`-style operations.
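To illustrate the approach (a minimal sketch of the technique, not CPython's actual code; the names `_unquote_chunks` and `unquote_to_bytes_sketch` are hypothetical): a generator lazily yields literal runs and decoded escape bytes one chunk at a time, and the caller accumulates them into a single expanding bytearray, so no list of split fragments is ever materialized.

```python
_HEXDIG = frozenset(b'0123456789abcdefABCDEF')

def _unquote_chunks(data: bytes):
    """Yield alternating literal runs and percent-decoded single bytes."""
    i, n = 0, len(data)
    while i < n:
        pct = data.find(b'%', i)
        if pct < 0:
            yield data[i:]                      # trailing literal run
            return
        if pct > i:
            yield data[i:pct]                   # literal run before the '%'
        hexpart = data[pct + 1:pct + 3]
        if len(hexpart) == 2 and hexpart[0] in _HEXDIG and hexpart[1] in _HEXDIG:
            yield bytes([int(hexpart, 16)])     # one decoded escape byte
            i = pct + 3
        else:
            yield b'%'                          # malformed escape: keep '%' verbatim
            i = pct + 1

def unquote_to_bytes_sketch(data: bytes) -> bytes:
    out = bytearray()                           # single expanding buffer
    for chunk in _unquote_chunks(data):
        out += chunk
    return bytes(out)
```

The generator keeps only the current chunk alive at any moment, so peak memory is bounded by the output buffer plus one chunk rather than by a full list of split pieces.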

  • Issue: gh-88500

Closes #88500.

gpshead avatar Sep 12 '22 08:09 gpshead

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or contain little Unicode or % escaping. But the functions are already quite fast, so this is not a big deal. The slowdown scales linearly with input size, as expected.
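A harness in the style described above (an assumed reconstruction, not the author's exact benchmark) times both functions on the antagonistic input from the comment:

```python
# Microbenchmark sketch: time unquote() / unquote_to_bytes() on an input
# that mixes non-ASCII text with valid and malformed percent escapes.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000   # the antagonistic input from the comment

for func in (unquote, unquote_to_bytes):
    # Take the best of several repeats to reduce timing noise.
    seconds = min(timeit.repeat(lambda: func(mess), number=200, repeat=5))
    print(f"{func.__name__}: {seconds / 200 * 1e6:.1f} us per call")
```

Running the same harness against both the old and new implementations would show the linear-in-input-size slowdown mentioned above.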

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
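For an in-process alternative to `/usr/bin/time -v` (an assumed measurement approach, not the one used above), the `resource` module can report the peak resident set size around the same workload; note `resource` is Unix-only and `ru_maxrss` units differ by platform (kilobytes on Linux, bytes on macOS):

```python
# Observe peak-RSS growth while unquoting the large input from the
# timeit command above. Unix-only; ru_maxrss units vary by platform.
import resource
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000   # same input as the timeit run

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
result = unquote_to_bytes(v)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"output: {len(result)} bytes, peak RSS grew by ~{after - before}")
```

Each repetition of the input decodes to 6 bytes (two 2-byte UTF-8 characters plus two escaped single bytes), so the output here is 3,000,000 bytes.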

gpshead avatar Sep 12 '22 23:09 gpshead

Any thoughts from reviewers?

gpshead avatar Oct 01 '22 18:10 gpshead