gh-90716: add _pylong.py module
Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms that can be used for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant since the asymptotic behavior is what dominates runtime. Functions provided by this module should be considered private and not part of any public API.
Currently implemented are int_to_decimal_string() (based on code from Tim), int_divmod() (based on code from Mark), and str_to_int() (based on code from bjorn-martinsson). longobject.c has been changed to call into _pylong if the Python version is likely to be faster.
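For illustration only (not part of the patch itself): on a build with this change, the private helpers can be exercised directly. The signatures are assumed from the description above (int_to_decimal_string(n) -> str, str_to_int(s) -> int, int_divmod(a, b) -> (quotient, remainder)):

import _pylong

n = (1 << 100_000) - 1
s = _pylong.int_to_decimal_string(n)   # asymptotically faster int -> str
assert int(s) == n
assert _pylong.str_to_int(s) == n      # asymptotically faster str -> int

d = (1 << 50_000) + 3
q, r = _pylong.int_divmod(n, d)
assert q * d + r == n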
Examples of performance improvements:
int->str
python -m timeit -s "i=(1<<2_000_000) - 1" "str(i)"
Before: 1 loop, best of 5: 3.58 sec per loop
After: 2 loops, best of 5: 168 msec per loop
str->int
python -m timeit -s "s='1'*1_000_000" "int(s)"
Before: 1 loop, best of 5: 3.65 sec per loop
After: 1 loop, best of 5: 470 msec per loop
divmod
python -m timeit -s "a=(1<<20_000_000)-1; b=(1<<100_000)" "divmod(a, b)"
Before: 1 loop, best of 5: 2.23 sec per loop
After: 2 loops, best of 5: 141 msec per loop
Co-author: Tim Peters [email protected]
Co-author: Mark Dickinson [email protected]
Co-author: Bjorn Martinsson
- Issue: gh-90716
Some relevant discussion on the pros and cons of adding these "smarter" but more complicated algorithms:
https://discuss.python.org/t/faster-large-integer-multiplication/13300
I think implementing them in Python lowers the bar quite a bit in terms of what is maintainable in the long term. We have a small number of core devs who are numeric experts so we need to be cautious about adding a bunch of complex code. Maybe this PR already goes too far? It feels like pretty good "bang for buck" to me.
About long_to_decimal, here are some things I've noticed and some thoughts:
- Increasing the size of BITLIM to something like 32768 gives a noticeable performance boost for me.
- I think it is more natural to add this algorithm to the decimal module. This code is essentially just a smart way of implementing the constructor of decimal.Decimal when called with an int.
- The code is really slow in PyPy. I'm guessing this is because decimal is slow in PyPy.
- I think the current implementation is a bit hard to understand. Maybe something like this would make it easier to follow:
D = decimal.Decimal
mem = {1: D(2)}

def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(hi, w - w2) * w2pow(w2) + inner(lo, w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
Btw, there is an if 0: in divmod_fast which I'm assuming is not supposed to be there.
- No thought whatsoever, or timing, went into the current BITLIM value. I see that, on my disk, I last left it as 3 :wink:, so as to provoke the code into doing something on "almost all" inputs, for testing.
- The code doesn't belong in decimal. Leaving aside that decimal is an external project (it's not "our" code), if this experiment is accepted it's intended to be a repository for all coded-in-Python bigint accelerators. If the Decimal() constructor wants to exploit it (which it should!), fine, CPython's decimal wrapper should import this module and call its long_to_decimal() function as needed.
- The original code aimed to build an efficient power tree as efficiently as reasonably possible. Toward that end, it looked at each plausible entry only once, didn't use recursion, and aimed to minimize the number of dict entries. For example, for bit width 1,000,001 and BITLIM=128, it creates a dict with 16 keys, the smallest of which is 122. Your alternative creates a dict with 27 keys. At your BITLIM=32768, the same for your code's dict, but the original only constructs 6 dict entries. Is that worth a lot? Nope. But since I don't find it hard to understand, I'm not inclined to change it either. Neil can do whatever he likes - fine by me.
- In any case, a persistent dict is a Bad Idea here. It will only grow over time, to no good end. Regardless of which dict-building code is used, it needs only O(log log n) entries, so there's scant overlap expected between different top-level arguments. In your code much more so, since it creates a chain of entries all the way down to key 1. The original code never creates more than 2 entries with key < BITLIM.
- if 0: is a common way to "comment out" a block of executable code intended for, e.g., debugging. No code is generated for the block; the compiler throws away the if 0 test - and the block - entirely. Nothing wrong with leaving such code in (see the small dis demo below).
- I was surprised to see that x + x really is generally faster than x*2 in decimal - good catch 😃
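As a small aside, the dead-code elimination mentioned above is easy to see for yourself; nothing below is specific to this PR:

import dis

def f():
    if 0:
        print("this block is discarded at compile time")
    return 1

dis.dis(f)  # the disassembly shows no trace of the if-0 block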
I agree that persistent dictionaries are bad, but I never intended mem to be one.

About the number of keys used: I fixed my implementation to use fewer keys. Now I believe it uses the same number of keys as your implementation, or sometimes one fewer. For example, try it out with bit width 1,000,001. For BITLIM=128 I use 15 keys, and your implementation uses 16.
D = decimal.Decimal   # as in the earlier snippet
D2 = D(2)
mem = {}

def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        elif w <= BITLIM:
            mem[w] = D2 ** w
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768  # 128

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.
> With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
That seems wrong. Trace through, e.g., w2pow(2**20). That invokes

    mem[w] = w2pow(w2) * w2pow(w - w2)

which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.
Hi bjorn-martinsson, I appreciate the interest in improving the functions in _pylong.py, but polishing the functions is not the primary focus of this PR. Perhaps we should create another issue to focus on that? If they produce correct results and perform reasonably well, I think that's good enough for now. So, I feel like long_to_decimal() is in pretty okay shape already.
I'm less confident about divmod_fast() since Mark has not looked at this PR yet, AFAIK. I extracted the code from an old issue and I believe it was some prototype code, not intended for serious use.

Writing more extensive unit tests and scripts to produce timing/benchmark reports is perhaps the next step. long_to_decimal doesn't handle the bytes writer case; I need to fix that. I also need to come up with better threshold tests of when to switch to the _pylong functions. I'm using Py_ABS(Py_SIZE(n)) > 1000 but that's almost certainly not the right thing.

Based on comments from njs, I wonder if _pylong should use ContextVar and then we could move the max_str_digits checking into there. Getting away from a global interpreter setting would be quite a nice improvement. Doing it inside _pylong should be a lot easier.
Neil, I think the one basic thing this is missing is an asymptotically faster str->int function. Conversions in both directions were the real topic of the original issue report, although it focused on int->str (as "the harder" problem). Both are quadratic-time today.
@bjorn-martinsson posted suitable code for str->int (in a comment) in gh-90716. It's much like the int->str here, but sticks to native Python int arithmetic, so it only gets "Karatsuba-like" speedup. Which is huge, but not as huge as we can get from decimal's fancier *. But I don't see a sane way to apply decimal in the decimal string -> binary int direction. The result has to be a CPython bigint, and conversion from Decimal to int is also quadratic-time(*).

(*) But could be faster after this PR lands (convert the Decimal to a decimal string (linear time in # of digits), then apply the hypothetical new faster-than-quadratic str->int function).
Some quick timing scripts. time_divmod_fast.py.txt time_long_to_dec.py.txt
Results on my machine:
time_divmod_fast.py
bits old time new time faster correct
------------------------------------------------------------------------------
54 0.000 0.000 0.18 True
107 0.000 0.000 0.17 True
213 0.000 0.000 0.17 True
426 0.000 0.000 0.26 True
851 0.000 0.000 0.62 True
1701 0.000 0.000 0.24 True
3402 0.000 0.000 0.39 True
6804 0.000 0.000 0.56 True
13607 0.000 0.000 1.06 True
27214 0.001 0.000 1.90 True
54427 0.002 0.001 2.31 True
108853 0.008 0.002 3.97 True
217706 0.026 0.005 5.26 True
435412 0.105 0.014 7.40 True
time_long_to_dec.py
bits old new ratio equal
1025 0.000 0.000 0.1x True
2049 0.000 0.000 0.2x True
4097 0.000 0.000 0.4x True
8193 0.000 0.000 0.7x True
16385 0.000 0.000 1.0x True
32769 0.001 0.001 1.5x True
65537 0.005 0.003 2.0x True
131073 0.016 0.005 3.0x True
262145 0.062 0.012 5.2x True
524289 0.246 0.026 9.5x True
1048577 0.980 0.054 18.0x True
2097153 3.926 0.114 34.5x True
> Neil, I think the one basic thing this is missing is an asymptotically faster str->int function.
Ah, okay. I extracted bjorn-martinsson's code from GH-90716. A little timing script is attached. Timing from my machine:
time_str_to_long.py
digits old time new time faster correct
------------------------------------------------------------------------------
1024 0.000 0.000 1.50 True
2048 0.000 0.000 0.98 True
4096 0.000 0.000 0.93 True
8192 0.001 0.001 1.21 True
16384 0.002 0.001 2.47 True
32768 0.005 0.002 2.43 True
65536 0.019 0.005 3.48 True
131072 0.061 0.016 3.78 True
262144 0.263 0.049 5.39 True
524288 0.983 0.147 6.68 True
I'm working on a change to longobject.c to use it. The details are a bit tricky though (leading/trailing whitespace, etc).
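Purely as an illustration of the kind of cleanup involved (this is not the longobject.c logic, and the helper name is made up): before the digits reach any divide-and-conquer routine, int() has already dealt with sign, surrounding whitespace, and PEP 515 underscores, roughly like this:

def _clean_decimal_literal(s):
    # Hypothetical sketch only.  Real int() parsing is done in C and also
    # validates that underscores appear only between digits; this version
    # simply strips them.
    s = s.strip()
    sign = 1
    if s and s[0] in "+-":
        if s[0] == "-":
            sign = -1
        s = s[1:]
    digits = s.replace("_", "")
    if not digits.isdecimal():
        raise ValueError(f"invalid literal for int() with base 10: {s!r}")
    return sign, digits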
> I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.
I just thought that the recursive formulation looked nicer and more straightforward. But I agree that this is completely subjective.
> > With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
>
> That seems wrong. Trace through, e.g., w2pow(2**20). That invokes
>
>     mem[w] = w2pow(w2) * w2pow(w - w2)
>
> which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.
I still believe I'm correct, but the reason why is a bit subtle. Notice that in my implementation, the first time inner() calls w2pow(w2), it does it with a small w2. So the first call will not lead to a 2^20 -> 2^19 -> 2^18 -> ... kind of recursion. In fact, the deepest it will ever go is depth 2.
I've compiled the three different styles for int -> str and str -> int conversion.

The first style is the one I used in the comment in gh-90716. It is very simple, but slightly slower than the other two styles.

The second style is the recursive style with mem that I posted above.

The third style is what Tim used.

I think any of these styles would work; just be consistent and use the same style for both str -> int and int -> str.
from_str_to_int.py
def str_to_int_only_pow2_powers(s):
    DIGLIM = 2048
    pow5 = [5]
    for _ in range((len(s) - 1).bit_length() - 1):
        pow5.append(pow5[-1] * pow5[-1])

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        lg_split = (r - l - 1).bit_length() - 1
        split = 1 << lg_split
        return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))


def str_to_int_recursive(s):
    DIGLIM = 2048
    mem = {}

    def w5pow(w):
        """ return 5**w and store the result """
        if w not in mem:
            if w - 1 in mem:
                mem[w] = mem[w - 1] * 5
            elif w <= DIGLIM:
                mem[w] = 5 ** w
            else:
                w2 = w >> 1
                mem[w] = w5pow(w2) * w5pow(w - w2)
        return mem[w]

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return _str_to_int(r - split, r) + ((_str_to_int(l, r - split) * w5pow(split)) << split)

    return _str_to_int(0, len(s))


def str_to_int_iterative(s):
    DIGLIM = 2048
    w5pow = {}
    w = len(s)
    while w >= DIGLIM:
        w2 = w >> 1
        if w & 1:
            w5pow[w2 + 1] = None
        w5pow[w2] = None
        w = w2
    if w5pow:
        it = reversed(w5pow.keys())
        w = next(it)
        w5pow[w] = 5 ** w
        for w in it:
            if w - 1 in w5pow:
                val = w5pow[w - 1] * 5
            else:
                w2 = w >> 1
                val = w5pow[w2] * w5pow[w - w2]
            w5pow[w] = val

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return ((_str_to_int(l, r - split) * w5pow[split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))
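A quick consistency check of the three variants against int() itself (assumes the three functions above are defined in the same scope):

import random

s = "".join(random.choice("0123456789") for _ in range(10_000))
expect = int(s)
assert str_to_int_only_pow2_powers(s) == expect
assert str_to_int_recursive(s) == expect
assert str_to_int_iterative(s) == expect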
from_int_to_str.py
import decimal


def int_to_str_only_pow2_powers(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        pow2 = [D(2)]
        for _ in range((n.bit_length() - 1).bit_length() - 1):
            pow2.append(pow2[-1] * pow2[-1])

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            lg_w2 = (w - 1).bit_length() - 1
            w2 = 1 << lg_w2
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * pow2[lg_w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_recursive(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        mem = {}

        def w2pow(w):
            """ return D(2)**w and store the result """
            if w not in mem:
                if w - 1 in mem:
                    mem[w] = mem[w - 1] + mem[w - 1]
                elif w <= BITLIM:
                    mem[w] = D(2) ** w
                else:
                    w2 = w >> 1
                    mem[w] = w2pow(w2) * w2pow(w - w2)
            return mem[w]

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_iterative(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        w2pow = {}
        w = n.bit_length()
        while w >= BITLIM:
            w2 = w >> 1
            if w & 1:
                w2pow[w2 + 1] = None
            w2pow[w2] = None
            w = w2
        if w2pow:
            it = reversed(w2pow.keys())
            w = next(it)
            w2pow[w] = D(2)**w
            for w in it:
                if w - 1 in w2pow:
                    val = w2pow[w - 1] + w2pow[w - 1]
                else:
                    w2 = w >> 1
                    assert w2 in w2pow
                    assert w - w2 in w2pow
                    val = w2pow[w2] * w2pow[w - w2]
                w2pow[w] = val

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * w2pow[w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)
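And the same kind of spot check for the int -> str variants (again assuming the functions above are in scope):

import random

n = random.getrandbits(200_000)
expect = str(n)
assert int_to_str_only_pow2_powers(n) == expect
assert int_to_str_recursive(n) == expect
assert int_to_str_iterative(n) == expect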
About str_to_int needing a faster big int multiplication to be significantly faster than int(): I think that it is worth adding str_to_int even if no super fast big int multiplication currently exists in Python. If in the future Python adds faster big int multiplication, then str_to_int will run fast without needing to be modified in any way.
Finally, about implementing fast big int multiplication. I have not yet tried coding the Schönhage–Strassen algorithm in Python, but I'm fairly familiar with FFT algorithms and I believe I could get a pretty clean looking implementation using about 10 - 20 lines of Python code.
I agree these should be as similar as possible. I don't really care which style is picked.
I have no interest in being as fast as possible here. "Major improvement" is the goal. "Just" Karatsuba-like speedup is, as said before, huge. Ambition has forever been the fatal enemy of "major improvement" here, and after 30 years of that I'm tired of it :wink:.
For a given value, int->str is currently way slower than str->int, which is why I focused on the former.
$ timeit -s "i = 10**100000 - 1; s = str(i)" "str(i)"
2 loops, best of 5: 160 msec per loop
$ timeit -s "i = 10**100000 - 1; s = str(i)" "int(s)"
5 loops, best of 5: 58.2 msec per loop
But improving the asymptotics of str->int is also of real value.
I've made some more updates, getting close to merge quality now. I would still like some more unit tests and a good code review. My knowledge of the C-API is a bit out of date (e.g. PyUnicode_Writer, etc). I've renamed Mark's divmod_fast() to int_divmod, to be more consistent. I hope he might have time to give feedback on it (e.g. is it fit for purpose).
> Ambition has forever been the fatal enemy of "major improvement" here
Yeah, some improvement over the status quo is my main focus with this PR. I'm pretty sure people will come along and improve _pylong.py afterwards. Perhaps we will need to push back to prevent it from growing a lot of complexity. However, even with the simple and short version I have now, it seems we are getting some decent speedups if the integers are large enough.

Adding in a ContextVar-based mechanism seems like a neat idea but falls under the ambition category. Also, I think it could be worth considering making a semi-public API so packages like gmpy2 could monkey-patch _pylong. They should probably also be able to adjust the threshold levels for using _pylong, since those implementations have a different cutover performance point.
As I discussed before, I think it is important to be consistent with the style. Currently int_to_decimal and _str_to_int_inner are coded in different ways even though they fundamentally do the exact same thing, so their implementations could and should look almost the same.

Therefore, I think you should switch _str_to_int_inner to the str_to_int_iterative() function from from_str_to_int.py that I posted above. Doing that will make int_to_decimal and _str_to_int_inner have the same style.
> I think it is important to be consistent with the style.
Okay. To my eye, the recursive version reads easier and seems to perform just as well. I made a gist with the iterative version:

https://gist.github.com/nascheme/d04e6d63cfbd3e4614ca0e173b961de4

Is that correct? It is more lines of code, but I do see how it is more similar to int_to_decimal.

Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like. The pow5() approach seems quite nice to me. I'd assume you would need a similar pow2() function.
EDIT: oops! This originally forgot to add mem[w] = result to cache the result. No effect on correctness, just speed. The problem with using @functools.cache is that then the cache is "invisible", so we can't test whether w-1 is in it.
> Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like.
He already did, but hiding in a file in a comment that doesn't show up unless you click on the right thing. I'll pull it out here, but renamed, and reworked some to be more of a straight drop-in replacement for what you already have, and with some comments. I agree the recursive form reads well.
def int_to_decimal(n):
    """Asymptotically fast conversion of an 'int' to Decimal."""
    # Function due to Tim Peters.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm that is
    # faster for large numbers.  Builds an equal decimal.Decimal in a
    # "clever" recursive way.  If we want a string representation, we
    # apply str to _that_.

    if _DEBUG:
        print('int_to_decimal', n.bit_length(), file=sys.stderr)

    D = decimal.Decimal
    D2 = D(2)

    BITLIM = 32768

    mem = {}

    def w2pow(w):
        """Return D(2)**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to Decimal.
        """
        if (result := mem.get(w)) is None:
            if w <= BITLIM:
                result = D2 ** w
            elif w - 1 in mem:
                result = (t := mem[w - 1]) + t
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w2pow(w2) * w2pow(w - w2)
            mem[w] = result
        return result

    def inner(n, w):
        if w <= BITLIM:
            return D(n)
        w2 = w >> 1
        hi = n >> w2
        lo = n - (hi << w2)
        return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        if n < 0:
            negate = True
            n = -n
        else:
            negate = False
        result = inner(n, n.bit_length())
        if negate:
            result = -result
    return result
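Regarding the EDIT note above, here is a rough sketch (not what this PR uses) of what a functools-based cache would look like, and the branch it cannot express:

import decimal
import functools

D = decimal.Decimal
BITLIM = 32768

@functools.lru_cache(maxsize=None)  # functools.cache on 3.9+
def w2pow_cached(w):
    """Return D(2)**w, memoized by functools instead of an explicit dict."""
    # Must be called under the same high-precision local context that
    # int_to_decimal() sets up, or large powers will trap/round.
    if w <= BITLIM:
        return D(2) ** w
    w2 = w >> 1
    # With the explicit dict we can first ask whether w - 1 is already
    # cached and use the cheaper mem[w - 1] + mem[w - 1]; the functools
    # cache cannot be probed by key, so that optimization is lost here.
    return w2pow_cached(w2) * w2pow_cached(w - w2)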
Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.
def _str_to_long_inner(s):
    """Asymptotically fast conversion of a 'str' to an 'int'."""
    # Function due to Bjorn Martinsson.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm making use
    # of Python's built in big int multiplication. Since Python uses the
    # Karatsuba algorithm for multiplication, the time complexity
    # of this function is O(len(s)**1.58).

    DIGLIM = 2048

    mem = {}

    def w5pow(w):
        """Return 5**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to 'int'.
        """
        if (result := mem.get(w)) is None:
            if w <= DIGLIM:
                result = 5 ** w
            elif w - 1 in mem:
                result = mem[w - 1] * 5
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w5pow(w2) * w5pow(w - w2)
            mem[w] = result
        return result

    def inner(a, b):
        if b - a <= DIGLIM:
            return int(s[a:b])
        mid = (a + b + 1) >> 1
        return inner(mid, b) + ((inner(a, mid) * w5pow(b - mid)) << (b - mid))

    return inner(0, len(s))
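A quick round-trip check of the two functions above (assumes both are defined in the same scope, with _DEBUG set to a false value and sys imported, as in _pylong.py):

import random

for _ in range(5):
    n = random.getrandbits(random.randrange(1, 200_000))
    s = str(n)
    assert str(int_to_decimal(n)) == s
    assert _str_to_long_inner(s) == n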
Is the goal here to make memory, not time, the limiting factor, with hopes that there would be no need for a limit on str(int) conversion to protect against a DoS attack?
> Is the goal here to make memory, not time, the limiting factor, with hopes that there would be no need for a limit on str(int) conversion to protect against a DoS attack?
That's not my goal, and I doubt it's Neil's either. The primary goal is to make decimal_str <-> int conversions asymptotically faster on large values, for the sake of speed. Their current quadratic-time behavior has been an ongoing background complaint for decades.
I feel this is ready for review now. Removing Mark as a requested reviewer since he mentioned he might not have time to do it, at least anytime soon.
I don't think there is any huge rush to get this in. I was hoping it might help reduce the pain of the CVE-2020-10735 mitigations. However, the threshold for kicking into the _pylong functions is quite a bit higher than the default 4300 limit. Maybe with some more work we can get there. Also, implementing a ContextVar system in _pylong and then putting the int_max_str_digits check in there could be a good idea. That can be done as a separate PR though.
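A very rough sketch of what a ContextVar-based limit inside _pylong might look like; none of these names exist in CPython, this is purely illustrative:

import contextvars

# Hypothetical: a per-context digit limit instead of a global interpreter
# setting.  A value of 0 means "no limit" in this sketch.
_max_str_digits = contextvars.ContextVar("max_str_digits", default=4300)

def _check_conversion_size(ndigits):
    limit = _max_str_digits.get()
    if limit and ndigits > limit:
        raise ValueError(f"conversion exceeds the limit of {limit} digits")

# A caller could then scope an override without touching global state:
token = _max_str_digits.set(0)
try:
    _check_conversion_size(10_000_000)  # allowed while the override is active
finally:
    _max_str_digits.reset(token)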
The int->str and str->int code are very similar in concept, and should be next to each other in the code (not with unrelated division between them).
While the concepts are very similar, the code is currently quite different in essentially pointless ways. Within the last day, @bjorn-martinsson posted a nearly-identical-as-possible recoding of str->int, in the comment starting with:
> Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.
IMO, that should be adopted.
I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.
Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.
> I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.
They seem quite similar to dynamically scoped variables as in Lisp, i.e. crawl up the stack to find the most recent binding. The mechanics are different but the behavior seems similar to me. We can forget about that stuff for this PR, I think.
> Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.
It is 4300 decimal digits, and it does seem low as a default to me too.
./python -m timeit -s "s='1'*4000" "int(s)"
5000 loops, best of 5: 78.2 usec per loop
./python -m timeit -s "s='1'*4000; i = int(s)" "str(i)"
2000 loops, best of 5: 162 usec per loop
> I don't see a sane way to apply decimal in the decimal string -> binary int direction.

I have one that seems faster than this module's str_to_int starting around 10^7 digits. At 10^8 digits it's about three times faster (on Python 3.8.12). Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
> I have one that seems faster than this module's str_to_int starting around 10^7 digits. At 10^8 digits it's about three times faster (on Python 3.8.12). Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
I think starting a discussion in Python Ideas (discuss.python.org) or creating a draft PR would be fine. I'm not completely opposed to discussing it here. I just don't want this PR to fall into the tarpit of endlessly debating further algorithm refinements.
> Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
Right, this pull request is about the specific code Neil is working on (in the "Files changed" tab). The linked issue report is for general discussion, although this UI buries the link at the bottom of the initial message (so it's hard to realize it's there unless you know in advance where to look for it). Here it is: gh-90716
I don't intend to argue with Neil :wink:, but it's my original issue report, and I wholly intended it to be open to discuss any and all relevant algorithms. This pull request is Neil's, though, and is about the specific algorithms he's working on. If a different pull request ends up implementing different algorithms, that's fine - while it's uncommon, there's nothing unprecedented, or "wrong", about having multiple PRs associated with a single issue report.
Ok, posted in gh-90716.
FYI, as of 3.11 we "freeze" most of the [pure Python] modules that are imported during startup. (See gh-89183.) The idea is that we minimize the amount of disk I/O and CPU time needed to import those modules, by pre-running as much of the import process as possible at build time and storing the result in the DATA section of the python binary. This speeds up startup significantly. It also allows us to import those modules much earlier during runtime initialization since we don't rely on OS functionality.
Basically, the module's code object is converted (as generated code) to C static variables, statically initialized, and then compiled into the binary (like builtin modules are). (That's the newer "deepfreeze". There's a simpler, older "freeze" mechanism that deepfreeze builds on.) Tools/scripts/freeze_modules.py is how we manage the frozen modules (and add freezing as a build step). Python/deepfreeze/deepfreeze.c is the generated code. To add a module to the set of those that get frozen, add the name to the FROZEN dict at the top of Tools/scripts/freeze_modules.py.
As I said, a main benefit of freezing is that it helps lower startup time. That doesn't seem so relevant here. However, I'd be surprised if coupling a builtin type to a Python module would not benefit from the module being frozen.
Anyway, I just want to point this out in case it's useful. (It's a relatively new thing we're doing.)
Benchmark script extracted from bm_pidigits.py, can be run as stand-alone script.
Another benchmark, trying to determine a good cutoff for switching to int_divmod. I'm using (size_w > 300 && (size_v - size_w) > 300) now, which seems better.
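A rough way to probe that crossover (a sketch; it assumes a build of this branch where _pylong.int_divmod is importable, with sizes expressed in CPython's 30-bit digits as on typical 64-bit builds). Note that on a patched build divmod() itself may already dispatch to _pylong above the chosen threshold, so the ratio is most meaningful near the cutoff:

import timeit
import _pylong

def probe(size_w, size_diff, number=20):
    # size_w is the divisor size, size_diff the extra size of the dividend,
    # both in 30-bit digits.
    b = (1 << (30 * size_w)) - 1
    a = (1 << (30 * (size_w + size_diff))) - 1
    t_c = timeit.timeit(lambda: divmod(a, b), number=number)
    t_py = timeit.timeit(lambda: _pylong.int_divmod(a, b), number=number)
    return t_c / t_py

for size_w in (100, 300, 1000):
    for size_diff in (100, 300, 1000):
        print(size_w, size_diff, round(probe(size_w, size_diff), 2))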