gh-90716: add _pylong.py module
Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms that can be used for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant since the asymptotic behavior is what dominates runtime. Functions provided by this module should be considered private and not part of any public API.
Currently implemented are int_to_decimal_string() (based on code from Tim), int_divmod() (based on code from Mark), and str_to_int() (based on code from bjorn-martinsson). longobject.c has been changed to call into _pylong if the Python version is likely to be faster.
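For illustration only (not part of the patch itself): on a build with this change, the private helpers can be exercised directly. The signatures are assumed from the description above (int_to_decimal_string(n) -> str, str_to_int(s) -> int, int_divmod(a, b) -> (quotient, remainder)):

import _pylong

n = (1 << 100_000) - 1
s = _pylong.int_to_decimal_string(n)   # asymptotically faster int -> str
assert int(s) == n
assert _pylong.str_to_int(s) == n      # asymptotically faster str -> int

d = (1 << 50_000) + 3
q, r = _pylong.int_divmod(n, d)
assert q * d + r == n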
Examples of performance improvements:
int->str
python -m timeit -s "i=(1<<2_000_000) - 1" "str(i)"
Before: 1 loop, best of 5: 3.58 sec per loop
After: 2 loops, best of 5: 168 msec per loop
str->int
python -m timeit -s "s='1'*1_000_000" "int(s)"
Before: 1 loop, best of 5: 3.65 sec per loop
After: 1 loop, best of 5: 470 msec per loop
divmod
python -m timeit -s "a=(1<<20_000_000)-1; b=(1<<100_000)" "divmod(a, b)"
Before: 1 loop, best of 5: 2.23 sec per loop
After: 2 loops, best of 5: 141 msec per loop
Co-author: Tim Peters [email protected]
Co-author: Mark Dickinson [email protected]
Co-author: Bjorn Martinsson
- Issue: gh-90716
Some relevant discussion on the pros and cons of adding these "smarter" but more complicated algorithms:
https://discuss.python.org/t/faster-large-integer-multiplication/13300
I think implementing them in Python lowers the bar quite a bit in terms of what is maintainable in the long term. We have a small number of core devs who are numeric experts so we need to be cautious about adding a bunch of complex code. Maybe this PR already goes too far? It feels like pretty good "bang for buck" to me.
About long_to_decimal, here are some things I've noticed and some thoughts:
- Increasing the size of BITLIM to something like 32768 gives a noticeable performance boost for me.
- I think it is more natural to add this algorithm to the decimal module. This code is essentially just a smart way of implementing the constructor of decimal.Decimal when called with an int.
- The code is really slow in PyPy. I'm guessing this is because decimal is slow in PyPy.
- I think the current implementation is a bit hard to understand. Maybe something like this would make it easier to follow:
D = decimal.Decimal
mem = {1: D(2)}

def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(hi, w - w2) * w2pow(w2) + inner(lo, w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
Btw, there is an if 0: in divmod_fast which I'm assuming is not supposed to be there.
- No thought whatsoever, or timing, went into the current BITLIM value. I see that, on my disk, I last left it as 3 :wink:, so as to provoke the code into doing something on "almost all" inputs, for testing.
- The code doesn't belong in decimal. Leaving aside that decimal is an external project (it's not "our" code), if this experiment is accepted it's intended to be a repository for all coded-in-Python bigint accelerators. If the Decimal() constructor wants to exploit it (which it should!), fine, CPython's decimal wrapper should import this module and call its long_to_decimal() function as needed.
- The original code aimed to build an efficient power tree as efficiently as reasonably possible. Toward that end, it looked at each plausible entry only once, didn't use recursion, and aimed to minimize the number of dict entries. For example, for bit width 1,000,001 and BITLIM=128, it creates a dict with 16 keys, the smallest of which is 122. Your alternative creates a dict with 27 keys. At your BITLIM=32768, the same for your code's dict, but the original only constructs 6 dict entries. Is that worth a lot? Nope. But since I don't find it hard to understand, I'm not inclined to change it either. Neil can do whatever he likes - fine by me.
- In any case, a persistent dict is a Bad Idea here. It will only grow over time, to no good end. Regardless of which dict-building code is used, it needs only O(log log n) entries, so there's scant overlap expected between different top-level arguments. In your code much more so, since it creates a chain of entries all the way down to key 1. The original code never creates more than 2 entries with key < BITLIM.
- if 0: is a common way to "comment out" a block of executable code intended for, e.g., debugging. No code is generated for the block; the compiler throws away the if 0 test - and the block - entirely. Nothing wrong with leaving such code in (see the small dis demo below).
- I was surprised to see that x + x really is generally faster than x*2 in decimal - good catch 😃
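As a small aside, the dead-code elimination mentioned above is easy to see for yourself; nothing below is specific to this PR:

import dis

def f():
    if 0:
        print("this block is discarded at compile time")
    return 1

dis.dis(f)  # the disassembly shows no trace of the if-0 block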
I agree that persistent dictionaries are bad, but I never intended mem to be one.

About the number of keys used: I fixed my implementation to use fewer keys. Now I believe it uses the same number of keys as your implementation, or sometimes one fewer. For example, try it out with bit width 1,000,001. For BITLIM=128 I use 15 keys, and your implementation uses 16.
D = decimal.Decimal   # as in the earlier snippet
D2 = D(2)
mem = {}

def w2pow(w):
    """ return D(2)**w and store the result """
    if w not in mem:
        if w - 1 in mem:
            mem[w] = mem[w - 1] + mem[w - 1]
        elif w <= BITLIM:
            mem[w] = D2 ** w
        else:
            w2 = w >> 1
            mem[w] = w2pow(w2) * w2pow(w - w2)
    return mem[w]

BITLIM = 32768  # 128

def inner(n, w):
    if w <= BITLIM:
        return D(n)
    w2 = w >> 1
    hi = n >> w2
    lo = n - (hi << w2)
    return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

with decimal.localcontext() as ctx:
    ctx.prec = decimal.MAX_PREC
    ctx.Emax = decimal.MAX_EMAX
    ctx.Emin = decimal.MIN_EMIN
    ctx.traps[decimal.Inexact] = 1
    result = inner(n, n.bit_length())
With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.
> With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
That seems wrong. Trace through, e.g., w2pow(2**20). That invokes

    mem[w] = w2pow(w2) * w2pow(w - w2)

which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.
Hi bjorn-martinsson, I appreciate the interest in improving the functions in _pylong.py, but polishing the functions is not the primary focus of this PR. Perhaps we should create another issue to focus on that? If they produce correct results and perform reasonably well, I think that's good enough for now. So, I feel like long_to_decimal() is in pretty okay shape already.
I'm less confident about divmod_fast() since Mark has not looked at this PR yet, AFAIK. I extracted the code from an old issue and I believe it was some prototype code, not intended for serious use.

Writing more extensive unit tests and scripts to produce timing/benchmark reports is perhaps the next step. long_to_decimal doesn't handle the bytes writer case; I need to fix that. I also need to come up with better threshold tests of when to switch to the _pylong functions. I'm using Py_ABS(Py_SIZE(n)) > 1000 but that's almost certainly not the right thing.

Based on comments from njs, I wonder if _pylong should use ContextVar and then we could move the max_str_digits checking into there. Getting away from a global interpreter setting would be quite a nice improvement. Doing it inside _pylong should be a lot easier.
Neil, I think the one basic thing this is missing is an asymptotically faster str->int function. Conversions in both directions were the real topic of the original issue report, although it focused on int->str (as "the harder" problem). Both are quadratic-time today.
@bjorn-martinsson posted suitable code for str->int (in a comment) in gh-90716. It's much like the int->str here, but sticks to native Python int arithmetic, so it only gets "Karatsuba-like" speedup. Which is huge, but not as huge as we can get from decimal's fancier *. But I don't see a sane way to apply decimal in the decimal string -> binary int direction. The result has to be a CPython bigint, and conversion from Decimal to int is also quadratic-time(*).

(*) But could be faster after this PR lands (convert the Decimal to a decimal string (linear time in # of digits), then apply the hypothetical new faster-than-quadratic str->int function).
Some quick timing scripts. time_divmod_fast.py.txt time_long_to_dec.py.txt
Results on my machine:
time_divmod_fast.py
bits old time new time faster correct
------------------------------------------------------------------------------
54 0.000 0.000 0.18 True
107 0.000 0.000 0.17 True
213 0.000 0.000 0.17 True
426 0.000 0.000 0.26 True
851 0.000 0.000 0.62 True
1701 0.000 0.000 0.24 True
3402 0.000 0.000 0.39 True
6804 0.000 0.000 0.56 True
13607 0.000 0.000 1.06 True
27214 0.001 0.000 1.90 True
54427 0.002 0.001 2.31 True
108853 0.008 0.002 3.97 True
217706 0.026 0.005 5.26 True
435412 0.105 0.014 7.40 True
time_long_to_dec.py
bits old new ratio equal
1025 0.000 0.000 0.1x True
2049 0.000 0.000 0.2x True
4097 0.000 0.000 0.4x True
8193 0.000 0.000 0.7x True
16385 0.000 0.000 1.0x True
32769 0.001 0.001 1.5x True
65537 0.005 0.003 2.0x True
131073 0.016 0.005 3.0x True
262145 0.062 0.012 5.2x True
524289 0.246 0.026 9.5x True
1048577 0.980 0.054 18.0x True
2097153 3.926 0.114 34.5x True
> Neil, I think the one basic thing this is missing is an asymptotically faster str->int function.
Ah, okay. I extracted bjorn-martinsson's code from GH-90716. A little timing script is attached. Timing from my machine:
time_str_to_long.py
digits old time new time faster correct
------------------------------------------------------------------------------
1024 0.000 0.000 1.50 True
2048 0.000 0.000 0.98 True
4096 0.000 0.000 0.93 True
8192 0.001 0.001 1.21 True
16384 0.002 0.001 2.47 True
32768 0.005 0.002 2.43 True
65536 0.019 0.005 3.48 True
131072 0.061 0.016 3.78 True
262144 0.263 0.049 5.39 True
524288 0.983 0.147 6.68 True
I'm working on a change to longobject.c to use it. The details are a bit tricky though (leading/trailing whitespace, etc).
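Purely as an illustration of the kind of cleanup involved (this is not the longobject.c logic, and the helper name is made up): before the digits reach any divide-and-conquer routine, int() has already dealt with sign, surrounding whitespace, and PEP 515 underscores, roughly like this:

def _clean_decimal_literal(s):
    # Hypothetical sketch only.  Real int() parsing is done in C and also
    # validates that underscores appear only between digits; this version
    # simply strips them.
    s = s.strip()
    sign = 1
    if s and s[0] in "+-":
        if s[0] == "-":
            sign = -1
        s = s[1:]
    digits = s.replace("_", "")
    if not digits.isdecimal():
        raise ValueError(f"invalid literal for int() with base 10: {s!r}")
    return sign, digits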
> I'm not going to argue about this. As I said, I didn't find the code hard to understand to begin with, so this is just random thrashing to me. I expect you find your code easier to understand because you wrote it.
I just thought that the recursive formulation looked nicer and more straightforward. But I agree that this is completely subjective.
> > With this implementation, w2pow() is called in smart order (it is called with small values of w first, then bigger values of w), making it only have recursion depth at most 2.
>
> That seems wrong. Trace through, e.g., w2pow(2**20). That invokes
>
>     mem[w] = w2pow(w2) * w2pow(w - w2)
>
> which makes a recursive call with 2**19, which in turn makes a recursive call on the same line with 2**18, and so on, until the function is finally called with 2**15 (your BITLIM value). Only then does the recursion start to unwind.
I still believe I'm correct, but the reason why is a bit subtle. Notice that in my implementation, the first time inner() calls w2pow(w2), it does it with a small w2. So the first call will not lead to a 2^20 -> 2^19 -> 2^18 -> ... kind of recursion. In fact, the deepest it will ever go is depth 2.
I've compiled the three different styles for int -> str and str -> int conversion.

The first style is the one I used in the comment in gh-90716. It is very simple, but slightly slower than the other two styles.

The second style is the recursive style with mem that I posted above.

The third style is what Tim used.

I think any of these styles would work; just be consistent and use the same style for both str -> int and int -> str.
from_str_to_int.py
def str_to_int_only_pow2_powers(s):
    DIGLIM = 2048
    pow5 = [5]
    for _ in range((len(s) - 1).bit_length() - 1):
        pow5.append(pow5[-1] * pow5[-1])

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        lg_split = (r - l - 1).bit_length() - 1
        split = 1 << lg_split
        return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))


def str_to_int_recursive(s):
    DIGLIM = 2048
    mem = {}

    def w5pow(w):
        """ return 5**w and store the result """
        if w not in mem:
            if w - 1 in mem:
                mem[w] = mem[w - 1] * 5
            elif w <= DIGLIM:
                mem[w] = 5 ** w
            else:
                w2 = w >> 1
                mem[w] = w5pow(w2) * w5pow(w - w2)
        return mem[w]

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return _str_to_int(r - split, r) + ((_str_to_int(l, r - split) * w5pow(split)) << split)

    return _str_to_int(0, len(s))


def str_to_int_iterative(s):
    DIGLIM = 2048
    w5pow = {}
    w = len(s)
    while w >= DIGLIM:
        w2 = w >> 1
        if w & 1:
            w5pow[w2 + 1] = None
        w5pow[w2] = None
        w = w2
    if w5pow:
        it = reversed(w5pow.keys())
        w = next(it)
        w5pow[w] = 5 ** w
        for w in it:
            if w - 1 in w5pow:
                val = w5pow[w - 1] * 5
            else:
                w2 = w >> 1
                val = w5pow[w2] * w5pow[w - w2]
            w5pow[w] = val

    def _str_to_int(l, r):
        if r - l <= DIGLIM:
            return int(s[l:r])
        split = (r - l) >> 1
        return ((_str_to_int(l, r - split) * w5pow[split]) << split) + _str_to_int(r - split, r)

    return _str_to_int(0, len(s))
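A quick consistency check of the three variants against int() itself (assumes the three functions above are defined in the same scope):

import random

s = "".join(random.choice("0123456789") for _ in range(10_000))
expect = int(s)
assert str_to_int_only_pow2_powers(s) == expect
assert str_to_int_recursive(s) == expect
assert str_to_int_iterative(s) == expect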
from_int_to_str.py
import decimal


def int_to_str_only_pow2_powers(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        pow2 = [D(2)]
        for _ in range((n.bit_length() - 1).bit_length() - 1):
            pow2.append(pow2[-1] * pow2[-1])

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            lg_w2 = (w - 1).bit_length() - 1
            w2 = 1 << lg_w2
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * pow2[lg_w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_recursive(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        mem = {}

        def w2pow(w):
            """ return D(2)**w and store the result """
            if w not in mem:
                if w - 1 in mem:
                    mem[w] = mem[w - 1] + mem[w - 1]
                elif w <= BITLIM:
                    mem[w] = D(2) ** w
                else:
                    w2 = w >> 1
                    mem[w] = w2pow(w2) * w2pow(w - w2)
            return mem[w]

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)


def int_to_str_iterative(n):
    BITLIM = 32768
    if n < 0:
        negate = True
        n = -n
    else:
        negate = False
    D = decimal.Decimal
    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        w2pow = {}
        w = n.bit_length()
        while w >= BITLIM:
            w2 = w >> 1
            if w & 1:
                w2pow[w2 + 1] = None
            w2pow[w2] = None
            w = w2
        if w2pow:
            it = reversed(w2pow.keys())
            w = next(it)
            w2pow[w] = D(2)**w
            for w in it:
                if w - 1 in w2pow:
                    val = w2pow[w - 1] + w2pow[w - 1]
                else:
                    w2 = w >> 1
                    assert w2 in w2pow
                    assert w - w2 in w2pow
                    val = w2pow[w2] * w2pow[w - w2]
                w2pow[w] = val

        def inner(n, w):
            if w <= BITLIM:
                return D(n)
            w2 = w >> 1
            hi = n >> w2
            lo = n - (hi << w2)
            return inner(hi, w - w2) * w2pow[w2] + inner(lo, w2)

        result = inner(n, n.bit_length())
        if negate:
            result = -result
        return str(result)
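And the same kind of spot check for the int -> str variants (again assuming the functions above are in scope):

import random

n = random.getrandbits(200_000)
expect = str(n)
assert int_to_str_only_pow2_powers(n) == expect
assert int_to_str_recursive(n) == expect
assert int_to_str_iterative(n) == expect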
About str_to_int needing a faster big int multiplication to be significantly faster than int(): I think that it is worth adding str_to_int even if no super fast big int multiplication currently exists in Python. If in the future Python adds faster big int multiplication, then str_to_int will run fast without needing to be modified in any way.
Finally, about implementing fast big int multiplication. I have not yet tried coding the Schönhage–Strassen algorithm in Python, but I'm fairly familiar with FFT algorithms and I believe I could get a pretty clean looking implementation using about 10 - 20 lines of Python code.
I agree these should be as similar as possible. I don't really care which style is picked.
I have no interest in being as fast as possible here. "Major improvement" is the goal. "Just" Karatsuba-like speedup is, as said before, huge. Ambition has forever been the fatal enemy of "major improvement" here, and after 30 years of that I'm tired of it :wink:.
For a given value, int->str is currently way slower than str->int, which is why I focused on the former.
$ timeit -s "i = 10**100000 - 1; s = str(i)" "str(i)"
2 loops, best of 5: 160 msec per loop
$ timeit -s "i = 10**100000 - 1; s = str(i)" "int(s)"
5 loops, best of 5: 58.2 msec per loop
But improving the asymptotics of str->int is also of real value.
I've made some more updates, getting close to merge quality now. I would still like some more unit tests and a good code review. My knowledge of the C-API is a bit out of date (e.g. PyUnicode_Writer, etc). I've renamed Mark's divmod_fast() to int_divmod, to be more consistent. I hope he might have time to give feedback on it (e.g. is it fit for purpose).
> Ambition has forever been the fatal enemy of "major improvement" here
Yeah, some improvement over the status quo is my main focus with this PR. I'm pretty sure people will come along and improve _pylong.py afterwards. Perhaps we will need to push back to prevent it from growing a lot of complexity. However, even with the simple and short version I have now, it seems we are getting some decent speedups if the integers are large enough.

Adding in a ContextVar-based mechanism seems like a neat idea but falls under the ambition category. Also, I think it could be worth considering making a semi-public API so packages like gmpy2 could monkey-patch _pylong. They should probably also be able to adjust the threshold levels for using _pylong, since those implementations have a different cutover performance point.
As I discussed before, I think it is important to be consistent with the style. Currently int_to_decimal and _str_to_int_inner are coded in different ways even though they fundamentally do the exact same thing, so their implementations could and should look almost the same.

Therefore, I think you should switch _str_to_int_inner to the str_to_int_iterative() function from from_str_to_int.py that I posted above. Doing that will make int_to_decimal and _str_to_int_inner have the same style.
> I think it is important to be consistent with the style.
Okay. To my eye, the recursive version reads easier and seems to perform just as well. I made a gist with the iterative version:

https://gist.github.com/nascheme/d04e6d63cfbd3e4614ca0e173b961de4

Is that correct? It is more lines of code, but I do see how it is more similar to int_to_decimal.

Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like. The pow5() approach seems quite nice to me. I'd assume you would need a similar pow2() function.
EDIT: oops! This originally forgot to add mem[w] = result to cache the result. No effect on correctness, just speed. The problem with using @functools.cache is that then the cache is "invisible", so we can't test whether w-1 is in it.
> Would you be able to create a recursive version of int_to_decimal? I'd like to see what it looks like.
He already did, but hiding in a file in a comment that doesn't show up unless you click on the right thing. I'll pull it out here, but renamed, and reworked some to be more of a straight drop-in replacement for what you already have, and with some comments. I agree the recursive form reads well.
def int_to_decimal(n):
    """Asymptotically fast conversion of an 'int' to Decimal."""
    # Function due to Tim Peters.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm that is
    # faster for large numbers.  Builds an equal decimal.Decimal in a
    # "clever" recursive way.  If we want a string representation, we
    # apply str to _that_.

    if _DEBUG:
        print('int_to_decimal', n.bit_length(), file=sys.stderr)

    D = decimal.Decimal
    D2 = D(2)

    BITLIM = 32768

    mem = {}

    def w2pow(w):
        """Return D(2)**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to Decimal.
        """
        if (result := mem.get(w)) is None:
            if w <= BITLIM:
                result = D2 ** w
            elif w - 1 in mem:
                result = (t := mem[w - 1]) + t
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w2pow(w2) * w2pow(w - w2)
            mem[w] = result
        return result

    def inner(n, w):
        if w <= BITLIM:
            return D(n)
        w2 = w >> 1
        hi = n >> w2
        lo = n - (hi << w2)
        return inner(lo, w2) + inner(hi, w - w2) * w2pow(w2)

    with decimal.localcontext() as ctx:
        ctx.prec = decimal.MAX_PREC
        ctx.Emax = decimal.MAX_EMAX
        ctx.Emin = decimal.MIN_EMIN
        ctx.traps[decimal.Inexact] = 1

        if n < 0:
            negate = True
            n = -n
        else:
            negate = False
        result = inner(n, n.bit_length())
        if negate:
            result = -result
    return result
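Regarding the EDIT note above, here is a rough sketch (not what this PR uses) of what a functools-based cache would look like, and the branch it cannot express:

import decimal
import functools

D = decimal.Decimal
BITLIM = 32768

@functools.lru_cache(maxsize=None)  # functools.cache on 3.9+
def w2pow_cached(w):
    """Return D(2)**w, memoized by functools instead of an explicit dict."""
    # Must be called under the same high-precision local context that
    # int_to_decimal() sets up, or large powers will trap/round.
    if w <= BITLIM:
        return D(2) ** w
    w2 = w >> 1
    # With the explicit dict we can first ask whether w - 1 is already
    # cached and use the cheaper mem[w - 1] + mem[w - 1]; the functools
    # cache cannot be probed by key, so that optimization is lost here.
    return w2pow_cached(w2) * w2pow_cached(w - w2)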
Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.
def _str_to_long_inner(s):
    """Asymptotically fast conversion of a 'str' to an 'int'."""
    # Function due to Bjorn Martinsson.  See GH issue #90716 for details.
    # https://github.com/python/cpython/issues/90716
    #
    # The implementation in longobject.c of base conversion algorithms
    # between power-of-2 and non-power-of-2 bases are quadratic time.
    # This function implements a divide-and-conquer algorithm making use
    # of Python's built in big int multiplication. Since Python uses the
    # Karatsuba algorithm for multiplication, the time complexity
    # of this function is O(len(s)**1.58).

    DIGLIM = 2048

    mem = {}

    def w5pow(w):
        """Return 5**w and store the result.

        Also possibly save some intermediate results. In context, these
        are likely to be reused across various levels of the conversion
        to 'int'.
        """
        if (result := mem.get(w)) is None:
            if w <= DIGLIM:
                result = 5 ** w
            elif w - 1 in mem:
                result = mem[w - 1] * 5
            else:
                w2 = w >> 1
                # If w happens to be odd, w - w2 is one larger than w2
                # now. Recurse on the smaller first (w2), so that it's
                # in the cache and the larger (w - w2) can be handled
                # by the cheaper `w - 1 in mem` branch instead.
                result = w5pow(w2) * w5pow(w - w2)
            mem[w] = result
        return result

    def inner(a, b):
        if b - a <= DIGLIM:
            return int(s[a:b])
        mid = (a + b + 1) >> 1
        return inner(mid, b) + ((inner(a, mid) * w5pow(b - mid)) << (b - mid))

    return inner(0, len(s))
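A quick round-trip check of the two functions above (assumes both are defined in the same scope, with _DEBUG set to a false value and sys imported, as in _pylong.py):

import random

for _ in range(5):
    n = random.getrandbits(random.randrange(1, 200_000))
    s = str(n)
    assert str(int_to_decimal(n)) == s
    assert _str_to_long_inner(s) == n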
Is the goal here to make memory, not time, the limiting factor, with hopes that there would be no need for a limit on str(int) conversion to protect against a DoS attack?
> Is the goal here to make memory, not time, the limiting factor, with hopes that there would be no need for a limit on str(int) conversion to protect against a DoS attack?
That's not my goal, and I doubt it's Neil's either. The primary goal is to make decimal_str <-> int conversions asymptotically faster on large values, for the sake of speed. Their current quadratic-time behavior has been an ongoing background complaint for decades.
I feel this is ready for review now. Removing Mark as a requested reviewer since he mentioned he might not have time to do it, at least anytime soon.
I don't think there is any huge rush to get this in. I was hoping it might help reduce the pain of the CVE-2020-10735 mitigations. However, the threshold for kicking into the _pylong functions is quite a bit higher than the default 4300 limit. Maybe with some more work we can get there. Also, implementing a ContextVar system in _pylong and then putting the int_max_str_digits check in there could be a good idea. That can be done as a separate PR though.
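A very rough sketch of what a ContextVar-based limit inside _pylong might look like; none of these names exist in CPython, this is purely illustrative:

import contextvars

# Hypothetical: a per-context digit limit instead of a global interpreter
# setting.  A value of 0 means "no limit" in this sketch.
_max_str_digits = contextvars.ContextVar("max_str_digits", default=4300)

def _check_conversion_size(ndigits):
    limit = _max_str_digits.get()
    if limit and ndigits > limit:
        raise ValueError(f"conversion exceeds the limit of {limit} digits")

# A caller could then scope an override without touching global state:
token = _max_str_digits.set(0)
try:
    _check_conversion_size(10_000_000)  # allowed while the override is active
finally:
    _max_str_digits.reset(token)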
The int->str and str->int code are very similar in concept, and should be next to each other in the code (not with unrelated division between them).
While the concepts are very similar, the code is currently quite different in essentially pointless ways. Within the last day, @bjorn-martinsson posted a nearly-identical-as-possible recoding of str->int, in the comment starting with:
> Here is a _str_to_long_inner implementation in the same exact style as the one Tim just posted.
IMO, that should be adopted.
I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.
Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.
> I don't know about ContextVars, and a quick skim of the docs left me more confused about why they exist than before I looked ☹️.
They seem quite similar to dynamically scoped variables as in Lisp, i.e. crawl up the stack to find the most recent binding. The mechanics are different but the behavior seems similar to me. We can forget about that stuff for this PR, I think.
> Regardless, whether 4300 is bits or decimal digits, it appears to me to be absurdly low. By eyeball, conversions of values of that size appear instantaneous today. Indeed, same thing by eyeball at 10x that size (43_000 bits or digits). At yet another 10x larger, then I see a perceptible delay.
It is 4300 decimal digits, and it does seem low as a default to me too.
./python -m timeit -s "s='1'*4000" "int(s)"
5000 loops, best of 5: 78.2 usec per loop
./python -m timeit -s "s='1'*4000; i = int(s)" "str(i)"
2000 loops, best of 5: 162 usec per loop
> I don't see a sane way to apply decimal in the decimal string -> binary int direction.

I have one that seems faster than this module's str_to_int starting around 10^7 digits. At 10^8 digits it's about three times faster (on Python 3.8.12). Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
> I have one that seems faster than this module's str_to_int starting around 10^7 digits. At 10^8 digits it's about three times faster (on Python 3.8.12). Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
I think starting a discussion in Python Ideas (discuss.python.org) or creating a draft PR would be fine. I'm not completely opposed to discussing it here. I just don't want this PR to fall into the tarpit of endlessly debating further algorithm refinements.
> Where's a good place to show it? (If I understand correctly, this issue is not the place for further alternatives.)
Right, this pull request is about the specific code Neil is working on (in the "Files changed" tab). The linked issue report is for general discussion, although this UI buries the link at the bottom of the initial message (so it's hard to realize it's there unless you know in advance where to look for it). Here it is: gh-90716
I don't intend to argue with Neil :wink:, but it's my original issue report, and I wholly intended it to be open to discuss any and all relevant algorithms. This pull request is Neil's, though, and is about the specific algorithms he's working on. If a different pull request ends up implementing different algorithms, that's fine - while it's uncommon, there's nothing unprecedented, or "wrong", about having multiple PRs associated with a single issue report.
Ok, posted in gh-90716.
FYI, as of 3.11 we "freeze" most of the [pure Python] modules that are imported during startup. (See gh-89183.) The idea is that we minimize the amount of disk I/O and CPU time needed to import those modules, by pre-running as much of the import process as possible at build time and storing the result in the DATA section of the python binary. This speeds up startup significantly. It also allows us to import those modules much earlier during runtime initialization since we don't rely on OS functionality.
Basically, the module's code object is converted (as generated code) to C static variables, statically initialized, and then compiled into the binary (like builtin modules are). (That's the newer "deepfreeze". There's a simpler, older "freeze" mechanism that deepfreeze builds on.) Tools/scripts/freeze_modules.py is how we manage the frozen modules (and add freezing as a build step). Python/deepfreeze/deepfreeze.c is the generated code. To add a module to the set of those that get frozen, add the name to the FROZEN dict at the top of Tools/scripts/freeze_modules.py.
As I said, a main benefit of freezing is that it helps lower startup time. That doesn't seem so relevant here. However, I'd be surprised if coupling a builtin type to a Python module would not benefit from the module being frozen.
Anyway, I just want to point this out in case it's useful. (It's a relatively new thing we're doing.)
Benchmark script extracted from bm_pidigits.py, can be run as stand-alone script.
Another benchmark, trying to determine a good cutoff for switching to int_divmod. I'm using (size_w > 300 && (size_v - size_w) > 300) now, which seems better.
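A rough way to probe that crossover (a sketch; it assumes a build of this branch where _pylong.int_divmod is importable, with sizes expressed in CPython's 30-bit digits as on typical 64-bit builds). Note that on a patched build divmod() itself may already dispatch to _pylong above the chosen threshold, so the ratio is most meaningful near the cutoff:

import timeit
import _pylong

def probe(size_w, size_diff, number=20):
    # size_w is the divisor size, size_diff the extra size of the dividend,
    # both in 30-bit digits.
    b = (1 << (30 * size_w)) - 1
    a = (1 << (30 * (size_w + size_diff))) - 1
    t_c = timeit.timeit(lambda: divmod(a, b), number=number)
    t_py = timeit.timeit(lambda: _pylong.int_divmod(a, b), number=number)
    return t_c / t_py

for size_w in (100, 300, 1000):
    for size_diff in (100, 300, 1000):
        print(size_w, size_diff, round(probe(size_w, size_diff), 2))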