exp should not use slow Newton iteration
This performance bug affects both relative and absolute power series. AbstractAlgebra implements two algorithms for power series exponentials: the n^2/2 basecase algorithm is used for series over rings, while Newton iteration, which is O(M(n)) with a large implied constant, is used for series over fields. Newton iteration is only going to win in power series rings with subquadratic multiplication and division, and then only at huge precision.
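For reference, the basecase algorithm solves f' = a'·f coefficient by coefficient: with f = exp(a), comparing coefficients of x^(k-1) gives the standard recurrence

    f_k = \frac{1}{k} \sum_{j=1}^{k} j\, a_j\, f_{k-j}, \qquad f_0 = \exp(a_0),

which costs about n^2/2 base ring multiplications for n coefficients.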
Compare (exp2 is the basecase algorithm):
julia> function Base.exp2(a::AbstractAlgebra.RelSeriesElem{T}) where {T <: FieldElement}
           if iszero(a)
               z = one(parent(a))
               z = set_precision!(z, precision(a))
               return z
           end
           vala = valuation(a)
           preca = precision(a)
           z = parent(a)()
           R = base_ring(a)
           fit!(z, preca)
           z = set_precision!(z, preca)
           z = set_valuation!(z, 0)
           c = vala == 0 ? polcoeff(a, 0) : R()
           z = setcoeff!(z, 0, AbstractAlgebra.exp(c))
           len = pol_length(a) + vala
           for k = 1 : preca - 1
               s = R()
               for j = 1 : min(k + 1, len) - 1
                   c = j >= vala ? polcoeff(a, j - vala) : R()
                   s += j * c * polcoeff(z, k - j)
               end
               !isunit(R(k)) && error("Unable to divide in exp")
               z = setcoeff!(z, k, divexact(s, k))
           end
           z = set_length!(z, normalise(z, preca))
           return z
       end
julia> R, x = PowerSeriesRing(QQ, 100, "x")
(Univariate power series ring in x over Rationals, x + O(x^101))
<only showing fastest timings>
julia> @time exp(x);
0.058922 seconds (947.00 k allocations: 19.931 MiB)
julia> @time exp2(x);
0.000426 seconds (8.06 k allocations: 162.375 KiB)
julia> @time exp(exp(x)-1);
0.269644 seconds (1.92 M allocations: 41.292 MiB, 14.27% gc time)
julia> @time exp2(exp2(x)-1);
0.026014 seconds (299.07 k allocations: 6.174 MiB)
julia> C, a = PolynomialRing(QQ, "a")
(Univariate Polynomial Ring in a over Rationals, a)
julia> R, x = PowerSeriesRing(FractionField(C), 15, "x")
(Univariate power series ring in x over Fraction field of Univariate Polynomial Ring in a over Rationals, x + O(x^16))
julia> @time exp(exp(x * (1 // a) + x^2 * 1 // (1 + a)) - 1);
68.153794 seconds (10.05 M allocations: 54.211 GiB, 2.32% gc time)
julia> @time exp2(exp2(x * (1 // a) + x^2 * 1 // (1 + a)) - 1);
0.935616 seconds (2.71 M allocations: 232.746 MiB, 4.94% gc time)
By the way, is there a good reason why the following doesn't work:
julia> x * (1 // a)
1//a*x + O(x^16)
julia> x // a
ERROR: MethodError: no method matching //(::AbstractAlgebra.Generic.RelSeries{AbstractAlgebra.Generic.Frac{AbstractAlgebra.Generic.Poly{Rational{BigInt}}}}, ::AbstractAlgebra.Generic.Poly{Rational{BigInt}})
Closest candidates are:
//(::T, ::FracElem{T}) where T<:RingElem at /home/fredrik/.julia/dev/AbstractAlgebra/src/generic/Fraction.jl:82
//(::FracElem{T}, ::T) where T<:RingElem at /home/fredrik/.julia/dev/AbstractAlgebra/src/generic/Fraction.jl:84
//(::T, ::T) where T<:RingElem at /home/fredrik/.julia/dev/AbstractAlgebra/src/generic/Fraction.jl:70
...
Stacktrace:
[1] top-level scope at REPL[128]:1
On a more basic level, I believe this also applies to inversion/division.
I believe that the idea is that we will make the series generic over poly rings and then implement subquadratic multiplication for generic polynomials.
You may still be right about the performance even then. But at least the intention is/was to speed this up using, say, Karatsuba in the generic case, and of course something much better for all Flint polynomials.
As for the // operator, no, that's just a missing function. There should be something generic for this in Fields.jl probably, but no one added it yet.
Sure, but right now the Newton version doesn't make much sense; it's not a trivial difference in speed. Even with Karatsuba multiplication, there's going to be a cutoff (greater than the Karatsuba cutoff) where Newton wins, and even then the Newton iteration will be faster if started from the basecase cutoff and not from 1. The basecase is also superior when the input is short (the complexity is actually O(nm), where m is the length of the input).
Flint types are not really relevant because the Flint polynomial types have their own exp_series methods (or should have, in case any are missing).
Making the series generic over polynomial rings would be nice; I think I'd like to have most basic series algorithms implemented for polynomials, with series types more concerned just with tracking precision. The code duplication between RelSeries and AbsSeries is a bit annoying.
I don't think we'll be able to fix the duplication between RelSeries and AbsSeries; we just won't need the Nemo series modules in their current form. Or something like that.
I think the Newton iteration was added with the idea that faster polynomial multiplication would be implemented. However, faster polynomial multiplication is tricky because it depends on functions for determining cutoffs. Furthermore, if we now have two algorithms for exp, log, sin, cos, ..., then each of these needs functions for determining cutoffs...
We should add a function that returns the threshold N for using subquadratic multiplication for a given ring. Then inv, div, sqrt, exp, log, sin, cos, ... could all use the basecase algorithm when prec < 2N (or where relevant, min(prec, inputlen) < 2N) and otherwise use Newton iteration down to the basecase range (not to 1!). The factor 2 could be set to some other reasonable constant in each function, while ring-dependent tuning of N needs to be done only for multiplication.
This kind of tuning is never going to be perfect, but it should provide a reasonably smooth transition between optimal basecase behavior and optimal asymptotic behavior.
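To make this concrete, here is a rough sketch of the kind of dispatch I have in mind (mul_cutoff, exp_series_basecase, exp_series_newton and newton_step_exp are hypothetical names, not existing AbstractAlgebra functions):

# Hypothetical sketch only. Per-ring threshold at which subquadratic
# multiplication starts to win; tuned once per ring, with a generic default.
mul_cutoff(R) = 50

function exp_series(a, prec::Int)
    N = mul_cutoff(base_ring(a))
    if min(prec, pol_length(a)) < 2N
        # short input or low precision: the O(nm) basecase wins
        return exp_series_basecase(a, prec)
    end
    # otherwise use Newton iteration, but started from the basecase
    # range rather than from precision 1
    f = exp_series_basecase(a, 2N)
    n = 2N
    while n < prec
        n = min(2n, prec)
        f = newton_step_exp(f, a, n)   # one precision-doubling step (hypothetical)
    end
    return f
end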
That sounds like a reasonable way to keep the tuning under control @fredrik-johansson
In the meantime I am going to benchmark the two existing exp and inv functions with number fields, which is presumably the Hecke application (the Newton iteration code was taken from there). If it is actually slower than what we had before, I'll remove it pending additional work and open a ticket for it.
Yeah, the same conclusions do not apply for degree 10 number fields:
julia> using Nemo
julia> import AbstractAlgebra
julia> function Base.exp2(a::AbstractAlgebra.RelSeriesElem{T}) where {T <: FieldElement}
           if iszero(a)
               z = one(parent(a))
               z = set_precision!(z, precision(a))
               return z
           end
           vala = valuation(a)
           preca = precision(a)
           z = parent(a)()
           R = base_ring(a)
           fit!(z, preca)
           z = set_precision!(z, preca)
           z = set_valuation!(z, 0)
           c = vala == 0 ? polcoeff(a, 0) : R()
           z = setcoeff!(z, 0, AbstractAlgebra.exp(c))
           len = pol_length(a) + vala
           for k = 1 : preca - 1
               s = R()
               for j = 1 : min(k + 1, len) - 1
                   c = j >= vala ? polcoeff(a, j - vala) : R()
                   s += j * c * polcoeff(z, k - j)
               end
               !isunit(R(k)) && error("Unable to divide in exp")
               z = setcoeff!(z, k, divexact(s, k))
           end
           z = set_length!(z, normalise(z, preca))
           return z
       end
julia> R, x = PolynomialRing(QQ, "x")
(Univariate Polynomial Ring in x over Rational Field, x)
julia> S, a = NumberField(x^10 + 3x + 1, "a")
(Number field over Rational Field with defining polynomial x^10 + 3*x + 1, a)
julia> T, y = PowerSeriesRing(S, 100, "y")
(Univariate power series ring in y over S, y + O(y^101))
julia> f = rand(T, 100:100, -10:10);
julia> @time exp(f);
0.000318 seconds (4.24 k allocations: 469.047 KiB)
julia> @time exp2(f);
0.054354 seconds (194.69 k allocations: 18.293 MiB, 20.42% gc time)
The times are after running once for JIT compilation.
So I can see why Hecke added this code. Suggestions?
Could we still do ring-dependent tuning of the cutoff parameters?
I guess so, but the issue here is that there is no fast multiplication. For number fields we just have generic stuff. There's not much point tuning for the wrong thing.
please don't forget about #769
I can confirm your timings: I get a crossover at length 6 with your example.
Part of the problem might be that the basecase code performs arithmetic with full reductions. The inner loop ought to use unreduced ("nored") operations. (It would be even cleaner to express such accumulation loops using dot product functions, which can be optimized for each ring.)
I will rewrite it to use delayed reduction. We have a generic mechanism for this now and it does support number fields.
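To spell out the pattern (a minimal sketch; dot_delayed is a made-up name, but addmul_delayed_reduction! and reduce! are the generic interface used in the revised exp2 further down): products are accumulated without intermediate reduction and a single reduce! happens at the end.

# Minimal sketch of a delayed-reduction dot product over a ring R,
# e.g. a number field: addmul_delayed_reduction! accumulates c[j]*z[j]
# into s without reducing, using t as scratch space, and reduce!
# performs one reduction at the very end.
function dot_delayed(c::Vector{T}, z::Vector{T}, R) where {T <: FieldElement}
    s = R()   # accumulator
    t = R()   # scratch space for unreduced products
    for j = 1:length(c)
        s = addmul_delayed_reduction!(s, c[j], z[j], t)
    end
    return reduce!(s)
end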
@tthsqe12 What is the conclusion with that? Does it significantly slow down the JIT? Is it worth doing? We really need per-ring cutoffs to make it worthwhile. It also seems to need a rebase.
I timed it with delayed reduction in exp2. The times go down, but nowhere near enough. It is still 20 times slower than the exp from Hecke.
julia> @time exp2(f);
0.004758 seconds (81.40 k allocations: 8.386 MiB)
julia> @time exp(f);
0.000214 seconds (4.24 k allocations: 469.047 KiB)
It's possible that the extra allocations are the issue. But I'm not sure I see how to remove them.
The problem is likely the pesky scalar multiplication by j.
Well, I got rid of that problem (by reading coefficients from the derivative d = a' instead of multiplying by j, since j*a_j is just a coefficient of a'), but the result is still 10x too slow:
julia> @time exp2(f);
0.002845 seconds (42.21 k allocations: 4.349 MiB)
julia> @time exp(f);
0.000210 seconds (4.24 k allocations: 469.047 KiB)
I think the issue now is that getting a coefficient creates a new object. There's nothing we can do about this.
Basically what is true in Flint/C is just untrue here. If we had polynomials over number fields in C we'd probably be drawing different conclusions.
Edit: actually, even if we had such polynomials in C, we'd still incur the cost of getting coefficients in generic series code, so I'm not really sure if that is an explanation or not.
By the way, the Newton algorithm can also be improved by doing the logarithm Newton iteration simultaneously (see nmod_poly_exp_series).
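For anyone unfamiliar with the trick: Newton's method applied to log(f) - a = 0 gives

    f_{2n} = f_n + f_n (a - \log f_n) \bmod x^{2n}, \qquad \log f = \int f'/f,

and since computing log f_n requires 1/f_n, one maintains g ≈ 1/f with its own interleaved Newton step

    g_{2n} = 2 g_n - f_n g_n^2 \bmod x^{2n},

so each precision doubling costs only a few multiplications and no standalone power series division. (This is my reading of nmod_poly_exp_series; the actual Flint code refines the update order further.)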
@wbhart I think your benchmark is bogus.
This
julia> f = rand(T, 100:100, -10:10);
creates a power series with valuation 100. Granted, this is something we might want to optimize for separately, but it's hardly typical input. If I just do
julia> f = sum(rand(S, -10:10) * y^k for k = 1:99);
I get more reasonable results:
julia> @time exp(f);
0.929874 seconds (529.46 k allocations: 77.917 MiB, 1.23% gc time)
julia> @time exp(f);
0.573891 seconds (203.04 k allocations: 61.665 MiB, 0.45% gc time)
julia> @time exp(f);
0.568137 seconds (199.38 k allocations: 61.650 MiB, 0.39% gc time)
julia> @time exp2(f);
0.259666 seconds (402.71 k allocations: 30.962 MiB, 1.07% gc time)
julia> @time exp2(f);
0.237256 seconds (321.52 k allocations: 27.197 MiB, 1.45% gc time)
julia> @time exp2(f);
0.218811 seconds (218.61 k allocations: 21.015 MiB)
So here exp2 is about twice as fast as exp.
Yes, you are right. By the way, I forgot to append the version with delayed reduction and other improvements I made:
function Base.exp2(a::AbstractAlgebra.RelSeriesElem{T}) where {T <: FieldElement}
    if iszero(a)
        z = one(parent(a))
        z = set_precision!(z, precision(a))
        return z
    end
    vala = valuation(a)
    preca = precision(a)
    z = parent(a)()
    R = base_ring(a)
    fit!(z, preca)
    z = set_precision!(z, preca)
    z = set_valuation!(z, 0)
    c = vala == 0 ? polcoeff(a, 0) : R()
    z = setcoeff!(z, 0, AbstractAlgebra.exp(c))
    len = pol_length(a) + vala
    C = R()            # scratch element for delayed reduction
    d = derivative(a)  # j*a_j is a coefficient of a', avoiding the scalar multiplication by j
    vald = valuation(d)
    for k = 1 : preca - 1
        s = R()
        for j = 1 : min(k + 1, len) - 1
            c = j >= vala ? polcoeff(d, j - vald) : R()
            s = addmul_delayed_reduction!(s, c, polcoeff(z, k - j), C)
        end
        s = reduce!(s)  # single reduction per coefficient
        !isunit(R(k)) && error("Unable to divide in exp")
        z = setcoeff!(z, k, divexact(s, k))
    end
    z = set_length!(z, normalise(z, preca))
    return z
end
Here are the times I get on my machine:
julia> f = sum(rand(S, -10:10) * y^k for k = 1:99);
julia> @time exp(f);
0.578332 seconds (570.79 k allocations: 80.046 MiB, 4.08% gc time)
julia> @time exp(f);
0.360110 seconds (208.19 k allocations: 62.241 MiB, 15.71% gc time)
julia> @time exp(f);
0.344125 seconds (201.34 k allocations: 61.936 MiB, 11.98% gc time)
julia> @time exp(f);
0.304410 seconds (190.86 k allocations: 61.677 MiB, 0.62% gc time)
julia> @time exp2(f);
0.097381 seconds (107.68 k allocations: 15.099 MiB)
julia> @time exp2(f);
0.074569 seconds (60.53 k allocations: 12.797 MiB, 2.31% gc time)
julia> @time exp2(f);
0.072458 seconds (59.88 k allocations: 12.758 MiB)
julia> @time exp2(f);
0.073720 seconds (59.89 k allocations: 12.769 MiB)
So I agree with your conclusions: exp should not use slow Newton iteration. Of course we should still time everything once Karatsuba is in use. I plan to work on this issue at some point fairly soon.
I have now implemented Newton iteration in Calcium, where I have KS multiplication over number fields. Based on some rough benchmarks, I have enabled it for n > d where d is the degree of the number field. Timings with random coeffs +/- 10 and random denominator 1...10:
Q(i) (degree 2)

    n    basecase    new (basecase + Newton)
   10    1.44e-05    1.48e-05
   20    0.000173    9.22e-05
   40    0.0013      0.000397
   80    0.0154      0.00217
  160    0.133       0.00814
  320    1.408       0.0386
  640    13.93       0.207
 1280    160.246     1.171

Q(zeta_28) (degree 12)

    n    basecase    new (basecase + Newton)
   10    0.000106    0.000107
   20    0.000859    0.000791
   40    0.00618     0.00421
   80    0.0316      0.0138
  160    0.334       0.075
  320    1.538       0.253
  640    16.867      1.701
 1280    165.483     8.368

Q(zeta_111) (degree 72)

    n    basecase    new (basecase + Newton)
   10    0.000953    0.000933
   20    0.013       0.0137
   40    0.098       0.1
   80    0.711       0.258
  160    4.885       0.921
  320    37.128      4.946
  640    235.359     17.951
So Newton is (unsurprisingly) fast when the multiplication is fast.
Not sure what the situation will be like with Karatsuba alone.
I think we also have a generic KS knocking around somewhere, so I'll check to see.
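(For reference, KS = Kronecker substitution: reduce polynomial multiplication to a single big integer multiplication by evaluating at a sufficiently large power of two. A toy illustration over nonnegative BigInt coefficients; ks_mul is a made-up name, and real implementations handle signs and tighter packing:)

# Toy Kronecker substitution: pack coefficients into big integers with
# non-overlapping bit fields, multiply once, then unpack the product.
function ks_mul(a::Vector{BigInt}, b::Vector{BigInt})
    bound = maximum(a) * maximum(b) * min(length(a), length(b))
    bits = ndigits(bound, base = 2) + 1   # field width so coefficients don't overlap
    A = evalpoly(big(2)^bits, a)          # a(2^bits)
    B = evalpoly(big(2)^bits, b)
    C = A * B
    mask = big(2)^bits - 1
    c = BigInt[]
    for i = 1:(length(a) + length(b) - 1)
        push!(c, C & mask)                # extract one coefficient of the product
        C >>= bits
    end
    return c
end

julia> ks_mul(BigInt[1, 2, 3], BigInt[4, 5])   # (1 + 2x + 3x^2)(4 + 5x)
4-element Vector{BigInt}:
  4
 13
 22
 15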