Add nmod poly divrem with divisor $x^n - c$
This adds functions for computing the quotient and remainder in the division by $x^n - c$. Special cases $c = -1$, $c = 0$, $c = 1$ are handled more efficiently with specific code, and for general $c$ this uses precomputations on $c$ if the modulus allows it.
Tests, documentation, profiling files updated.
Excerpt of profile output on zen4 and meteor lake (full output in file below) :
profiled: 1. divrem 2. new, poly interface 3. new, vec general 4. new, precomp 5. new, precomp_lazy
note: variant 2. expected to be faster than variants 3.4.5. when c == 1 or c == -1
zen4 AMD Ryzen 7 PRO 7840U 5.1GHz meteor lake Intel Core Ultra 7 165H 5.0GHz
bit c len n 1. 2. 3. 4. 5. 1. 2. 3. 4. 5.
## c == 1 | modulus x**n - 1
62 0 5 1 2.1e+02 1.5e+02 1.1e+02 8.7e+01 4.7e+01 1.9e+02 1.6e+02 1.3e+02 1.1e+02 4.2e+01
62 0 40 5 2.6e+03 1.7e+02 6.5e+02 5.1e+02 1.1e+02 2.7e+03 1.5e+02 6.1e+02 4.9e+02 8.2e+01
62 0 160 25 1.0e+04 3.2e+02 2.7e+03 1.9e+03 3.3e+02 9.1e+03 2.7e+02 2.4e+03 2.0e+03 2.7e+02
62 0 2560 25 1.8e+05 3.7e+03 4.9e+04 3.4e+04 5.5e+03 1.5e+05 3.6e+03 4.4e+04 3.5e+04 5.0e+03
62 0 2560 625 7.3e+05 2.4e+03 3.6e+04 2.5e+04 4.0e+03 7.3e+05 3.0e+03 3.3e+04 2.6e+04 3.9e+03
62 0 20480 25 1.5e+06 2.9e+04 3.9e+05 2.8e+05 4.4e+04 1.2e+06 2.8e+04 3.6e+05 2.9e+05 4.1e+04
62 0 20480 3125 6.7e+06 2.2e+04 3.3e+05 2.3e+05 3.6e+04 6.9e+06 2.9e+04 3.0e+05 2.4e+05 3.4e+04
62 0 327680 25 2.3e+07 4.9e+05 6.3e+06 4.5e+06 7.1e+05 2.1e+07 6.2e+05 5.8e+06 4.6e+06 6.7e+05
62 0 327680 3125 1.1e+08 3.9e+05 6.1e+06 4.3e+06 6.7e+05 1.2e+08 6.4e+05 5.6e+06 4.4e+06 6.3e+05
## c general | modulus x**n - c
62 2 5 1 2.1e+02 1.2e+02 1.2e+02 9.3e+01 4.7e+01 1.9e+02 1.2e+02 1.4e+02 1.2e+02 4.1e+01
62 2 40 5 2.6e+03 2.0e+02 7.3e+02 5.5e+02 1.1e+02 2.6e+03 1.9e+02 6.8e+02 5.6e+02 8.1e+01
62 2 160 25 1.1e+04 5.6e+02 2.7e+03 1.9e+03 3.2e+02 9.2e+03 5.8e+02 2.5e+03 2.0e+03 2.7e+02
62 2 2560 25 1.8e+05 6.9e+03 4.9e+04 3.4e+04 5.5e+03 1.5e+05 7.1e+03 4.4e+04 3.6e+04 5.0e+03
62 2 2560 625 7.2e+05 5.6e+03 3.7e+04 2.5e+04 4.0e+03 7.4e+05 6.1e+03 3.3e+04 2.6e+04 3.8e+03
62 2 20480 25 1.5e+06 5.3e+04 4.0e+05 2.8e+05 4.4e+04 1.2e+06 5.5e+04 3.6e+05 2.9e+05 3.9e+04
62 2 20480 3125 6.6e+06 4.7e+04 3.3e+05 2.3e+05 3.6e+04 6.7e+06 5.3e+04 2.9e+05 2.3e+05 3.3e+04
62 2 327680 25 2.3e+07 8.7e+05 6.4e+06 4.5e+06 7.2e+05 2.0e+07 9.4e+05 5.8e+06 4.6e+06 6.7e+05
62 2 327680 3125 1.1e+08 8.4e+05 6.2e+06 4.3e+06 6.8e+05 1.2e+08 9.1e+05 5.6e+06 4.4e+06 6.4e+05
A bit unusual for this code is that it does a lot of logic in the non-underscore method.
Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the fq_nmod context and reimplemented as a hack (with the precomputation done inline) in _nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.
A bit unusual for this code is that it does a lot of logic in the non-underscore method.
Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the
fq_nmodcontext and reimplemented as a hack (with the precomputation done inline) in_nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.
This question --- fast remainder on division by a binomial $x^n - c$ --- arose while working on the draft PR on n_fft. There I only need the underscore lazy variant, and maybe the $x^n - 1$ variant. Also, no need for the quotient there, but there the remainder is computed in-place (erasing the input polynomial) in which case the quotient appears during the computation anyway.
So my main wish is to have good performance of the underscore lazy variant (which probably means keeping this one specific to binomials). For the rest, I thought this was a fairly classical operation to work on univariate polynomials modulo $x^n - 1$ or $x^n + 1$, even when the modulus is not prime, but I actually have no use in mind for the general non-underscore variant. It could indeed be modified/discarded to provide this via a more general sparse division (I will have a look at the pointers you give).
A bit unusual for this code is that it does a lot of logic in the non-underscore method.
Does this functionality actually really make sense as a standalone method, or could it be a special case of a method for dividing by a sparse polynomial after a precomputation? This is currently used in the
fq_nmodcontext and reimplemented as a hack (with the precomputation done inline) in_nmod_poly_divrem_newton_n_preinv. One could very easily put in a branch to check for monic binomials, among with other special cases like monic trinomials. I thought about putting in Shoup precomputations and lazy reduction there too but didn't get around to it. It would be nice to do this systematically for other sparse polynomials. To clarify, I have no objection to the standalone function proposed in this PR, just asking if it doesn't make sense to generalise it right away.
I've just had a look at _nmod_poly_divrem_try_sparse . As you wrote, there is clear overlap in this PR with what is there, so this should be worked on before any kind of merge. I will put this PR in draft for now and handle this later.
Interestingly, for monic binomials, the code in try_sparse is slightly faster than the code in this PR when Shoup precomputations are not possible. I haven't checked why, but this is not due to temp allocations alone. Possibly the mix of while and for loop in the variant I wrote prevents the compiler from unrolling loops nicely. The roadmap for this PR could then become something like this, to get the best of both pieces of code:
- [ ] make sure performance (without precomputations) is at least as good as
try_sparse, - [ ] allow sparse monic polynomials with more than two monomials (like
try_sparse), - [ ] incorporate some Shoup precomputations / laziness when the modulus permits,
- [ ] expose this as some
divrem_sparseandrem_sparsefunction? - [ ] (... and don't forget to write documentation)
By the way, in case this rings any bell of other places where this may be useful/existing in flint: for n_fft things I will also need the "transposed" operation of remainder mod $x^n - c$, which consists in extending a linearly recurrent sequence with recurrence $x^n - c$ from its first $n$ terms $u_0,\ldots,u_{n-1}$ (in this binomial case, this is just multiplying these terms by suitable powers of $c$).