Arb mulhigh
I don't think we should have different precisions on different systems (currently flint_mpn_mulhigh is only available on some x86-64 systems), but this is my very preliminary draft on how I want arb_mul to look like.
Left to fix:
- [ ] Make
flint_mpn_mulhighavailable on all systems. - [ ] Implement
_arb_mul_special - [ ] Make sure that the mag is has the correct bound --
arf_mul_rnd_sloppyrounds input in order to be able to perform a n-by-n high multiplication. So one has to account for rounding before the multiplication is done as well as the fact that the multiplication is inexact. - [ ] Discuss if we should account for the precision for the normal case (i.e. when
arf_mul_rnd_sloppyis used) -- should we round the result or not?
Yeah, I wouldn't do this before having a generic C fallback for and fixing the performance issues for flint_mpn_mulhigh / flint_mpn_mulhigh_normalised, so that these functions can be relied upon without conditional directives creeping into higher level code.
It will certainly not work to just take the shorter of the input lengths as the precision in arf_mul_rnd_sloppy. In general, the inputs can be shorter or longer than the precision, or a mixture. A bit of logic is needed for this. It will even be optimal to zero-pad operands in some cases.
Note that it is also necessary to normalise the arf_t output by removing trailing zero limbs.
I think this work would be aided by having a general flint_mpn_mulhigh which computes the top n (+ 1) limbs of an m x p product, m + p >= n (+ 1).
Such a method would be a bit complex, but it would offload complexity from the arf/arb side.
In general, isn't it true that an arb_t is on the form $x = a 2^{m} \pm b 2^{n}$ with $a, b, m, n \in \mathbb{Z}$ and $|b 2^{n}| > 2^{m}$? Or am I thinking wrong that only calculating an balanced high multiplication for unbalanced multiplicants will not give a so-much-bigger error bound?
mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function
ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);
which requires space for all an+bn limbs in x but is only responsible for writing [xlo,xhi). If the return is e, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.
mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function
ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);which requires space for all
an+bnlimbs in x but is only responsible for writing[xlo,xhi). If the return ise, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.
Yep, that should improve the division and square root code quite a bit.
mulhigh is already doing good things for nfloat though. I think I want to iron out all the basic algorithms there before trying to improve arf/arb.
mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function
ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);which requires space for all
an+bnlimbs in x but is only responsible for writing[xlo,xhi). If the return ise, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.
Depends on what Newton's method you are talking about. I believe that precomputed inverses and reciprocal square roots via Newton's method really favors mulhi.
mulhi is a special case of mulmid, and some of your mulhis for inversion don't need all of the high bits. It is even in the TODO.md:
nmod_poly
- Implement fast
mulmidand use to improve Newton iteration
fmpzs are just nmod_polys with carries, so the same principles apply.
You can see this in https://github.com/flintlib/flint/blob/80e9b245ca983bb40dc83c2a289302a50f02db16/src/gr_poly/inv_series_newton.c#L57 where the low m coefficients of the mullow output are never used.
(Of course high/low Newton divison are the same but reversed.)
For beyond 20M bits, I am seeing a ~30% improvement in arb_inv with this mulmid over what is currently in flint. (Still have to tune it in the medium range.)
For beyond 20M bits, I am seeing a ~30% improvement in arb_inv with this mulmid over what is currently in flint. (Still have to tune it in the medium range.)
That sounds excellent.
Of course, the current arb code is a placeholder implementation using arf arithmetic. For medium precision, having everything in mpn form should be a bit better.