flint Arb mulhigh

I don't think we should have different precisions on different systems (currently flint_mpn_mulhigh is only available on some x86-64 systems), but this is my very preliminary draft on how I want arb_mul to look like.

Left to fix:

[ ] Make flint_mpn_mulhigh available on all systems.
[ ] Implement _arb_mul_special
[ ] Make sure that the mag is has the correct bound -- arf_mul_rnd_sloppy rounds input in order to be able to perform a n-by-n high multiplication. So one has to account for rounding before the multiplication is done as well as the fact that the multiplication is inexact.
[ ] Discuss if we should account for the precision for the normal case (i.e. when arf_mul_rnd_sloppy is used) -- should we round the result or not?

Feb 26 '24 19:02 albinahlback

Yeah, I wouldn't do this before having a generic C fallback for and fixing the performance issues for flint_mpn_mulhigh / flint_mpn_mulhigh_normalised, so that these functions can be relied upon without conditional directives creeping into higher level code.

It will certainly not work to just take the shorter of the input lengths as the precision in arf_mul_rnd_sloppy. In general, the inputs can be shorter or longer than the precision, or a mixture. A bit of logic is needed for this. It will even be optimal to zero-pad operands in some cases.

Note that it is also necessary to normalise the arf_t output by removing trailing zero limbs.

Feb 27 '24 16:02 fredrik-johansson

I think this work would be aided by having a general flint_mpn_mulhigh which computes the top n (+ 1) limbs of an m x p product, m + p >= n (+ 1).

Such a method would be a bit complex, but it would offload complexity from the arf/arb side.

Feb 28 '24 10:02 fredrik-johansson

In general, isn't it true that an arb_t is on the form $x = a 2^{m} \pm b 2^{n}$ with $a, b, m, n \in \mathbb{Z}$ and $|b 2^{n}| > 2^{m}$? Or am I thinking wrong that only calculating an balanced high multiplication for unbalanced multiplicants will not give a so-much-bigger error bound?

Feb 28 '24 10:02 albinahlback

mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function

ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);

which requires space for all an+bn limbs in x but is only responsible for writing [xlo,xhi). If the return is e, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.

May 25 '24 01:05 tthsqe12

mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function
ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);
which requires space for all an+bn limbs in x but is only responsible for writing [xlo,xhi). If the return is e, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.

Yep, that should improve the division and square root code quite a bit.

mulhigh is already doing good things for nfloat though. I think I want to iron out all the basic algorithms there before trying to improve arf/arb.

May 25 '24 11:05 fredrik-johansson

mulhi? I agree arbs would like this, but Newton's method likes mulmid more. I have been experimenting with the function
ulong mpn_mulmid_approx(ulong* x, ulong xlo, ulong xhi, const ulong* a, ulong an, const ulong* b, ulong bn);
which requires space for all an+bn limbs in x but is only responsible for writing [xlo,xhi). If the return is e, then the true middle limbs lie somewhere inclusively between the produced limbs and the produced limbs + e. Usually the return is 0.

Depends on what Newton's method you are talking about. I believe that precomputed inverses and reciprocal square roots via Newton's method really favors mulhi.

May 25 '24 12:05 albinahlback

mulhi is a special case of mulmid, and some of your mulhis for inversion don't need all of the high bits. It is even in the TODO.md:

`nmod_poly`

Implement fast mulmid and use to improve Newton iteration

fmpzs are just nmod_polys with carries, so the same principles apply.

May 29 '24 03:05 tthsqe12

You can see this in https://github.com/flintlib/flint/blob/80e9b245ca983bb40dc83c2a289302a50f02db16/src/gr_poly/inv_series_newton.c#L57 where the low m coefficients of the mullow output are never used.

(Of course high/low Newton divison are the same but reversed.)

May 29 '24 07:05 fredrik-johansson

For beyond 20M bits, I am seeing a ~30% improvement in arb_inv with this mulmid over what is currently in flint. (Still have to tune it in the medium range.)

Jul 16 '24 19:07 tthsqe12

For beyond 20M bits, I am seeing a ~30% improvement in arb_inv with this mulmid over what is currently in flint. (Still have to tune it in the medium range.)

That sounds excellent.

Of course, the current arb code is a placeholder implementation using arf arithmetic. For medium precision, having everything in mpn form should be a bit better.

Jul 20 '24 12:07 fredrik-johansson