libtommath
Faster mp_toradix?
For background, I started looking at this when a profile of a Perl 6 program that calculated Ackermann numbers showed that 90% of the time was spent stringifying a big integer to print the result. I was using Rakudo with the MoarVM backend, which uses libtommath under the hood (the stringification was essentially just a call to mp_toradix, https://github.com/MoarVM/MoarVM/blob/master/src/math/bigintops.c#L1023). I looked at the implementation of mp_toradix and wondered if there was a faster algorithm. I started searching and found this thread: http://code.activestate.com/lists/tcl-core/13692/ (the link doesn't seem to work anymore, but it's also available here: https://sourceforge.net/p/tcl/mailman/message/31972670/). This is my rough proof-of-concept transliteration into C using libtommath functions: https://gist.github.com/MasterDuke17/956999b54093258f4a7e616a00d95ac2 (it wasn't until after I finished that I noticed some Barrett-reduction-related functions in libtommath). It is significantly faster: on my machine, mp_toradix of 2**500000 takes ~13s, but Barrett_todigits takes ~0.3s. I haven't extensively benchmarked small and large values to see whether using Barrett only makes sense above a certain size, but it definitely seems worthwhile for large values. Would there be any interest in using this in libtommath, either as the implementation of mp_toradix or as an alternative function?
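For reference, the quadratic behaviour being complained about is easy to see in a sketch. This is an illustrative Python version of the digit-at-a-time approach (not libtommath code; Python's big ints stand in for mp_int): every division walks the entire remaining number, so producing d digits costs O(d^2) word operations.

```python
def naive_toradix(n, base=10):
    """Digit-at-a-time conversion, the same shape as mp_toradix:
    every divmod touches all remaining words, so d digits cost O(d^2)."""
    assert n >= 0 and 2 <= base <= 36
    if n == 0:
        return "0"
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = []
    while n:
        n, r = divmod(n, base)   # one full-width division per output digit
        out.append(digits[r])
    return "".join(reversed(out))
```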
Yes, the mp_toradix function is really slow, that's for sure! ;-)
We are doing a lot of refactoring now to bring LibTomMath up to date (and to get rid of a lot of legacy stuff that has definitely become obsolete by now). Speeding things up is next (some things are already done, e.g. a faster "fast multiplication", a usable nth root, and others). I don't know how the rest think about it, but a simple divide&conquer algorithm for to_radix would do (Schönhage published it first, I think?), though it needs fast division. All of that is already written (I have had it in my own version for a couple of years now, i.e. tried and tested); it just needs a "port" to the current LTM version.
Advantage of that D&C algorithm: it is very simple and starts to be faster at low values (I have the cut-off at 600 bits).
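A minimal sketch of that divide&conquer conversion, in Python for brevity (the function names, the exact digit-count estimate, and the placement of the 600-bit cutoff are illustrative assumptions, not the LTM port): split the number with one big division by base^(k/2), recurse on both halves, and pad the low half with leading zeros.

```python
import math

def dc_toradix(n, base=10, cutoff_bits=600):
    """Schönhage-style divide & conquer to-radix sketch."""
    assert n >= 0 and 2 <= base <= 36
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"

    def simple(m):
        # classic digit-at-a-time loop, used below the cutoff
        out = []
        while m:
            m, r = divmod(m, base)
            out.append(digits[r])
        return "".join(reversed(out)) or "0"

    def rec(m, k):
        # render m as exactly k digits, keeping leading zeros
        if k <= 1 or m.bit_length() <= cutoff_bits:
            return simple(m).rjust(k, "0")
        kh = k // 2                        # the low half gets kh digits
        hi, lo = divmod(m, base ** kh)     # one big division per level
        return rec(hi, k - kh) + rec(lo, kh)

    # a slight overestimate of the digit count is harmless: the final
    # result is stripped of leading zeros below
    k = max(1, int(n.bit_length() / math.log2(base)) + 2)
    return rec(n, k).lstrip("0") or "0"
```

In a real C port the powers base^(k/2) would be cached and the division done with a fast (e.g. Newton or Barrett) algorithm; that is where the speedup over repeated single-digit division comes from.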
Another problem: mp_radix_size is O(n^2) instead of O(1). Look at the current code and you'll see why ;-)
A replacement has been written (table based, i.e. O(1)) and fully tested; it just needs to be "ported".
It has one difference, though: it can overestimate by up to one digit (three digits with MP_8BIT), just like the GMP equivalent, whereas the original was always exact. But there is mp_ilogb now, which can be used to give an exact result in roughly O(log n).
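A sketch of both pieces in Python (the names, the Q16 fixed-point scale, and the small table are illustrative assumptions; the actual replacement is a table-based C implementation): the O(1) bound multiplies the bit length by a precomputed fixed-point value of 1/log2(base), and an mp_ilogb-style correction then makes it exact with at most a couple of comparisons.

```python
# ceil(2**16 / log2(base)) for a few bases; a full table would cover 2..64
INV_LOG2_Q16 = {2: 65536, 8: 21846, 10: 19729, 16: 16384}

def radix_size_upper(n, base):
    """O(1) upper bound on the digit count; may overestimate slightly
    because the table entry rounds 1/log2(base) up."""
    if n == 0:
        return 1
    return (n.bit_length() * INV_LOG2_Q16[base] >> 16) + 1

def radix_size_exact(n, base):
    """Exact digit count: start from the bound and correct downward,
    the way an integer log (mp_ilogb-style) check would."""
    size = radix_size_upper(n, base)
    while size > 1 and base ** (size - 1) > n:
        size -= 1
    return size
```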
So why, for Knuth's sake, isn't it already done, you rightfully ask?
Because the head maintainer and a lot of other people here, myself included, are currently drowning in work. The kind of work that pays the rent, which understandably comes first, sorry. I don't know when it will get better and cannot speak for the others, but for me it is the whole of August at least.
But after a closer look: it is interesting; the single problem seems to be the high cutoff point. Mmh… Why don't you flesh out your sketch and make an official pull request? That would make it much easier to test and discuss.
Cool, glad to see work on libtommath is still going on. Let me caveat this upfront by saying I'm not a number theorist, not even any kind of mathematician, and only just barely a C programmer. I've been looking at the *_reduce_* functions and still haven't figured out how to incorporate them into my proof of concept. But... I'll clean up my existing code a little, do some benchmarks on different sizes of values, and submit a PR.
I'm using libtommath 1.1.0 and also noticed that mp_toradix is slow. I took the existing implementation and managed to speed it up 3 to 4 times when the radix is 10, but it is still considerably slower (up to 50 times) than Python (specifically when representing 2<<500000 and larger shifts). I'm going to look into the other solution presented here. I looked at the convert-to-BCD algorithms that are popular, but I don't see how they could be fast. Any help would be welcome, because such a large difference in speed suggests a totally different approach is needed.
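One common way to get that kind of constant-factor speedup for radix 10 (I'm guessing at the change; the post above doesn't say what was done) is to pull several digits out per big division by dividing by a word-sized power of the base, then splitting the small remainder cheaply. A Python sketch of the idea:

```python
def chunked_toradix(n, base=10, chunk=9):
    """Extract `chunk` digits per big division: divide by base**chunk
    (10**9 still fits a 32-bit word), then split the small remainder
    with cheap word-sized divisions. Still O(d^2) overall, but with
    roughly `chunk` times fewer big divisions than one digit at a time."""
    assert n >= 0 and 2 <= base <= 36
    if n == 0:
        return "0"
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    big = base ** chunk
    pieces = []
    while n:
        n, r = divmod(n, big)          # one big division per `chunk` digits
        s = []
        for _ in range(chunk):         # cheap small divisions on r
            r, d = divmod(r, base)
            s.append(digits[d])
        pieces.append("".join(reversed(s)))
    pieces.reverse()
    # the top piece keeps its value's leading zeros padded; strip them
    return pieces[0].lstrip("0") + "".join(pieces[1:])
```

The asymptotic behaviour is unchanged, which matches the observation above: a chunked loop buys a constant factor, while the divide&conquer approach discussed earlier in the thread changes the complexity itself.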