Wouldn't i >= x.used be better?
https://github.com/libtom/libtommath/blob/04e9d1e7a0493910b2eb5e757d623870692ada04/s_mp_div_school.c#L68
Could you elaborate a bit further?
I mean, in this loop the index i may sometimes be equal to x.used, but usually i should be less than x.used (because if i == x.used, it may be out of the range of dp). But it may also be the author's intention. Just a little advice. :D
Sorry for the...ahem...short delay but here I am.
It is possible because the code relies on x.alloc > x.used and a zeroed "overhang".
The division fails in some circumstances if we change the check to i >= x.used.
Example (all examples with an MP-16BIT lib!): for 0x3FFFFFFFFFFFFFFFFFFFFFF / 0x1FFFFFFFFFFE the correct result is quotient = 0x200000000002 and remainder = 0x3, but s_mp_div_school gives quotient = 0x200000000000 and remainder = 0x3FFFFFFFFFFF.
0x3FFFFFFFFFFF / 0x1FFFFFFFFFFE is 2 with remainder 3, which shows that we ended the loop too early.
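For reference, here is roughly the shape of the loop in question, paraphrased and trimmed (not verbatim, see the link above for the real code in s_mp_div_school.c):

```c
/* step 3: for i from n down to (t + 1) -- paraphrased, trimmed sketch */
for (i = n; i >= (t + 1); i--) {
   if (i > x.used) {            /* the condition under discussion */
      continue;
   }
   /* when i == x.used the next line reads x.dp[x.used]: one digit
      above x.used, but still below x.alloc and guaranteed zero */
   if (x.dp[i] == y.dp[t]) {
      q.dp[(i - t) - 1] = ((mp_digit)1 << MP_DIGIT_BIT) - (mp_digit)1;
   } else {
      mp_word tmp = ((mp_word)x.dp[i] << MP_DIGIT_BIT) | (mp_word)x.dp[i - 1];
      tmp /= (mp_word)y.dp[t];
      if (tmp > (mp_word)MP_MASK) {
         tmp = (mp_word)MP_MASK;
      }
      q.dp[(i - t) - 1] = (mp_digit)tmp;
   }
   /* ... quotient-digit correction and multiply-subtract steps ... */
}
```

With i >= x.used the iteration for i == x.used gets skipped, so the top quotient digit is never computed; that is what the example above shows.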
The reliance on x.alloc > x.used and a zeroed "overhang" can lead to surprises: if somebody needs to be stingy with heap memory and can accept the cost in speed, they might come up with the idea of adding mp_shrink at the end of some functions, because it is quick and simple (and, in this case, wrong).
If you add mp_shrink at the end of s_mp_sub, for example, and use a dynamic checker (e.g. -fsanitize=address with clang), it will now get triggered (try 0x3FFFFFFFFFFFFFFFFFFFFFF / 0x1FFFFFFF8001); see the sketch below.
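To make that concrete, a hypothetical sketch (assuming the output parameter is named c, as in s_mp_sub's signature): appending the following at the end of s_mp_sub is enough to trip the sanitizer in the example above, because it frees exactly the zeroed overhang that s_mp_div_school silently relies on.

```c
/* hypothetical addition at the end of s_mp_sub: shrink the result's
   allocation down to its used digits -- thrifty, but it removes the
   zeroed overhang that s_mp_div_school reads one digit into */
if ((err = mp_shrink(c)) != MP_OKAY) {
   return err;
}
```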
- Is it a bug? Not really--it works and gives the correct results.
- Is it bad style? That lies quite a lot in the eye of the beholder, but it has a bit of a haut goût.
- Should it be changed? That is a very good question. It is one of the old tried and tested algorithms; you don't change it just because you don't like the color of it, it's not a handbag.
- Should it get some documentation, even if it is just a warning about the dependencies in a short comment in s_mp_div_school.c? Yes, let's be nice to our future developers. For a change. A possible wording is sketched below.
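Something as small as this would already do (wording is mine):

```c
/* WARNING: the quotient loop below intentionally reads x.dp[x.used],
   one digit above x.used. This is only safe because x.alloc > x.used
   holds here and the digits between used and alloc are kept zeroed.
   Do not shrink x (e.g. with mp_shrink) before this point. */
```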
Thanks for answering! Very detailed! Another question: are you the author of "BigNum Math"? 😘
Can it be modified like this: check whether x.used is less than x.alloc before executing the loop; if x.used is equal to x.alloc, grow the allocation of x, so as to ensure that reading x.dp[x.used] will not access unknown memory.
Another question: are you the author of "BigNum Math"?
No, that would be Tom St Denis, who is also the original author of LibTomMath, this software. I don't know if he is still following the discussions here. And yes, he would be to blame for that construct ;-)
Can it be modified like this: check whether x.used is less than x.alloc before executing the loop; if x.used is equal to x.alloc, grow the allocation of x, so as to ensure that reading x.dp[x.used] will not access unknown memory.
Messing with (heap) memory is expensive and error-prone; I would just check for i == x.used and test y.dp[t] directly against a literal 0 in that case. But that, too, is not very...uhm...elegant. When I ported LTM to JavaScript a couple of years ago (and shortly after that they implemented BigNumber in JS, yeah! ;-) ) I used D. Knuth's Algorithm D (in volume 2 of TAoCP) instead.
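The check I have in mind is roughly this minimal, hypothetical sketch (not tested against the full function):

```c
/* hypothetical sketch: treat the digit above x.used as an explicit
   zero instead of relying on a zero-filled overhang */
mp_digit xi = (i == x.used) ? (mp_digit)0 : x.dp[i];
/* ... then use xi wherever the loop body reads x.dp[i]; with
   xi == 0 the equality test becomes y.dp[t] == 0 ... */
```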
But changing the whole algorithm to make it more "elegant"? I don't know.
There is a risk that "AI" (formerly known as "genetic programming", formerly known as "So many rules! Why don't we let the computer find them all? We just need a good name for it!") gets into the compilers, wreaking havoc. But if that happens, all hope is already lost, so we don't have to take care of it now ;-)
I was interested in whether there would be any difference at all if we replaced the current algorithm with Knuth's Algorithm D, so I slapped something together and put it in a GIST. It was fun to do, but I found no differences outside the error bars. And it depends on standards-compliant compilers; I don't know if it would compile with older ones.
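The heart of it is the quotient-digit estimate, step D3 in TAoCP vol. 2, 4.3.1. A self-contained sketch of that step for 16-bit digits, with names of my own choosing (the GIST differs in detail):

```c
#include <stdint.h>

/* Knuth's Algorithm D, step D3: estimate the quotient digit qhat for
   position j and correct it so it is at most one too large before the
   multiply-and-subtract step. u is the normalized dividend, v the
   normalized divisor with n >= 2 digits and v[n-1] >= b/2. */
static uint32_t estimate_qhat(const uint16_t *u, const uint16_t *v,
                              int n, int j)
{
   const uint32_t b = 1uL << 16;  /* digit base */
   uint32_t num  = ((uint32_t)u[j + n] << 16) | u[j + n - 1];
   uint32_t qhat = num / v[n - 1];
   uint32_t rhat = num % v[n - 1];

   /* decrement qhat while it is provably too big (at most twice) */
   while ((qhat >= b) ||
          ((uint64_t)qhat * v[n - 2] >
           (((uint64_t)rhat << 16) + u[j + n - 2]))) {
      qhat--;
      rhat += v[n - 1];
      if (rhat >= b) {
         break;   /* rhat no longer fits a digit, tests are done */
      }
   }
   return qhat;
}
```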
I ran the program you wrote on my computer, and the results show that Knuth's algorithm seems to be a little faster than SCHOOL on my machine, 4.66% faster on average. Here are my test data.
Number of divisions: 319056
And I let OLOOP = ILOOP = 68 because 100 is too big.
| No. | KNUTH (sec) | SCHOOL (sec) | KNUTH faster than SCHOOL |
|---|---|---|---|
| 1 | 56.689 | 58.263 | 2.70% |
| 2 | 56.452 | 57.632 | 2.05% |
| 3 | 56.637 | 62.762 | 9.76% |
| 4 | 55.945 | 58.468 | 4.32% |
| 5 | 54.774 | 58.344 | 6.12% |
| 6 | 55.213 | 59.941 | 7.89% |
| 7 | 55.635 | 58.466 | 4.84% |
| 8 | 55.612 | 58.038 | 4.18% |
| 9 | 54.757 | 58.323 | 6.11% |
| 10 | 55.395 | 57.816 | 4.19% |
| 11 | 55.817 | 60.319 | 7.46% |
| 12 | 55.699 | 57.353 | 2.88% |
| 13 | 56.060 | 58.025 | 3.39% |
| 14 | 55.234 | 57.736 | 4.33% |
| 15 | 55.070 | 57.646 | 4.47% |
| 16 | 55.178 | 57.758 | 4.47% |
| 17 | 55.940 | 58.782 | 4.83% |
| 18 | 55.497 | 57.445 | 3.39% |
| 19 | 55.924 | 57.935 | 3.47% |
| 20 | 54.938 | 58.134 | 5.50% |
| 21 | 54.832 | 57.203 | 4.14% |
| 22 | 54.962 | 57.616 | 4.61% |
| 23 | 54.934 | 57.363 | 4.23% |
| 24 | 54.686 | 57.062 | 4.16% |
| 25 | 54.858 | 58.903 | 6.87% |
| 26 | 55.634 | 57.678 | 3.54% |
| 27 | 56.702 | 57.661 | 1.66% |
| 28 | 54.839 | 57.424 | 4.50% |
| 29 | 55.521 | 57.728 | 3.82% |
| 30 | 55.093 | 58.532 | 5.88% |
| Average | 55.484 | 58.212 | 4.66% |
If your algorithm has no problem like the x.dp[x.used] access, I think the algorithm you implemented is better than SCHOOL.
4.66% faster on average.
That's inside the error bars, and it will probably decrease further once the Algorithm-D code is made ready for production. It assumes, among other things, that the output pointers are not NULL, so we need to check for that and allocate temporary memory if necessary. I produce complements by overflowing an "unsigned int", and we would need to check whether all supported compilers handle that method properly, or implement it manually. Last but not least: it is a quick hack to get it into a single file; who knows what other abominations hide in there ;-)
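What I mean by producing complements via unsigned overflow, as a standalone illustration (this exact snippet is mine, not from the GIST): unsigned arithmetic in C is defined modulo 2^N, so 0u - x yields the two's complement of x without any undefined behaviour.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
   /* unsigned arithmetic wraps modulo 2^N by definition (C11 6.2.5p9),
      so the two's complement of x is simply 0 - x */
   uint16_t x = 0x0003;
   uint16_t c = (uint16_t)(0u - x);   /* 0xFFFD == 2^16 - 3 */
   printf("complement of 0x%04X is 0x%04X\n", (unsigned)x, (unsigned)c);
   return 0;
}
```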
I will prepare a PR this weekend with the simple check I proposed above. Even if it is just to save the AI data gatherers from scraping up code that is dangerous when taken out of context, for it might end up in software you do not want it to end up in.
But I do have a question:
And I let OLOOP = ILOOP = 68 because 100 is too big.
Why is it too big? Any specific problems besides runtime?
Why is it too big? Any specific problems besides runtime?
No, there are no other problems. I changed OLOOP and ILOOP to 68 just because it takes a very long time when OLOOP and ILOOP are 100. 🤣
Here is the console output when OLOOP and ILOOP are 100.
KNUTH
SCHOOL
Number of divisions: 1010000
Knuth: 368 sec 400 ms 282 usec 200 nsec
School: 375 sec 887 ms 288 usec 700 nsec
Success
Each test takes about 10 minutes. It would cost me 5 hours to test 30 times. 😪
And I thought I had an old and slow CPU (AMD A8-6600K from 2018) ;-)
No, I thought it would be an error of some kind if you tried more.
And I found an error in the published code (updated). Quite a grave one, sorry.
I tried to replace LTM's s_mp_div_school() with that code, but it suddenly doesn't work anymore.
Well, something to do for another day :-)
Ah, I nearly forgot: I put the mentioned little check in a PR: #540
Well, something to do for another day :-)
"another day" means 10 months later. (JK) 😁 The year before last, I accidentally found the book "BigNum Math" more than ten years ago. When I went from https://www.libtom.net/LibTomMath/ to this place, I was surprised to find this decade-old project is still being maintained today. Very glad to see work on libtommath is still going on.
"another day" means 10 months later.
Not this time! (But in a lot of these cases, admittedly ;-)
Updated the GIST, this time with OpenMP support! Only two and a half hours instead of five! (The innermost loop could also get its own threads, but I doubt it would bring the runtime down to reasonable times.)
Made a branch in my own fork of LTM.
Made the benchmark completely multithreaded, with a little option parser instead of preprocessor abuse, and added some statistics and a kitchen sink.
Try time ./check_div_school_against_div_bit -n 1000 -O 5 -I 1000 -H 200 -d for a start. It finishes in under 2 minutes here and dumps a lot of data.
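For the curious, the parallelization pattern is nothing fancy; a minimal sketch of the idea (assumed shape, not the actual code from the branch):

```c
/* minimal OpenMP sketch of the benchmark's outer loop: every iteration
   owns its mp_int scratch variables, so only the loop index is shared;
   compile with e.g. cc -fopenmp ... -ltommath */
#include <omp.h>
#include "tommath.h"

void bench(int oloop)
{
   int i;
   #pragma omp parallel for schedule(dynamic)
   for (i = 0; i < oloop; i++) {
      mp_int a, b, q, r;
      if (mp_init_multi(&a, &b, &q, &r, NULL) != MP_OKAY) {
         continue;   /* skip this round if allocation failed */
      }
      /* ... fill a and b with random operands, then divide ... */
      (void)mp_div(&a, &b, &q, &r);
      mp_clear_multi(&a, &b, &q, &r, NULL);
   }
}
```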