libtommath icon indicating copy to clipboard operation
libtommath copied to clipboard

should i>=x.used better?

Open TobyFIenderson opened this issue 3 years ago • 2 comments

https://github.com/libtom/libtommath/blob/04e9d1e7a0493910b2eb5e757d623870692ada04/s_mp_div_school.c#L68

TobyFIenderson avatar Mar 07 '22 14:03 TobyFIenderson

Could you elaborate a bit further?

czurnieden avatar Mar 07 '22 16:03 czurnieden

I mean, in this loop, sometimes the index 'i' may be equal to x.used, but usually, index 'i' should be less than x.used(Because if i==x.used it may be out of the range of dp). But it also may be the author's intention. Just a little advice. :D

TobyFIenderson avatar Mar 08 '22 03:03 TobyFIenderson

Sorry for the...ahem...short delay but here I am.

It is possible because it can depend on x.alloc > x.used and a zeroed "overhang".

This fails in some circumstances if we use i >= x.used. Example (all examples with a MP-16BIT lib!): with 0x3FFFFFFFFFFFFFFFFFFFFFF / 0x1FFFFFFFFFFE the correct result is quotient = 0x200000000002 and remainder = 0x3 but s_mp_div_school gives quotient = 0x200000000000 and remainder = '0x3FFFFFFFFFFF'. 0x3FFFFFFFFFFF / 0x1FFFFFFFFFFE is 2 with remainder 3, which shows that we ended the loop too early.

The reliance on x.alloc > x.used and a zeroed "overhang" can result in surprises: if somebody must get stingy with heap and can eat the cost in speed reduction they might come up with the idea to add mp_shrink at the end of some functions because it is quick and simple (and, in this case, wrong). If you add mp_shrink at the end of s_mp_sub for example and use a dynamic checker (e.g.: -fsanitize=address with clang) it will get triggered now (try 0x3FFFFFFFFFFFFFFFFFFFFFF/0x1FFFFFFF8001).

  • Is it a bug? Not really--it works and gives the correct results.
  • Is it bad style? That lies quite a lot in the eye of the beholder but it has a bit of an haut goût.
  • Should it be changed? That is a very good question. It is one of the old tried and tested algorithms, you don't change it just because you don't like the color of it, it's not a handbag.
  • Should it get some documentation, even if it is just a warning about the dependencies in a short comment in s_mp_div_school.c? Yes, let's be nice to our future developers. For a change.

czurnieden avatar Jan 07 '23 05:01 czurnieden

Sorry for the...ahem...short delay but here I am.

It is possible because it can depend on x.alloc > x.used and a zeroed "overhang".

This fails in some circumstances if we use i >= x.used. Example (all examples with a MP-16BIT lib!): with 0x3FFFFFFFFFFFFFFFFFFFFFF / 0x1FFFFFFFFFFE the correct result is quotient = 0x200000000002 and remainder = 0x3 but s_mp_div_school gives quotient = 0x200000000000 and remainder = '0x3FFFFFFFFFFF'. 0x3FFFFFFFFFFF / 0x1FFFFFFFFFFE is 2 with remainder 3, which shows that we ended the loop too early.

The reliance on x.alloc > x.used and a zeroed "overhang" can result in surprises: if somebody must get stingy with heap and can eat the cost in speed reduction they might come up with the idea to add mp_shrink at the end of some functions because it is quick and simple (and, in this case, wrong). If you add mp_shrink at the end of s_mp_sub for example and use a dynamic checker (e.g.: -fsanitize=address with clang) it will get triggered now (try 0x3FFFFFFFFFFFFFFFFFFFFFF/0x1FFFFFFF8001).

  • Is it a bug? Not really--it works and gives the correct results.
  • Is it bad style? That lies quite a lot in the eye of the beholder but it has a bit of an haut goût.
  • Should it be changed? That is a very good question. It is one of the old tried and tested algorithms, you don't change it just because you don't like the color of it, it's not a handbag.
  • Should it get some documentation, even if it is just a warning about the dependencies in a short comment in s_mp_div_school.c? Yes, let's be nice to our future developers. For a change.

Thanks for answering! very detailed! another question : are you the author of "BigNum Math"?😘

TobyFIenderson avatar Jan 07 '23 06:01 TobyFIenderson

Can it be modified like this: check whether x.used is less than x.alloc before executing the loop, if x.used is equal to x.alloc, grow the memory usage of x, so as to ensure that x.dp[x.used] will not access unknown memory.

TobyFIenderson avatar Jan 07 '23 13:01 TobyFIenderson

another question : are you the author of "BigNum Math"?

No, that would be Tom St Denis who is also the original author of LibTomMath, this software. I don't know if he is still following the discussions here. And yes, he would be to blame for that construct ;-)

Can it be modified like this: check whether x.used is less than x.alloc before executing the loop, if x.used is equal to x.alloc, grow the memory usage of x, so as to ensure that x.dp[x.used] will not access unknown memory.

Messing with (heap) memory is expensive and error-prone, I would just check for i == x.used and test y[t] directly against a literal 0 in that case. But that, too, is not very...uhm,,,elegant. When I ported LTM to JavaScript a couple of years ago (And shortly after that they implemented BigNumber in JS, Yeah! ;-) ) I used D. Knuth's algorithm D (in volume 2 of TAoCP) instead.

But changing the whole algorithm to make it more "elegant"? I don't know.

There is a risk that "KI" (formerly known as "genetic programming" formerly known as "So many rules! Why don't we let the computer find them all? We just need a good name for it!") gets into the compilers, wreaking havoc. But if that happens, all hope is already lost so we don't have to take care about it now ;-)

czurnieden avatar Jan 07 '23 19:01 czurnieden

Was interested if there would be any difference at all if we replace the current algorithm with Knuth's algorithm D so I slapped something together and put it in a GIST. Was fun to do, but found no differences outside the error bars. And it depends on standard compliant compilers, don't know if it would compile with older ones.

czurnieden avatar Jan 14 '23 00:01 czurnieden

I ran the program you wrote on my computer, and the results of the program show that it seems that KNUTH's algorithm is a little bit faster than SCHOOL on my computer, 4.66% faster on average. Here are my test data.

Number of divisions: 319056
And I let OLOOP = ILOOP = 68 cause 100 is too big.

No. KNUTH (sec) SCHOOL (sec) KNUTH faster than SCHOOL
1 56.689 58.263 2.70%
2 56.452 57.632 2.05%
3 56.637 62.762 9.76%
4 55.945 58.468 4.32%
5 54.774 58.344 6.12%
6 55.213 59.941 7.89%
7 55.635 58.466 4.84%
8 55.612 58.038 4.18%
9 54.757 58.323 6.11%
10 55.395 57.816 4.19%
11 55.817 60.319 7.46%
12 55.699 57.353 2.88%
13 56.060 58.025 3.39%
14 55.234 57.736 4.33%
15 55.070 57.646 4.47%
16 55.178 57.758 4.47%
17 55.940 58.782 4.83%
18 55.497 57.445 3.39%
19 55.924 57.935 3.47%
20 54.938 58.134 5.50%
21 54.832 57.203 4.14%
22 54.962 57.616 4.61%
23 54.934 57.363 4.23%
24 54.686 57.062 4.16%
25 54.858 58.903 6.87%
26 55.634 57.678 3.54%
27 56.702 57.661 1.66%
28 54.839 57.424 4.50%
29 55.521 57.728 3.82%
30 55.093 58.532 5.88%
Average 55.484 58.212 4.66%

If there is no problem like x.dp[x.used] in your algorithm, I think the algorithm you implemented is better than SCHOOL.

TobyFIenderson avatar Jan 14 '23 05:01 TobyFIenderson

4.66% faster on average.

That's inside the error bars and will probably decrease further if the algo-D code would be made ready for production. It assumes, among other things, that the output pointers are not NULL, so we need to check for that and allocate tmp-memory if necessary; I produce complements by overflowing an "unsigned int" and we would need to check if all supported compilers support that method properly or implement it manually. Last but not least: it is a quick hack to get it into a single file, who knows what other abominations hide in there ;-)

I will prepare a PR this WE with the simple check I proposed above. Even if it is just to save the AI-data-gatherer from scraping up code that is dangerous if taken out of context, for it might end up in software you do not want it to end up.

But I do have a question:

And I let OLOOP = ILOOP = 68 cause 100 is too big.

Why is it too big? Any specific problems beside runtime?

czurnieden avatar Jan 14 '23 19:01 czurnieden

Why is it too big? Any specific problems beside runtime?

No, there are no other problems. I change OLOOP and ILOOP to 68 just because it will cost me very long time when OLOOP and ILOOP are 100. 🤣

Here is the output of console when OLOOP and ILOOP are 100.

KNUTH
SCHOOL
Number of divisions: 1010000
Knuth:  368 sec 400 ms 282 usec 200 nsec
School: 375 sec 887 ms 288 usec 700 nsec

Success

Each test consumes about 10 minutes. It will cost me 5 hours to test 30 times. 😪

TobyFIenderson avatar Jan 15 '23 06:01 TobyFIenderson

And I thought I have an old and slow CPU (AMD A8-6600K from 2018 ) ;-)

No, I thought it would be an error of some kind if you try more.

And I found an error in the published code (updated). Quite a grave one, sorry.

I tried to replace LTM'S s_mp_school() with that code, but it suddenly doesn't work anymore. Well, something to do for another day :-)

Ah, nearly forgot, I put the mentioned little check in a PR: #540

czurnieden avatar Jan 15 '23 06:01 czurnieden

Well, something to do for another day :-)

"another day" means 10 months later. (JK) 😁 The year before last, I accidentally found the book "BigNum Math" more than ten years ago. When I went from https://www.libtom.net/LibTomMath/ to this place, I was surprised to find this decade-old project is still being maintained today. Very glad to see work on libtommath is still going on.

TobyFIenderson avatar Jan 15 '23 07:01 TobyFIenderson

"another day" means 10 months later.

Not this time! (But in a lot of these cases, admitted ;-)

Updated GIST. This time with OpenMP support! Only two and a half hours instead of five! (The innermost loop could also get its own threads but I doubt it would bring it down to reasonable times)

Made a branch in my own fork of LTM.

czurnieden avatar Jan 15 '23 19:01 czurnieden

Made the benchmark completely multithreaded with a little option-parser instead of precompiler abuse and added some statistics, and a kitchensink. Try time ./check_div_school_against_div_bit -n 1000 -O 5 -I 1000 -H 200 -d for a start. Finishes in under 2 minutes here and dumps a lot of data..

czurnieden avatar Jan 29 '23 02:01 czurnieden