Kangaroo icon indicating copy to clipboard operation
Kangaroo copied to clipboard

Optimization

Open ghost opened this issue 5 years ago • 39 comments

I changed the starting points and limited the number of kangaroos. I am currently working on a CPU and I have received the following results:

CPU i3, 534.2K j/s, 64 bit key, 100 tests - mean runtime ~ 3 hours

I have to say how the tests were conducted so that there were no questions about this. Each time a key was found, a new random 64-bit key was created and the search algorithm was run again

Now I'm busy porting code to OpenCL, this is a big and complicated job for me.

I think it's still possible to optimize the key finding time by calculating the correct jump size for each kangaroo.

Any ideas?

ghost avatar May 08 '20 17:05 ghost

how did you do this: "Each time a key was found, a new random 64-bit key was created and the search algorithm was run again" the random part more specifically?

CatfishCrypt avatar May 08 '20 19:05 CatfishCrypt

the random part more specifically?

  1. Generate random number 2^64:2^65
  2. Create secp256k1 point with generated number
  3. Run kangaroo
  4. After key is find and valid - repeat

I did this to understand how much faster the algorithm works with the changes I made. The standard algorithm completes on average in 5 hours

ghost avatar May 08 '20 21:05 ghost

Interesting...do you have to manually load each new key and then run Kang or do you have this automated somehow?

CatfishCrypt avatar May 08 '20 23:05 CatfishCrypt

I have automated this. Everything is very simple there, I added a loop and deleted the command line arguments

ghost avatar May 09 '20 02:05 ghost

Simple for someone who knows their way away programming :) Do you feel you are finding keys faster via this way?

CatfishCrypt avatar May 09 '20 04:05 CatfishCrypt

Hi,

Please be more precise: You translated (+2^64) the start range of Tame or Wild ? You 100 keys are the same or uniformly distributed ? To make average it is important to have key uniformly distributed in the range.

You can speed up the search by spreading the wild to -N/8,N/8, even -N/16,N/16 CreateHerd() function in Kangaroo.cpp This optimize the overlap between tame and wild, and you can achieve up to 1.25sqrt(N) by compressing the wild but this has side effects when the key to search is near the border of the range and increase also collision between wild...

JeanLucPons avatar May 09 '20 04:05 JeanLucPons

Jean Luc,

Using the Div8 in CreateHerd?

CatfishCrypt avatar May 09 '20 05:05 CatfishCrypt

Yes, replace the code between in Kangaroo.cpp:574,579

by

    if((j + firstType) % 2 == TAME) {
      // Tame in [0..N]
      d[j].Rand(rangePower);
    } else {
      // Wild in [-N/8..N/8]
      d[j].Rand(rangePower-2);
      d[j].ModSubK1order(&rangeWidthDiv8);
    }

You win a lot for key around the center but keys on the border will longer to solve , in average you still win.

JeanLucPons avatar May 09 '20 06:05 JeanLucPons

Can it be put in lines 581:586? Lines 574:579 are grayed out; that's the symmetry code-grayed out.

CatfishCrypt avatar May 09 '20 07:05 CatfishCrypt

Current code:

#ifdef USE_SYMMETRY

		// Tame in [0..N/2]
		d[j].Rand(rangePower - 1);
		if ((j + firstType) % 2 == WILD) {
			// Wild in [-N/4..N/4]
			d[j].ModSubK1order(&rangeWidthDiv4);
		}

#else

		// Tame in [0..N]
		d[j].Rand(rangePower);
		if ((j + firstType) % 2 == WILD) {
			// Wild in [-N/2..N/2]
			d[j].ModSubK1order(&rangeWidthDiv2);
		}

#endif

CatfishCrypt avatar May 09 '20 07:05 CatfishCrypt

It is well in the second part 574:579, USE_SYMMETRY is not defined. If in your code the it is 581:586, you may have added line of code ?

https://github.com/JeanLucPons/Kangaroo/blob/ea1312161b4b490c6e7348a2baf5083c45f28255/Kangaroo.cpp#L574

  // Choose random starting distance
  if(lock) LOCK(ghMutex);

  for(uint64_t j = 0; j<nbKangaroo; j++) {

#ifdef USE_SYMMETRY

    // Tame in [0..N/2]
    d[j].Rand(rangePower - 1);
    if((j+ firstType) % 2 == WILD) {
      // Wild in [-N/4..N/4]
      d[j].ModSubK1order(&rangeWidthDiv4);
    }

#else

    if((j + firstType) % 2 == TAME) {
      // Tame in [0..N]
      d[j].Rand(rangePower);
    } else {
      // Wild in [-N/8..N/8]
      d[j].Rand(rangePower-2);
      d[j].ModSubK1order(&rangeWidthDiv8);
    }

#endif

    pk.push_back(d[j]);

  }

JeanLucPons avatar May 09 '20 07:05 JeanLucPons

Do you feel you are finding keys faster via this way?

Yes

ghost avatar May 09 '20 07:05 ghost

Hi @JeanLucPons

You 100 keys are the same or uniformly distributed ? You translated (+2^64) the start range of Tame or Wild ? You can speed up the search by spreading the wild to -N/8,N/8, even -N/16,N/16 This optimize the overlap between tame and wild, and you can achieve up to 1.25sqrt(N) by compressing the wild but this has side effects when the key to search is near the border of the range and increase also collision between wild...

I generated keys randomly, duplicates are excluded. I also created a file with the keys and ran the test again to make sure that there were no duplicate keys and the results were correct. The result shows a significant increase in the key finding speed.

I tested the keys and in a small range (2^32 - 2^63) in all ranges the speed of finding the key increases.

I use Four kangaroo method. The problem is that the more kangaroos you use, the slower the key will be found. Optimal use 4 kangaroos - 2T and 2W. You can find research on it. Now, using 4 kangaroos, we can parallelize their work by creating groups in which 4 kangaroos will work. If you want maximum speed, all herds (groups) should work on the same key, but at the same time have different initial parameters. I'm working on parallelization right now

ghost avatar May 09 '20 08:05 ghost

@AndrewBrz

when you say 4 kangaroos, you mean 4 herds of kangaroos? I've read something to that extent T1 T2, W1 W2. How are you implementing that?

CatfishCrypt avatar May 09 '20 08:05 CatfishCrypt

@CatfishCrypt 4 kangaroos it is 1 herd, T1, T2, W1 and W2

ghost avatar May 09 '20 08:05 ghost

Wow...Doesn't seem right, mathematically speaking. Huge space, with only 4 kangaroos versus thousands/millions.

CatfishCrypt avatar May 09 '20 09:05 CatfishCrypt

Ok

Make test on 40 or 48 bit range and large number of trials to get good average estimation. Note that the error in proportional to 1/sqrt(trials). Rather than giving a time comparison, give the ratio of sqrt(N), Standard method is 2.08sqrt(N) , for 4 kangaroos (non paralell) around 1.78sqrt(N).

Take also in consideration the GPU handle large number of kangaroos, there is some users with several GPU that use 2^24 or more kangaroos with large distinguished bit number on large range. The GPU performance is much depend on the access to global memory, so the less access to global memory you have, better it is.

I will work soon on a distributed version (client/server) where the server will handle DP and collision check.

JeanLucPons avatar May 09 '20 09:05 JeanLucPons

Wow...Doesn't seem right, mathematically speaking. Huge space, with only 4 kangaroos versus thousands/millions.

You can create many kangaroo herds, but all of them should work independently of each other, but you can try and add communication between them (I still have no idea how to do this without sacrificing speed). There should be 4 kangaroos in one herd. This is mathematically correct

ghost avatar May 09 '20 09:05 ghost

Make test on 40 or 48 bit range and large number of trials to get good average estimation. Note that the error in proportional to 1/sqrt(trials). Rather than giving a time comparison, give the ratio of sqrt(N), Standard method is 2.08sqrt(N) , for 4 kangaroos (non paralell) around 1.78sqrt(N).

I will do all the tests as I finish on the OpenCL version

ghost avatar May 09 '20 09:05 ghost

@JeanLucPons

The server version is exactly what I was thinking! That would be awesome...and I believe would reduce solution time by Eleventy Gabillionsqrt(N) :)

CatfishCrypt avatar May 09 '20 09:05 CatfishCrypt

Make test on 40 or 48 bit range and large number of trials to get good average estimation. Note that the error in proportional to 1/sqrt(trials). Rather than giving a time comparison, give the ratio of sqrt(N), Standard method is 2.08sqrt(N) , for 4 kangaroos (non paralell) around 1.78sqrt(N).

I will do all the tests as I finish on the OpenCL version

You write an OpenCL version for the standard method of for 4 kangaroo method ?

JeanLucPons avatar May 09 '20 12:05 JeanLucPons

I tested tested the Wild in [-N/8..N/8], 1000 trials, 40 bits search. 2.139 is the estimation of the expected average taking into account the DP overhead.

Original code: -t 2 -d 5 in40_1000.txt [999] 2^21.926 Dead:5 Avg:2^21.147 DeadAvg:1.4 (2.214sqrt(N) 2.139sqrt(N))

With the modification above: -t 2 -d 5 in40_1000.txt [999] 2^20.585 Dead:3 Avg:2^20.939 DeadAvg:2.6 (1.917sqrt(N) 2.139sqrt(N))

Gain: 15% 2 times more dead due to the compression of wild. This trick works only for parallel version with enough kangaroo.

JeanLucPons avatar May 09 '20 14:05 JeanLucPons

You write an OpenCL version for the standard method of for 4 kangaroo method ?

Standart, three method and 4 method. There are not big changes in the code, so you can add and remove kangaroos for tests.

ghost avatar May 09 '20 20:05 ghost

Gain: 15% 2 times more dead due to the compression of wild.

how much kangaroo was used T and W?

ghost avatar May 09 '20 20:05 ghost

@JeanLucPons Change this line bool Point::equals(Point &p) { return x.IsEqual(&p.x) && y.IsEqual(&p.y) && z.IsEqual(&p.z); } to bool Point::equals(Point &p) { return x.IsEqual(&p.x); }

This change will only check X coordinates. Another optimization is to calculate only the X coordinate, which allows us to accelerate the speed of calculations (there are research and you can find and read them) There are many more optimizations for jumps and starting points.

ghost avatar May 09 '20 20:05 ghost

What changes have you experimented with for starting and jump points? More interested in how you changed your start points, you change them to start other than around the center?

CatfishCrypt avatar May 09 '20 21:05 CatfishCrypt

Thanks Andrew for your job ;)

  • There was 1024T and 1024W.
  • Concerning the equals, I prefer to keep as it is, if you hit a symmetric point (even without using symmetry), the check with keyToSearch and keyToSearchNeg will fail and it can miss a point. If you want to make such an optimization, do it locally by using Int::IsEqual(). For instance, in the hashTable, this optimization is done locally and only a part of x is tested.
  • Concerning jumps, take care of keeping compatibility with workfiles. Of course depending on the used method they probably won't be compatible but it would be great if 2 users using the same method can share work files. There is a version field that can be used. I remarked that I missed the version check in the 1.4, F.ck !

JeanLucPons avatar May 10 '20 03:05 JeanLucPons

@AndrewBrz if you make AMD support please share with others

kpot87 avatar May 10 '20 19:05 kpot87

if you make AMD support please share with others

I think he will not be share. After your proposal, he does not respond.

DvR4 avatar May 13 '20 13:05 DvR4

That's too bad. I hope he will share his work.

JeanLucPons avatar May 13 '20 14:05 JeanLucPons