
Index cache with byte offsets

Open • solardiz opened this issue 5 years ago • 4 comments

ProgPoW currently uses a source register's content modulo the cache size as the word index into the cache. This requires an (implicit) left shift by 2 in the GPU hardware to produce the byte offset. Such a left shift might or might not have a runtime performance cost, depending on the compiler and (micro-)architecture.

This (potential) runtime cost may be reliably avoided by applying a mask to the source register content such that the byte offset is extracted right away, without needing a further shift. This will change the computed hash values, but not other properties of ProgPoW (those values are supposed to be random anyway).
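To illustrate the net effect on the generated kernel source, here is a sketch of the per-lane cache load before and after the change (CUDA path shown; src stands for the selected mix register, and the byte-pointer cast is written inline rather than through the c_dag_uc temporary used in the patch below):

// before: word index; forming the byte address needs an implicit << 2
offset = src % PROGPOW_CACHE_WORDS;
data = c_dag[offset];

// after: the mask keeps only a 4-byte-aligned byte offset, used directly
offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2);
data = *(const uint32_t *)&((const unsigned char *)c_dag)[offset];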

Here are the changes I tested on top of my current (revised in other ways) ProgPoW tree:

+++ b/libprogpow/ProgPow.cpp
@@ -113,9 +113,13 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
   ret << "uint32_t offset, data;\n";
 
   if (kern == KERNEL_CUDA)
+  {
+      ret << "const unsigned char *c_dag_uc = (const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES-1);\n";
+  }
   else
   {
+      ret << "__local const unsigned char *c_dag_uc = (__local const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = get_local_id(0) & (PROGPOW_LANES-1);\n";
       ret << "const uint32_t group_id = get_local_id(0) / PROGPOW_LANES;\n";
   }
@@ -152,8 +157,15 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          if (kern == KERNEL_CUDA)
+          {
+              ret << "data = *(const uint32_t *)&c_dag_uc[offset];\n";
+          }
+          else
+          {
+              ret << "data = *(__local const uint32_t *)&c_dag_uc[offset];\n";
+          }
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

For me, this improves the hashrate on Vega 64 and GTX 1080 by about 1%, and on GTX Titan X Maxwell by about 2%. Yes, this is in my Maxwell-friendly tree. The speedups on the newer GPUs still need to be confirmed on top of the original Maxwell-unfriendly ProgPoW, which I haven't done yet (I am experimenting with more tweaks anyway), but I expect them to be about 1% as well (unless ProgPoW is already fully memory-bound there, in which case the would-be speedup could instead be spent on more random math, etc.)

Another way to implement this change is:

+++ libprogpow/ProgPow.cpp  2019-05-06 14:31:44.081259833 +0000
@@ -153,8 +157,8 @@
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          ret << "data = c_dag[offset >> 2];\n";
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

This is simpler in source code, but it relies on the compiler figuring out that the explicit right shift by 2 cancels out the array indexing's implicit left shift by 2. In my testing, it appears to provide a slightly smaller speedup than the lengthier patch above.
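Concretely, the equivalence the compiler has to see is plain address arithmetic: the mask (PROGPOW_CACHE_WORDS - 1) << 2 clears the two low bits, so the explicit >> 2 cancels against the * 4 implied by indexing a uint32_t array. A standalone sketch (assuming PROGPOW_CACHE_WORDS is a power of two; 4096 in ProgPoW 0.9.2, if I'm reading the parameters right):

#include <stdint.h>

#define PROGPOW_CACHE_WORDS 4096  /* assumed: 16 KB cache of 32-bit words */

/* Index form, as in the shorter patch: the >> 2 here is undone by the * 4
   the compiler applies when indexing a uint32_t array. */
static const uint32_t *load_addr_indexed(const uint32_t *c_dag, uint32_t src)
{
    uint32_t offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2);
    return &c_dag[offset >> 2];
}

/* Byte-offset form, as in the longer patch: no scaling left to do. */
static const uint32_t *load_addr_bytes(const uint32_t *c_dag, uint32_t src)
{
    uint32_t offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2);
    return (const uint32_t *)((const unsigned char *)c_dag + offset);
}

/* Both return the same address for any src: the mask clears the two low bits,
   so offset is a multiple of 4 and (offset >> 2) * 4 == offset. */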

Edits: fixed bugs in the first version of the patch, which were really nasty yet didn't significantly affect the observed speedups.

solardiz commented May 06 '19 14:05

@ifdefelse Would you like me to send you a PR implementing this? Perhaps as version 0.9.4? Should we get the parameter changes for 0.9.3 in first, with a separate commit? (I think so.)

solardiz commented Jun 19 '19 19:06

We're pretty against these micro-optimizations that make the code significantly harder to read and understand but don't necessarily help the saturation of the hardware. This is chasing hashrate - a meaningless metric - but not necessarily making the algorithm more "ASIC Resistant".

Are you able to do some performance analysis with the Nsight Compute profiler and CodeXL to look at the impact it has on the hardware? This would be a good first step in deciding whether it makes sense to include it.

ifdefelse commented Jun 20 '19 00:06

A higher hashrate from a specific design and code change, with everything else staying the same, isn't a meaningless metric: it implies a higher rate of computation of everything else, and thus greater utilization of all the other units the code uses. The index shift avoided by this change would have zero cost on an ASIC (just different wiring), but it does have a slight cost on a GPU.

I'll see if I find time to do some additional performance analysis with this change on top of otherwise pristine ProgPoW.

I agree this is minor, and I am playing with far more important tweaks, such as cache writes and dropping the too-cheap math operations in favor of more MULs. But I also felt it's the most obvious and least invasive change, even though it does complicate the source code.

solardiz commented Jun 20 '19 09:06

I've just tested these changes on top of the current pristine ProgPoW 0.9.2 (as in this repo) at block 7M. They still provide a 1% speedup on Vega 64 (~22.55M to ~22.80M), but not on GTX 1080 (unlike what I saw when I had them on top of my other changes). Maybe we're already fully bumping into the memory bandwidth limit on GTX 1080.

Reviewing the PTX asm, I see the code changes as follows:

-       mul.wide.u32    %rd37, %r997, 4;
-       and.b64         %rd38, %rd37, 16380;
-       add.s64         %rd39, %rd29, %rd38;
-       ld.shared.u32   %r793, [%rd39];
+       and.b32         %r795, %r1008, 16380;
+       cvt.u64.u32     %rd35, %r795;
+       add.s64         %rd36, %rd29, %rd35;
+       ld.shared.u32   %r796, [%rd36];

So at the PTX asm level, previously a widening MUL was used and it resulted in an extension to 64-bit. With these changes, the MUL is gone (as expected), but a separate instruction is needed for the conversion from 32-bit to 64-bit. This says nothing about whether an equivalent difference persists in the native ISA code, which is very different. If this does free up a MUL, it may allow us to increase the frequency of MULs in the random math/merge somewhat further.
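In rough C terms, my reading of the two sequences is as follows (a sketch, not compiler output; src is the 32-bit source register, c_dag_base is the byte address of the shared cache held in %rd29, and 16380 == (PROGPOW_CACHE_WORDS - 1) << 2):

#include <stdint.h>

/* before: mul.wide.u32 + and.b64 + add.s64 + ld.shared.u32 */
uint32_t load_old(const unsigned char *c_dag_base, uint32_t src)
{
    uint64_t off = ((uint64_t)src * 4) & 16380;  /* widening multiply forces 64-bit masking */
    return *(const uint32_t *)(c_dag_base + off);
}

/* after: and.b32 + cvt.u64.u32 + add.s64 + ld.shared.u32 */
uint32_t load_new(const unsigned char *c_dag_base, uint32_t src)
{
    uint64_t off = (uint64_t)(src & 16380);      /* mask stays 32-bit; only a zero-extension remains */
    return *(const uint32_t *)(c_dag_base + off);
}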

solardiz commented Jun 22 '19 17:06