
ProgPow ASIC possibilities evaluated by 2 experts

MoneroCrusher opened this issue 6 years ago • 45 comments

https://github.com/SChernykh/CryptonightR/issues/1#issuecomment-452775046

In fact, a carefully designed ASIC could still outperform GPUs by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize a GPU's floating-point ALUs, tensor cores and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs, multi-bank RFs with a very simple decoder for better TLP. It is also possible to achieve chained operations with a reconfigurable ALU-array.

What's IfDefElse's take on this?

MoneroCrusher avatar Jan 19 '19 15:01 MoneroCrusher

No argument, but you understate the complexity of such an ASIC. What you are talking about is a set of integer math or DAG-accessing logic blocks with programmable sources and destinations. These blocks would need to be programmatically sequenced to form the necessary computational chain. When I say programmatically, I don't mean it in terms of software control, but rather in terms of digital logic block inputs.

ASICs achieve efficiency by maximizing logic block usage with a technique called pipelining. Imagine applying it to ProgPoW. Each stage of the pipe will need to carry its own copy of the mix array, the sequencer state and the multiple transform block inputs. Then of course you need to manage multiple stages competing for DAG access. An order of magnitude more complicated than an Ethash ASIC!

jean-m-cyr avatar Jan 19 '19 19:01 jean-m-cyr

To be able to fully assess any specialized chip manufacturing, we need to know not only potential power efficiency improvement, but all of the following:

  • Development Cost
  • Development Time (part of cost)
  • Manufacturing cost
  • Performance per manufacturing or purchase $

The analysis here, except for the DRAM statement (which I will leave to someone more knowledgeable than me), seems to be in line with the potential optimizations that IfDefElse have already outlined. We have seen different claims (10%, 20%, 50%, 800%) of potential efficiency improvement. What I have not seen from anyone is an analysis of the actual effort to build all of this.

salanki avatar Jan 19 '19 20:01 salanki

I had not looked into CryptonightR before. It looks promising but has some limitations. The basis is similar to ProgPoW in that a random sequence of math is generated every few minutes and compiled offline. This is unlike RandomX where every hash has a unique random sequence. RandomX is actually more ASIC-friendly because of the per-hash sequence, which requires online compilation that an ASIC could directly implement.

I think this comment is extremely good: it shows that simple changes like Ethash 1a, which just change a constant, will not be effective, as ASIC manufacturers can easily anticipate small tweaks like that and build flexibility into their designs: https://github.com/SChernykh/CryptonightR/issues/1#issuecomment-450584391 https://eips.ethereum.org/EIPS/eip-1355

The quoted comment is given in the context of CryptonightR, not ProgPoW. Cryptonight does tiny accesses to DRAM, which are exceptionally inefficient for existing GPU memory controllers. Cryptonight V2 is better after it increased the access from 16 bytes to 64 bytes, but it's still wasting memory bandwidth. ProgPoW increases Ethash's 128 byte access to 256 bytes in order to be efficient on all existing GPU memory controllers.

ProgPoW makes use of a GPU's shared (aka local) memory while CryptonightR does not. Neither CryptonightR nor ProgPoW makes use of floating point, tensor cores, L2 cache, graphics stuff (like textures), etc. However, these underutilized portions of the GPU should not have any significant effect on power consumption. An ASIC that removed them would have a smaller die size, but would not otherwise be meaningfully more efficient.

I don't think a chained ALU array is practical to implement for either ProgPoW or CryptonightR. This is especially the case for ProgPoW where the mix state is 16 * 32 * 32-bits and about 1/3rd of the instructions need to access random locations in the cache.

ifdefelse avatar Jan 21 '19 02:01 ifdefelse

RandomX is actually more ASIC-friendly because of the per-hash sequence, which requires online compilation that an ASIC could directly implement

JIT compilation of RandomX code takes less than 0.5% of the time (~10 μs per hash). A ProgPoW ASIC can get a bigger efficiency gain than this just by having a native 32x32 multiplier [1].

The advantage of "online" compilation is that it forces an ASIC to dedicate die area to instruction decoders and dispatchers (which the CPU already has) rather than using a small FPGA to handle the slowly changing part of the algorithm.

tevador avatar Jan 23 '19 08:01 tevador

@tevador The notion of 'using a small FPGA to handle the slowly changing part of the algorithm' is not a workable solution. Compiling a bit stream to program an FPGA is an hour-long process on a high-end system, even for the simplest algorithm. Furthermore, it requires the use of vendor-specific proprietary tools that could not run on individual miner platforms.

jean-m-cyr avatar Feb 02 '19 15:02 jean-m-cyr

@jean-m-cyr I'm no expert in this field, but are you suggesting it's impossible to design a custom ASIC with hardwired operation blocks (such as multiplication, addition and popcnt) and only program a small SRAM block that specifies how operations are ordered and what their inputs are?

Based on the description of the ProgPoW loop, the 'bit stream' for ProgPoW would be roughly 400 bits [1] long, which is many orders of magnitude less than a typical bit stream for a big FPGA chip like Xilinx VCU 1525 (several megabytes).

[1] 12 cache loads with a random source and destination and one of 4 merge methods = (32*31*4)^12 total options. 20 math operations with a random source, destination, one of 11 operations and one of 4 merge methods = (32*31*4*11)^20 total options. Estimated ~388 bits of entropy in total.

tevador avatar Feb 02 '19 18:02 tevador

400 bits sounds low given that the FPGA would need a RAM interface for DAG access (or at least a bus to the main ASIC) ... Nevertheless, how would you generate that bit stream on the fly? Using which tool set?

388 bits of entropy != 388 bit stream.

Not to mention that FPGAs are notoriously power inefficient.

jean-m-cyr avatar Feb 02 '19 19:02 jean-m-cyr

RAM interface for DAG access

RAM interface would be hardwired, not part of the FPGA.

Nevertheless, how would you generate that bit stream on the fly? Using which tool set?

Probably a custom toolset developed by the company that is designing the ASIC. Given the relative simplicity, it could run on the ARM controller which would precompile bit streams in advance.

388 bits of entropy != 388 bit stream

True, but I don't think the bit stream would need to be more than the order of ~1 kilobit.

tevador avatar Feb 02 '19 19:02 tevador

RAM interface would be hardwired, not part of the FPGA.

Even if the RAM controller is on the main ASIC, the FPGA still requires high-speed access to the DAG via the ASIC's memory controller. This implies a wide data bus that supports addressing between the FPGA and ASIC. Going off-chip with a high-speed bus is expensive in terms of power.

Probably a custom toolset developed by the company that is designing the ASIC. Given the relative simplicity, it could run on the ARM controller which would precompile bit streams in advance.

FPGA tools contain highly proprietary algorithms that require large amounts of RAM and processing power. I think you underestimate the complexity involved with generating an FPGA bit stream.

True, but I don't think the bit stream would need to be more than the order of ~1 kilobit.

I doubt it! In your entropy calculation you speak of 11 arithmetic ops and assign 4 bits of entropy. Each of those 11 blocks will require far more than 4 bits in the stream to specify and interconnect the required FPGA elements.

jean-m-cyr avatar Feb 02 '19 19:02 jean-m-cyr

This implies a wide data bus that supports addressing between the FPGA and ASIC.

'FPGA' and 'ASIC' here are not two separate chips. Everything is on a single die. Only the interconnects would use FPGA-style LUTs. The execution units themselves would be hardwired for maximum efficiency.

Each of those 11 blocks will require far more than 4 bits in the stream to specify and interconnect the required FPGA elements.

You have a pair of 32-bit datapaths, which can go into one of 11 execution units. I imagine you can use a multiplexer with 4 bits of SRAM to select the path to the execution engine. Am I wrong?

tevador avatar Feb 02 '19 19:02 tevador

are you suggesting it's impossible to design a custom ASIC with hardwired operation blocks (such as multiplication, addition and popcnt) and only program a small SRAM block that specifies how operations are ordered and what their inputs are?

You have a pair of 32-bit datapaths, which can go into one of 11 execution units. I imagine you can use a multiplexer with 4 bits of SRAM to select the path to the execution engine. Am I wrong?

You're essentially describing how classical CPUs, GPUs, DSPs, etc are designed. A handful of fixed function datapaths and then some muxes that route data between the register file, caches, and datapaths. Whether those muxes are controlled on-the-fly by instructions or semi-statically via a few bits of FPGA-style SRAM doesn't make a meaningful difference.

FPGAs gain significant efficiency when you can create a dataflow pipeline, where intermediate results are never stored but are directly consumed by the next step of the pipeline. This structure doesn't work for ProgPoW. ProgPoW requires a large amount of state (16 lanes * 32 words * 32 bits = 2 kilobytes) that needs to last across many cycles (at least 12 cycles for the 12 cache accesses). Setting up a pipeline to directly compute the random ProgPoW sequence would result in a massive duplication of compute resources (it would need all 11 options available at every compute stage, even though only 1 is used) and of mix data that mostly doesn't change between stages.

Once the mix data is stored in a register file any design is going to look a lot like a CPU's SIMD unit or a GPU.

ifdefelse avatar Feb 04 '19 20:02 ifdefelse

https://medium.com/@Linzhi/eip-1057-progpow-open-chip-design-for-only-1-cost-power-increase-eip-1057-progpow-d106d9baa6eb

```c
// Random math between two input values
uint32_t Math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0:  return a + b;
    case 1:  return a * b;
    case 2:  return mul_hi(a, b);
    case 3:  return min(a, b);
    case 4:  return ROTL32(a, b);
    case 5:  return ROTR32(a, b);
    case 6:  return a & b;
    case 7:  return a | b;
    case 8:  return a ^ b;
    case 9:  return clz(a) + clz(b);
    case 10: return popcount(a) + popcount(b);
    }
}
```
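The snippet above uses a few GPU-style intrinsics (`mul_hi`, `ROTL32`, `clz`, `popcount`, ...). For reference, portable C equivalents might look like this (a sketch; the actual ProgPoW reference code uses the OpenCL/CUDA builtins for these):

```c
#include <stdint.h>

/* High 32 bits of the 32x32 -> 64 bit product */
static uint32_t mul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

static uint32_t min(uint32_t a, uint32_t b) { return a < b ? a : b; }

static uint32_t ROTL32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

static uint32_t ROTR32(uint32_t x, uint32_t n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Count leading zeros; clz(0) = 32 */
static uint32_t clz(uint32_t x)
{
    uint32_t n = 0;
    while (n < 32 && !(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

/* Count set bits */
static uint32_t popcount(uint32_t x)
{
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}
```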

*) Modulo 11 operation: quite small logic, ~400 gates, 1-cycle latency; it can run in parallel with the mix read, so we can hide this latency.
*) 32-bit add: simple logic, ~300 gates for a fast one.
*) 32-bit multiplier: mature IP, ~20K gates for a fast one. Since the multiplier only has a ~4/11 activity rate, we can use a two-cycle multiplier to halve the area, with a small chance of increased delay.
*) Rotation maps easily onto a multiplier. For example, to calculate ROTL(0x12345678, 8), compute 0x12345678 * 0x00000100 = 0x0000001234567800, then OR the higher and lower words together to get 0x34567812. This costs only ~160 gates of extra logic.
*) Logic operations: A&B costs only 32 gates, A|B 32 gates, A^B 96 gates. They look like three different instructions but are actually extremely small on silicon (<30 µm²).
*) clz and popcount are also very small.
*) We only need a multiplexer to select the output.
*) Total size of Math() is about 0.0015 mm² on a TSMC16ULP process.
*) Merge() is similar but even smaller: only a shifter, an adder, and tiny logic (no multipliers, because a constant multiply can be mapped onto an adder).
*) Size of Merge() is roughly ~0.0005 mm².

4-stage pipeline:
CK1: Mix regfile read, two operands in parallel (calculate %11 at the same time)
CK2: Math()
CK3: Merge()
CK4: Mix write back to regfile

Task latency is 4 cycles, but with 4 independent threads we can keep this pipeline fully loaded. For the mix regfile we use a 1W2R (1 write port, 2 read ports) regfile: mature IP, 8 KB (for 4 threads), single-cycle read/write, 1 GHz operation. If we want to load/unload mix data on the fly without disturbing the pipeline, we can upgrade to a 12 KB 2W3R regfile. We then have extra read/write ports for load/unload, and 12 KB is enough for 6 tasks (4 running, 1 loading, 1 unloading). An ASIC can implement 10K sets of that block:

*) running at 1 GHz frequency
*) 0.55 V voltage (typical voltage for TSMC16ULP)
*) generating ~10T Math() + Merge() throughput per second
*) power estimate roughly 3 mW per pipeline, 30 W in total
*) no customized circuit/layout
*) all standard cells, automatic place and route
*) mature IP only
*) no aggressive overclocking
*) no aggressive undervolting
*) ~8.4K Merge() and ~4.8K Math() per hash
*) each pipeline includes 1 Merge() and 1 Math()
*) the pipeline is only the logic part, not the memory part
*) the xGB of memory are not included
*) 10T throughput on a single-chip ASIC, divided by 8.4K Merge() per hash, means 1.2 GHash performance
*) increases die cost/power by about 1%, and causes a die increase of <1 mm²

The design (logic part only) can provide 1.2 GHash of ProgPoW performance at 30 W. An Nvidia GTX 1070 Ti provides 15.7 MHash at 115 W, though that figure includes memory.

Sonia-Chen avatar Mar 29 '19 13:03 Sonia-Chen

@Sonia-Chen Interesting.

How many instances of the 8 KB (or 12 KB) register file would you need for "10K sets of that block"? Wouldn't that be significant die area that should be included in your estimates and conclusions? It doesn't appear to be included now.

The modulo operations in Math and Merge are compile-time. While you could be generating the inputs to them on the fly, I doubt you'd want to - you'd more likely have a pre-compiled ProgPoW program uploaded to your ASIC from host, like it's done when ProgPoW is run on GPUs.

solardiz avatar Mar 29 '19 15:03 solardiz

Thank you for this technical argument, greatly appreciated. It’s above my head, but on the first glance there are two things that stand out to me:

  • The kiss99 computations are missing.
  • You cannot compare the performance of a “PoW loosely based on ProgPoW without DAG access” to ProgPoW performance. The performance comparisons you state with a 1070Ti are completely invalid.

salanki avatar Mar 29 '19 15:03 salanki

@Sonia-Chen Also, even without register files, ignoring the external bandwidth needs, and blindly trusting your numbers, they don't add up: (0.0015+0.0005)*10000 = 20 mm^2, but you claim under 1 mm^2 for "10K sets of that block". You also seem to claim that 30W would be dissipated in that 1 mm^2, ouch. Am I misreading what you meant or is the rest of what you wrote just as unreliable? ;-(

I don't mean to defend ProgPoW here. I actually think ASICs will have significant advantage at it. But we need realistic analyses, and this one looks bogus at least at a high level.

solardiz avatar Mar 29 '19 15:03 solardiz

@Sonia-Chen We have checked the numbers and it seems they are off by a factor of 4.

There are ~33K calls to Merge() and ~18K calls to Math() per hash in ProgPoW 0.9.3.

So the 30W logic can do about 300 MH/s. However, this is still at least 20x higher compute efficiency than a GPU.

tevador avatar Mar 29 '19 15:03 tevador

Oh, maybe I get what was meant by "die increase of <1mm²" - this is probably scaled back from this ASIC's to GPU compute performance at this algorithm. A weird metric, but at least that would make some sense.

solardiz avatar Mar 29 '19 15:03 solardiz

Very concerning to say the least. If this is true then ProgPoW is instantly invalidated, and Ethereum is better off keeping Ethash and perhaps applying a small tweak every year to force ASICs to be more flexible, reducing the gap via increased cost on the ASIC side (until the switch to PoS).

MoneroCrusher avatar Mar 29 '19 15:03 MoneroCrusher

@MoneroCrusher I'm not sure you're reading this right. What it says is that the advantage of ProgPoW over Ethash is tiny, not that ProgPoW is weaker than Ethash wrt ASICs. Either needs lots of memory bandwidth. It's just that ProgPoW's added limited programmability and computation is probably relatively lightweight when implemented in an ASIC. Perhaps not to the extent of it adding only 1% as claimed here (if I understand this right, now), but nevertheless adding not a lot.

solardiz avatar Mar 29 '19 15:03 solardiz

@solardiz If this ASIC estimate is correct, then ProgPoW is actually worse for GPUs than for ASICs. For example, RX480 consumes +30W when mining ProgPoW compared to Ethash. ASIC will only use about +1 W for the same hashrate compared to Ethash.

tevador avatar Mar 29 '19 16:03 tevador

@solardiz @MoneroCrusher that's right, this is only about the advantage. We are not trying to trash ProgPoW, we are trying to study it. The post was in response to an earlier comment about pipelines, and we are focusing on the random program. What's hard to compare is the power consumption difference between ASIC logic and GPU logic. I will check on the mistakes and unclear descriptions pointed out by @solardiz @tevador and @salanki, though it needs a little time. Thanks a lot!

Sonia-Chen avatar Mar 29 '19 16:03 Sonia-Chen

@MoneroCrusher I'm not sure you're reading this right. What it says is that the advantage of ProgPoW over Ethash is tiny, not that ProgPoW is weaker than Ethash wrt ASICs. Either needs lots of memory bandwidth. It's just that ProgPoW's added limited programmability and computation is probably relatively lightweight when implemented in an ASIC. Perhaps not to the extent of it adding only 1% as claimed here (if I understand this right, now), but nevertheless adding not a lot.

That is exactly what it was trying to say. I want to look into the feedback wrt Regfile, 4x and 1% calculation though, this can all be improved. Thanks again!

Sonia-Chen avatar Mar 29 '19 16:03 Sonia-Chen

Yes, it adds little overhead to ASICs in comparison to GPUs, hence Ethash is actually better for GPU miners if this is all true.

MoneroCrusher avatar Mar 29 '19 16:03 MoneroCrusher

@tevador @MoneroCrusher You're right. If we factor in the relative power consumption, then yes, ProgPoW's programmability might not look attractive overall. ProgPoW's more efficient external bandwidth usage (with its 256 rather than 128 byte blocks) might fare better.

Thanks @Sonia-Chen. I'll stay tuned. I also have a suggestion: rather than mention virtual/unrealistic compute-only hashrates, pick an external memory bandwidth that is realistic to have in an ASIC and estimate that realistic ASIC's parameters. Compare it to an Ethash ASIC that has as much bandwidth (or is normalized by some other parameter). Compare both against GPUs' hashrates and power consumption at ProgPoW and Ethash, respectively. Yes, this is much more work, maybe more than you volunteered for - but it'd be appreciated.

BTW, I put together and use/hack this C implementation for my experiments - https://github.com/solardiz/c-progpow - you might find it useful too, such as to count how many times some computation is done, etc.

solardiz avatar Mar 29 '19 16:03 solardiz

@Sonia-Chen As I now understand, your company has designed a very impressive Ethash ASIC:

Ethash Miner Announcement, ETC Summit Seoul, September 2018
Specs: Ethash, 1400 MH/s, 1000 Watts, price commitment 4-6 months ROI.
Schedule: 12/2018 TapeOut, 04/2019 Samples, 06/2019 Mass Production.

(BTW, this greatly exceeds ProgPoW designers' expectation that only a ~2x improvement over GPUs would be possible for Ethash.)

I understand you might not want to reveal too much detail prematurely as this is a competitive market, but it'd help the community to know that ASIC unit's total off-die memory bandwidth and across how many ASIC chips it's split. For GPU-like bandwidth per chip, I'm guessing it could be e.g. 50x 10W chips (leaving 500W for the memory), but that's nothing more than a guess to illustrate the question.

It'd be interesting if you managed to avoid the external bandwidth needs by producing everything on a DRAM process. This feels plausible.

It'd be even more interesting if you do use off-die memory, but you managed to reduce the external bandwidth needs per hash computed, which might be possible through doing a huge number of concurrent computations and (un)stalling groups of them in a way that lets you reuse fetches from external memory into on-die cache to advance an entire group of those concurrent computations per fetch. I suggested this in 2014 as a way to turn random access into semi-sequential access (for making use of more than one item per an otherwise unnecessarily wide memory fetch) with sequential memory-hard functions that use tiny lookups, but I think it also applies to parallelizable memory-hard functions such as Ethash (for making use of a fetch more than once). The main trade-off is in needing to maintain internal state for those many concurrent computations, which can become very large and would require plenty of on-die memory. Another trade-off is between the computations state memory and external fetches cache memory. The question is then whether there exists an optimal trade-off point given target hash type, its parameters, and current technology - and what it is.

Once you reveal some properties (such as external bandwidth, die area, hashrate, power) of one ASIC chip in that Ethash unit, it'd be easier for us all to see how much ProgPoW might add to that and whether that's significant added cost or not.

Thanks in advance! ;-)

solardiz avatar Apr 01 '19 14:04 solardiz

@tevador wrote:

There are ~33K calls to Merge() and ~18K calls to Math() per hash in ProgPoW 0.9.3.

I am getting 36864 and 20480. Block 30k:

Digest = 11f19805c58ab46610ff9c719dcf0a5f18fa2f1605798eef770c47219274767d
Merge 36864 total (8192 12288 5120 11264)
Math  20480 total (2048 4096 2048 1024 2048 1024 4096 1024 1024 2048 0)

Block 10M:

Digest = b2403f56c426177856eaf0eedd707c86ae78a432b9169c3689a67058fcf2a848
Merge 36864 total (9216 7168 10240 10240)
Math  20480 total (5120 2048 2048 2048 1024 0 1024 2048 1024 3072 1024)

BTW, quite often a certain math operation is left entirely unused. I think this is OK'ish since it'd be needed again 50 blocks later. But (in a tweak) we'd probably want to add code generation constraints to ensure we're never below a certain ratio of MULs.

solardiz avatar Apr 01 '19 17:04 solardiz

@solardiz I think your numbers are for ProgPoW 0.9.2 which has a higher number of math operations. See the table here: https://github.com/ifdefelse/ProgPOW#progpow-algorithm-walkthrough

tevador avatar Apr 01 '19 19:04 tevador

0.9.3 also has a shorter ProgPoW period of 10 blocks.

salanki avatar Apr 01 '19 19:04 salanki

@tevador Oh, you're right. But this means the test vectors we have in here are also for 0.9.2 only. So I wouldn't be able to easily test my code for computing 0.9.3 correctly just yet. Correct?

solardiz avatar Apr 01 '19 19:04 solardiz

Test vectors for 0.9.3 are yet to be produced AFAIK.

salanki avatar Apr 01 '19 19:04 salanki