
AVM: Adding bmodexp

Open mangoplane opened this issue 1 year ago • 24 comments

Summary

Adds the new opcode bmodexp as described in issue #6139 to support modular exponentiation involving byte strings of up to 4096 bytes. Closes #6139

Test Plan

  • Relevant tests added to assembler_test.go, evalStateful_test.go, and eval_test.go
  • Opcode is tested with a range of test vectors in TestBytesModExp, covering panic cases, edge cases, acceptance cases, and failure cases (a sketch of such vectors appears below). Test vectors were generated manually or with Python.
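
For illustration, a minimal sketch of the style such vectors might take; the testAccepts/testPanics helper signatures are assumed from the logic package's test suite, and the values are illustrative, not copied from the PR:

```go
package logic

import "testing"

// Sketch only: testAccepts/testPanics are assumed to be the logic
// package's existing test helpers; the vectors are illustrative.
func TestBytesModExpSketch(t *testing.T) {
	// acceptance: 3^5 mod 7 = 243 mod 7 = 5
	testAccepts(t, "byte 0x03; byte 0x05; byte 0x07; bmodexp; byte 0x05; b==", 11)
	// panic: a zero modulus must fail, per the opcode spec
	testPanics(t, "byte 0x03; byte 0x05; byte 0x00; bmodexp; pop; int 1", 11)
}
```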

mangoplane avatar Sep 21 '24 10:09 mangoplane

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 21 '24 10:09 CLAassistant

Codecov Report

Attention: Patch coverage is 83.72093% with 7 lines in your changes missing coverage. Please review.

Project coverage is 56.26%. Comparing base (39f7485) to head (468210e). Report is 89 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| data/transactions/logic/opcodes.go | 75.86% | 4 Missing and 3 partials :warning: |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6140      +/-   ##
==========================================
- Coverage   56.28%   56.26%   -0.03%     
==========================================
  Files         494      494              
  Lines       69958    69999      +41     
==========================================
+ Hits        39375    39384       +9     
- Misses      27912    27936      +24     
- Partials     2671     2679       +8     

codecov[bot] avatar Sep 21 '24 11:09 codecov[bot]

I have addressed your feedback with a recent update, per my understanding. Let me know if there's anything that I may have misinterpreted.

mangoplane avatar Sep 22 '24 05:09 mangoplane

bmodexp is also useful in scenarios with smaller inputs, so we should not penalize those cost-wise. For instance, all ZKP protocols based on elliptic curves use it, and for efficiency they try to operate on the smallest field that preserves security.

As a concrete example, smart contract verifiers for zk-proofs based on the PLONK protocol, generated by AlgoPlonk, call a TEAL implementation of it 5 times, using 32-byte inputs, for both curve BN254 and BLS12-381.

Considering that a BN254 verifier consumes ~145,000 opcode budget and a BLS12-381 verifier ~185,000, bmodexp would really help bring down that cost.

giuliop avatar Sep 24 '24 08:09 giuliop

I'm doing this from my phone, but maxStringSize is the AVM max of 4096? That's unusual for the b-math functions; they normally have a maximum of 128 bytes. I suppose you want more because RSA uses really large keys?

Isn't the maximum 64 bytes?

If we can have a plausible linear model for the smaller inputs currently supported by the b-operations (64 bytes?) which breaks for very large inputs, a solution we might consider for consistency is to offer bmodexp that operates on bigint like the other b-operations and add a separate fixed-cost opcode that operates on 4096 byte strings, e.g. modexpstring

giuliop avatar Sep 24 '24 09:09 giuliop

> I'm doing this from my phone, but maxStringSize is the AVM max of 4096? That's unusual for the b-math functions; they normally have a maximum of 128 bytes. I suppose you want more because RSA uses really large keys?
>
> Isn't the maximum 64 bytes?
>
> If we can have a plausible linear model for the smaller inputs currently supported by the b-operations (64 bytes?) which breaks for very large inputs, a solution we might consider for consistency is to offer bmodexp that operates on bigint like the other b-operations and add a separate fixed-cost opcode that operates on 4096 byte strings, e.g. modexpstring

You make some good points. However, in my opinion, we should offer a single opcode to reduce complexity. With a sufficiently accurate cost model, such as the one proposed in this GitHub comment, the estimated cost will closely reflect the actual cost within a small margin of error. For example, using the log-polynomial model, the cost for a 64-byte input is 105, which is intuitively reasonable and relatively small. To simplify costing in that range, we could use a piecewise function where the cost is a flat 105 for all inputs up to 64 bytes in length and, for longer inputs, follows the full cost model (see the sketch below). The log-polynomial model also accurately describes the cost for much larger inputs.

Additionally, this seems to align with the long-term vision of allowing byte math opcodes, such as b+, to support up to 4096 bytes each, if I'm not mistaken based on the above discussion.

The opcode should work for inputs up to 4096 bytes, with several test cases exceeding the 64-byte limit. As a sanity check, I think it's worth adding another test case to verify the maximum supported length of 4096 bytes.
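
A minimal sketch of that piecewise idea (the function shape is mine; the flat cost of 105 and the 64-byte knee come from the discussion above, and modelCost stands in for whatever regression we adopt):

```go
// piecewiseCost charges a flat cost for small inputs and defers to a
// fitted model above the knee; maxLen is the largest input length in bytes.
func piecewiseCost(maxLen int, modelCost func(int) int) int {
	const knee, flat = 64, 105
	if maxLen <= knee {
		return flat
	}
	return modelCost(maxLen)
}
```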

mangoplane avatar Sep 24 '24 22:09 mangoplane

Just to close the loop, is the suggestion that this: x = exponent_length * max(base_length, modulus_length)**2 will work for the cost function (with appropriate scaling)?

I would be on board with that, and I'll just have to write an "escape hatch" to allow an arbitrary cost function to be written.

jannotti avatar Sep 25 '24 15:09 jannotti

Based on this eval some log-poly formula gives better approximation.

algorandskiy avatar Sep 25 '24 17:09 algorandskiy

> Just to close the loop, is the suggestion that this: x = exponent_length * max(base_length, modulus_length)**2 will work for the cost function (with appropriate scaling)?

Looks reasonable to me: the number of iterations is exponent_length, and the work performed per iteration is, to a first approximation, a modular squaring, (base * base) mod modulus, so proportional to max(base_length, modulus_length)**2
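
For intuition, here is the textbook left-to-right square-and-multiply loop behind that estimate. This is a sketch, not the opcode's implementation (the AVM would presumably call math/big's far more optimized Exp):

```go
package main

import (
	"fmt"
	"math/big"
)

// modexp does one squaring per exponent bit (the exponent_length factor),
// each on numbers up to the modulus size (the max(len)**2 factor).
func modexp(base, exp, mod *big.Int) *big.Int {
	result := big.NewInt(1)
	for i := exp.BitLen() - 1; i >= 0; i-- {
		result.Mul(result, result).Mod(result, mod) // square
		if exp.Bit(i) == 1 {
			result.Mul(result, base).Mod(result, mod) // conditional multiply
		}
	}
	return result
}

func main() {
	fmt.Println(modexp(big.NewInt(3), big.NewInt(5), big.NewInt(7))) // 5
}
```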

giuliop avatar Sep 25 '24 17:09 giuliop

> Just to close the loop, is the suggestion that this: x = exponent_length * max(base_length, modulus_length)**2 will work for the cost function (with appropriate scaling)?
>
> I would be on board with that, and I'll just have to write an "escape hatch" to allow an arbitrary cost function to be written.

I tried out that model after reading EIP-2565, but I found it highly inaccurate ($R^2$ of 0.5) compared to the log-poly formula ($R^2$ of 0.96):

x = c1*log(C) + c2*log(A)*log(C) + c3*log(B)*log(C) + c4*log(A)*log(B)*log(C)

where c1, c2, c3, and c4 are coefficients calculated by linear regression.
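
As a sketch, evaluating that model is a one-liner once the regression has produced the coefficients (values are left as parameters here, since they depend on the benchmark dataset):

```go
package main

import (
	"fmt"
	"math"
)

// logPolyCost evaluates the log-poly model for input byte lengths a, b, c;
// c1..c4 are regression coefficients and dataset-dependent.
func logPolyCost(a, b, c, c1, c2, c3, c4 float64) float64 {
	la, lb, lc := math.Log(a), math.Log(b), math.Log(c)
	return c1*lc + c2*la*lc + c3*lb*lc + c4*la*lb*lc
}

func main() {
	// Hypothetical coefficients, purely for illustration.
	fmt.Println(logPolyCost(64, 64, 64, 1.0, 0.1, 0.1, 0.01))
}
```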

I think this is because the log transformation turns exponents into coefficients and multiplications into additions. The algorithm used for modexp is likely very sophisticated, making use of the best approaches known.

It's possible that my data isn't accurate, although I don't see how that could be. Perhaps it's worth trying to reproduce my results. I can provide my benchmark code upon request.

mangoplane avatar Sep 25 '24 22:09 mangoplane

I benchmarked bmodexp on my own and exponent_length * max(base_length, modulus_length)**2 looks reasonable to me.

I benchmarked using the same byte length for all three inputs, base, mod, and exp, and I get:

| Benchmark | Iterations | Time (ns/op) | Extra (extra/op) |
|---|---|---|---|
| BenchmarkBModExp/modexp_32_bytes_inputs-12 | 78,057 | 15,019 | 4.000 |
| BenchmarkBModExp/modexp_64_bytes_inputs-12 | 18,218 | 66,383 | 4.000 |
| BenchmarkBModExp/modexp_128_bytes_inputs-12 | 3,776 | 322,610 | 4.000 |
| BenchmarkBModExp/modexp_256_bytes_inputs-12 | 556 | 2,239,315 | 4.000 |
| BenchmarkBModExp/modexp_512_bytes_inputs-12 | 63 | 18,215,421 | 4.000 |
| BenchmarkBModExp/modexp_1024_bytes_inputs-12 | 8 | 142,717,656 | 4.000 |
| BenchmarkBModExp/modexp_2048_bytes_inputs-12 | 1 | 1,147,616,375 | 4.000 |
| BenchmarkBModExp/modexp_4096_bytes_inputs-12 | 1 | 9,566,156,542 | 4.000 |

If we divide the ns/op by inputs_length**3 we get:

| Inputs Len | Cost (ns/op) | Inputs Len³ | Cost / Inputs Len³ |
|---|---|---|---|
| 32 | 15,019 | 32,768 | 0.46 |
| 64 | 66,383 | 262,144 | 0.25 |
| 128 | 322,610 | 2,097,152 | 0.15 |
| 256 | 2,239,315 | 16,777,216 | 0.13 |
| 512 | 18,215,421 | 134,217,728 | 0.14 |
| 1024 | 142,717,656 | 1,073,741,824 | 0.13 |
| 2048 | 1,147,616,375 | 8,589,934,592 | 0.13 |
| 4096 | 9,566,156,542 | 68,719,476,736 | 0.14 |

Looks like the cost stabilizes after an input length of 128 bytes. I think we can make the cost proportional to exponent_length * max(base_length, modulus_length)**2, and perhaps add a constant factor to account for the higher relative cost of small inputs.

This is the benchmark function I'm using:

```go
func BenchmarkBModExp(b *testing.B) {
	for _, byteLen := range []int{32, 64, 128, 256, 512, 1024, 2048, 4096} {
		var base, exp, mod []byte
		for _, input := range []*[]byte{&base, &exp, &mod} {
			// create random inputs of the given length, we are using math/rand without seeding,
			// so it will generate the same pseudorandom numbers for each run
			*input = make([]byte, byteLen)
			for i := range *input {
				(*input)[i] = byte(rand.Intn(256))
			}
		}
		ops := fmt.Sprintf("byte 0x%x; byte 0x%x; byte 0x%x; bmodexp; pop", base, exp, mod)
		b.Run(fmt.Sprintf("modexp_%d_bytes_inputs", byteLen), func(b *testing.B) {
			benchmarkOperation(b, "", ops, "int 1")
		})
	}
}
```

giuliop avatar Sep 26 '24 09:09 giuliop

@mangoplane could you commit the benchmarks into crypto_test.go - that's where other cost evaluation benchmarks for crypto opcodes live - I'll try to repro/replay the notebook.

@giuliop how about re-running modexp_1024+ (like -count=64 I guess) to get a better avg value?

algorandskiy avatar Sep 26 '24 19:09 algorandskiy

> @giuliop how about re-running modexp_1024+ (like -count=64 I guess) to get a better avg value?

sure, I reran using -benchtime=64x to get 64 runs (except for 4096 bytes, which would time out, so I ran it 32 times) and the results are in line with before:

| Benchmark | Trials | Time per op (ns/op) | Extra ops |
|---|---|---|---|
| BenchmarkBModExp1/modexp_1024_bytes_inputs-12 | 64 | 142,368,141 | 4.000 |
| BenchmarkBModExp1/modexp_2048_bytes_inputs-12 | 64 | 1,139,738,240 | 4.000 |
| BenchmarkBModExp1/modexp_4096_bytes_inputs-12 | 32 | 9,443,533,249 | 4.000 |

Looks to me like we can use exponent_length * max(base_length, modulus_length)**2 and not make it more complicated than that

giuliop avatar Sep 27 '24 10:09 giuliop

Thanks @giuliop for the insightful benchmarks. I replicated your results when the base length is at least that of the modulus. For smaller bases, the cost is overestimated, explaining the low $R^2$ I had for my test data. Since most bmodexp applications (like RSA) involve base length ≥ modulus length, I agree that exponent_length * max(base_length, modulus_length)**2 is a sufficient approximation.

And @algorandskiy, I have provided my benchmark code to reproduce the results of the notebook. Note that it isn't using a seed, so each run will produce slightly different results, but the overall trend should be the same.

mangoplane avatar Sep 27 '24 12:09 mangoplane

Looking back at the benchmark, the 4096-byte case for all three parameters takes 9 sec to run, so it's not feasible on the AVM, unfortunately. @jannotti what would you consider the upper limit in terms of how long an opcode can run?

giuliop avatar Oct 02 '24 08:10 giuliop

I'm about to modify opcodes.go and its dependencies to accommodate a custom cost function. Below is my current plan.

  • Extend the linearCost type to include a field called customCost of type CustomCost.
  • CustomCost is a type that provides a compute function, which calculates the cost from the stack and is defined on a per-opcode basis. It also carries a string that serves as a description of the function for documentation generation.
  • Adjust the compute function of linearCost so that CustomCost.compute is applied to the inputs when customCost is defined; otherwise, the cost is computed as usual. A similar adjustment must be made to linearCost.docCost.

This would involve minimal changes to the codebase, if I'm not mistaken. The opcodes currently assume a constant cost, a linear cost based on a single input, or a linear cost dependent on an immediate (enum values that follow the opcode to describe its configuration). This addition preserves that pattern, with special exceptions only where a CustomCost instance is supplied. A rough sketch follows.
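
In Go terms, the plan might look roughly like this; names and fields are a sketch of the proposal above, not the final code in opcodes.go:

```go
package logic

// stackValue stands in for the logic package's internal stack element.
type stackValue struct{ Bytes []byte }

// CustomCost pairs a per-opcode cost function with a description string
// used for documentation generation.
type CustomCost struct {
	compute func(stack []stackValue) int
	doc     string
}

type linearCost struct {
	baseCost   int
	chunkCost  int
	chunkSize  int
	customCost *CustomCost // nil for ordinary constant/linear costs
}

func (lc linearCost) compute(stack []stackValue) int {
	if lc.customCost != nil {
		return lc.customCost.compute(stack) // bmodexp-style escape hatch
	}
	// ... existing constant/linear cost computation ...
	return lc.baseCost
}
```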

Interested in hearing everyone's thoughts on this approach. If people think it's a good idea, I am happy to proceed with its implementation.

mangoplane avatar Oct 02 '24 10:10 mangoplane

> Looking back at the benchmark, the 4096-byte case for all three parameters takes 9 sec to run, so it's not feasible on the AVM, unfortunately. @jannotti what would you consider the upper limit in terms of how long an opcode can run?

That's a problem! We aim for 10,000 txn/sec! The most you can execute in a single txn is 16 × 20,000 "ticks" (by pooling logicsigs). A "tick" is about 15 ns. So a single program can run for about 320,000 × 15e-9 s, i.e. roughly 0.0048 secs.

So, is it worth getting the cost function "right" when that simply means that large operations will be (rightly) impossible, or should we drastically limit the lengths and perhaps use a simpler cost? Either way, operations on really large keys seem impossible.

All is not lost: RSA keys seem to be at most 4k BITS, not BYTES. Does that end up being 8^3 cheaper, because of the squaring of the multiplicand length and the factor of 8 from the exponent?

9/512 = 0.017, which starts to become almost feasible.

jannotti avatar Oct 03 '24 19:10 jannotti

I reproduced the results, and the log-poly model fits my data points nearly perfectly:

log_A coefficient: 0.0
log_B coefficient: 0.30924400959422704
log_C coefficient: 1.2483621313283038
log_A_log_B coefficient: 0.0
log_A_log_C coefficient: 0.0
log_B_log_C coefficient: 0.10928549552529052
log_A_log_B_log_C coefficient: 8.62725548783673e-05
Intercept: 0.0
R-squared (on log scale): 0.9964
Root Mean Squared Error: 98143.7211 gas units

[figure: plot of the fitted log-poly model against the benchmark data]

algorandskiy avatar Oct 03 '24 20:10 algorandskiy

> All is not lost: RSA keys seem to be at most 4k BITS, not BYTES. Does that end up being 8^3 cheaper, because of the squaring of the multiplicand length and the factor of 8 from the exponent?
>
> 9/512 = 0.017, which starts to become almost feasible.

Indeed, benchmarking at 512-byte length for all parameters gives 0.018 sec:

| Inputs Len (bytes) | ns / op | Inputs Len³ | ns / Inputs Len³ |
|---|---|---|---|
| 32 | 15,019 | 32,768 | 0.46 |
| 64 | 66,383 | 262,144 | 0.25 |
| 128 | 322,610 | 2,097,152 | 0.15 |
| 256 | 2,239,315 | 16,777,216 | 0.13 |
| 512 | 18,215,421 | 134,217,728 | 0.14 |

That means, though, that 512 bytes is still too large, given that the max for a single program is 20,000 * 16 * 15 ns = 0.0048 sec, so the max length for bmodexp's parameters can be 256 bytes.

This assumes we want to cap all parameters at the same size. @mangoplane for your use case what is the max size of each of the three parameters?

If the assumption that we want to cap all parameters at the same size holds, I guess we have two options:

  1. Introduce bmodexp for up to 64-byte parameters, for consistency with all other b-operations
  2. Introduce bmodexp for up to 256-byte parameters (maybe calling it modexp)

I would suggest doing option 1 now and then doing the work to graduate all b-operations to 256-byte parameters

giuliop avatar Oct 06 '24 16:10 giuliop

For the cost function I would use this formula to determine the actual opcode cost, i.e., the number of ticks:

max(base_length, mod_length)^1.63 * exp_length / 15 + 200

where length is the number of bytes.
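
A direct Go transcription of that formula, as a sketch (truncating to an integer tick count is my assumption, not part of the proposal):

```go
package main

import (
	"fmt"
	"math"
)

// bmodexpTicks transcribes max(base_length, mod_length)^1.63 * exp_length / 15 + 200.
func bmodexpTicks(baseLen, expLen, modLen int) int {
	n := float64(max(baseLen, modLen))
	return int(math.Pow(n, 1.63)*float64(expLen)/15) + 200
}

func main() {
	// RS256-style shape: 256-byte modulus and signature, 3-byte exponent.
	fmt.Println(bmodexpTicks(256, 3, 256)) // ~1,900 ticks, well within the 320,000-tick pool
}
```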

This is what the benchmarking looks like; here I am keeping base and mod at the same byte length and varying the exp length. I calculate the number of 'ticks' by dividing the ns/op reported by go test by 15.

| Base & Mod Len | Exp Len | Average ns/op | Average 'ticks' | Formula 'ticks' |
|---|---|---|---|---|
| 32 | 16 | 6,445 | 430 | 503 |
| 32 | 24 | 9,274 | 618 | 654 |
| 32 | 32 | 12,665 | 844 | 806 |
| 64 | 16 | 18,432 | 1,229 | 1,138 |
| 64 | 32 | 33,163 | 2,211 | 2,076 |
| 64 | 48 | 52,550 | 3,503 | 3,013 |
| 64 | 64 | 70,248 | 4,683 | 3,951 |
| 128 | 32 | 82,731 | 5,515 | 6,005 |
| 128 | 64 | 160,046 | 10,670 | 11,810 |
| 128 | 96 | 239,196 | 15,946 | 17,615 |
| 128 | 128 | 315,648 | 21,043 | 23,420 |
| 256 | 64 | 579,067 | 38,604 | 36,135 |
| 256 | 128 | 1,166,991 | 77,799 | 72,070 |
| 256 | 192 | 1,652,136 | 110,142 | 108,006 |
| 256 | 256 | 2,121,371 | 141,425 | 143,941 |

This is the function used for the benchmarking:

```go
func BenchmarkBModExp2(b *testing.B) {
    // Define the base and mod lengths, and corresponding exp lengths
    for _, byteLen := range []int{32, 64, 128, 256} {
        var expLens []int
        switch byteLen {
        case 32:
            expLens = []int{16, 24, 32}
        case 64:
            expLens = []int{16, 32, 48, 64}
        case 128:
            expLens = []int{32, 64, 96, 128}
        case 256:
            expLens = []int{64, 128, 192, 256}
        }

        for _, expLen := range expLens {
            // Generate base and mod of length byteLen
            base := make([]byte, byteLen)
            mod := make([]byte, byteLen)
            for _, input := range []*[]byte{&base, &mod} {
                for i := range *input {
                    (*input)[i] = byte(mathrand.Intn(256))
                }
            }

            // Generate exp of varying lengths
            exp := make([]byte, expLen)
            for i := range exp {
                exp[i] = byte(mathrand.Intn(256))
            }

            ops := fmt.Sprintf("byte 0x%x; byte 0x%x; byte 0x%x; bmodexp; pop", base, exp, mod)

            b.Run(fmt.Sprintf("modexp_base&mod=%d_bytes_exp=%d_bytes", byteLen, expLen), func(b *testing.B) {
                benchmarkOperation(b, "", ops, "int 1")
            })
        }
    }
}
```

giuliop avatar Oct 06 '24 17:10 giuliop

> All is not lost: RSA keys seem to be at most 4k BITS, not BYTES. Does that end up being 8^3 cheaper, because of the squaring of the multiplicand length and the factor of 8 from the exponent? 9/512 = 0.017, which starts to become almost feasible.

> Indeed, benchmarking at 512-byte length for all parameters gives 0.018 sec [table above]
>
> That means, though, that 512 bytes is still too large, given that the max for a single program is 20,000 * 16 * 15 ns = 0.0048 sec, so the max length for bmodexp's parameters can be 256 bytes.
>
> This assumes we want to cap all parameters at the same size. @mangoplane for your use case what is the max size of each of the three parameters?
>
> If the assumption that we want to cap all parameters at the same size holds, I guess we have two options:
>
> 1. Introduce `bmodexp` for up to 64-byte parameters, for consistency with all other b-operations
> 2. Introduce `bmodexp` for up to 256-byte parameters (maybe calling it `modexp`)
>
> I would suggest doing option 1 now and then doing the work to graduate all b-operations to 256-byte parameters

Thanks for your input and feedback, everyone. I have what I believe is a solution that satisfies the tick limit without ruling out many useful applications of the new opcode. Interested to hear your thoughts @jannotti @giuliop @algorandskiy:

In practice, it's rare for the exponent in bmodexp to be long, as seen in applications like RSA: in RS256 the exponent is 3 bytes. Furthermore, there's no need to cap the lengths to bound execution time if the cost model is accurate, since the dynamic cost budget (capped at 16 × 20,000 ticks) would be exhausted before the time limit is surpassed. Generous length limits combined with this cost model let developers allocate lengths as they see fit, fitting them within the cost constraints. In the case of RS256, the exponent would be small while the other arguments are large, keeping execution well within the budget estimated by the cost model and, therefore, within the execution time limit.

Therefore, I propose option 3:

  3. Introduce bmodexp for up to 1,024-byte inputs, paired with a robust cost model that prevents exceeding the tick limit. This approach places the responsibility on developers to adjust input lengths until the cost fits within 16 pooled LSIGs, keeping them below the maximum tick count. This flexibility makes RSA viable, given that the exponent is only 3 bytes in RS256.

Another feasible option is to set the limits low for bmodexp but add a specialized RS256 opcode that fixes a small exponent and allows a larger modulus and signature than bmodexp can accommodate.

mangoplane avatar Oct 08 '24 23:10 mangoplane

I think option 3 can work; developers can always check the sizes of their inputs at runtime if needed and handle in code the cases where the opcode budget would be exceeded.

giuliop avatar Oct 09 '24 07:10 giuliop

> I think option 3 can work; developers can always check the sizes of their inputs at runtime if needed and handle in code the cases where the opcode budget would be exceeded.

That sounds like a plan. I'll get to work implementing the cost model, following my earlier suggestion of adding an optional customCost field (holding a custom cost function) to linearCost, which should work and keep the code reasonably clean.

mangoplane avatar Oct 11 '24 19:10 mangoplane

Hey there,

I've pushed my changes to handle non-linear opcode costs, following the plan I outlined earlier. The tests were extended to also verify that the cost is accurately calculated. I performed a sanity check for documentation generation and confirmed that the docs were generated without errors, containing the following information:

TEAL_opcodes_v11.md sanity check:

## bmodexp

- Bytecode: 0xe6
- Stack: ..., A: []byte, B: []byte, C: []byte → ..., []byte
- A raised to the Bth power modulo C. A, B and C are interpreted as big-endian unsigned integers limited to 4096 bytes. Fail if C is zero.
- **Cost**: ((len(B) * max(len(A), len(C)) ^ 1.63) / 15) + 200
- Availability: v11

langspec_v11.json sanity check:

{
  "Opcode": 230,
  "Name": "bmodexp",
  "Args": [
    "[]byte",
    "[]byte",
    "[]byte"
  ],
  "Returns": [
    "[]byte"
  ],
  "Size": 1,
  "DocCost": "((len(B) * max(len(A), len(C)) ^ 1.63) / 15) + 200",
  "Doc": "A raised to the Bth power modulo C. A, B and C are interpreted as big-endian unsigned integers limited to 4096 bytes. Fail if C is zero.",
  "IntroducedVersion": 11,
  "Groups": [
    "Byte Array Arithmetic"
  ]
}

I thought it might be worthwhile to leave the bmodexp argument size limits as they are, considering it should be impossible to exceed the time limit under the current cost model, where the responsibility lies with the developer to allocate sizes within the budget, as discussed. Let me know your thoughts.

mangoplane avatar Oct 14 '24 03:10 mangoplane

@jannotti I was wondering if it wouldn't be appropriate to have the proposed opcode cost-function parameters in the consensus configuration (instead of hardcoding them in opcodes.go - https://github.com/algorand/go-algorand/pull/6140/commits/468210e44adbd29a6dca7acc2d342707eb8cb2b4).

cusma avatar Nov 15 '24 14:11 cusma

> @jannotti I was wondering if it wouldn't be appropriate to have the proposed opcode cost-function parameters in the consensus configuration (instead of hardcoding them in opcodes.go - 468210e).

We need to be able to tie cost functions to AVM versions, rather than consensus versions, so that existing code does not change behavior. That's why all the costs are set in the logic package, based on the version of the bytecode running.

jannotti avatar Nov 15 '24 14:11 jannotti

> We need to be able to tie cost functions to AVM versions, rather than consensus versions, so that existing code does not change behavior. That's why all the costs are set in the logic package, based on the version of the bytecode running.

Right! My bad, I forgot about the decoupled bytecode version requirement. Thanks for clarifying!

cusma avatar Nov 15 '24 14:11 cusma

@mangoplane could you merge master in?

algorandskiy avatar Mar 08 '25 19:03 algorandskiy

> @mangoplane could you merge master in?

I created a larger PR (#6270) addressing some related opcodes, and using the very nice custom cost function infrastructure added by @mangoplane .

I'll close this and handle merging master into that one.

jannotti avatar Mar 10 '25 14:03 jannotti