miden-vm
miden-vm copied to clipboard
Improved design for power of two co-processor
The power of two operation currently requires 24 constraints (23 after #199). As noted by @bobbinth:
24 constraints is a lot. Granted, these constraints are quite simple, but still for a single pow2 operation feels like a bit of a waste.
Improving this significantly requires coming up with a new construction for the power of two co-processor. The current design is described here and the implementation is located here.
Another note from @bobbinth:
there might be other ways of computing a power of two which may be outside of the co-processor structure like it may be done at the expense of one extra column - but we should benchmark it to see if that's actually better
One other option: if we can simulate power of two computation using already existing instructions in something like 10 - 20 cycles, I think we can just move the pow2
operation to the standard library and get rid of the co-processor entirely.
After thinking about it, I've come up with a stdlib script which can be used for pow2 computation. The script takes 40 VM cycles.
export.pow2
# decompose the number to it's binary representation. Eg. 59 will be represent by
# 1100111 (most significant bit on the right most side or deeper in the stack)
push.32
u32div.unsafe
push.16
u32div.unsafe
push.8
u32div.unsafe
push.4
u32div.unsafe
push.2
u32div.unsafe
# aggregate the bits from the binary representation to form power of 2 of the original input.
# 2^59 can be aggregated using: ((2^0 - 1)*1 + 1) * ((2^1 - 1)*1 + 1) * ((2^2 - 1)*0 + 1) *
# ((2^3 - 1)*0 + 1) * ((2^4 - 1)*1 + 1) * ((2^5 - 1)*1 + 1)
push.1
swap
add.1
mul
swap
push.2*2 - 1
mul
add.1
mul
swap
push.2*4 - 1
mul
add.1
mul
swap
push.2*8 - 1
mul
add.1
mul
swap
push.2**16 - 1
mul
add.1
mul
swap
push.2**32 - 1
mul
add.1
mul
end
Decomposing a number to it's binary representation takes 10 cycles (not sure if it can be improved) and aggregation takes another 30 cycles.
Discussing this with @bobbinth, I determined that we can make use of a VM instruction which reduces the total operation to 9 VM cycles. The binacc
operation in the previous version of miden VM does something close to what we need, and the design is a bit similar.
The pseudo code for both updated pow2
opcode and modified binacc
instruction are as follows:
pow2
opcode: Eg. pow2.24
func parse_pow2(span_ops, op)
span_ops.push(Operation::Push(Felt::new(1)))
span_ops.push(Operation::Push(Felt::new(32)))
span_ops.push(Operation::Push(imm)
span_ops.push(Operation::Binacc)
span_ops.push(Operation::Binacc)
span_ops.push(Operation::Binacc)
span_ops.push(Operation::Binacc)
span_ops.push(Operation::Binacc)
span_ops.push(Operation::drop)
span_ops.push(Operation::drop)
binacc
VM instruction pseudo code:
func op_binacc(stack):
hashmap (stores power of 2 on 1, 2, 4, 8, 16, 32)
power = stack[0]
divisor = stack[1]
acc = stack[2]
q = power / divisor
r = power % divisor
if divisor is equal to 2:
acc = acc * ((hashmap(divisor) - 1) * q + 1) * (r + 1)
else:
acc = acc * ((hashmap(divisor) - 1) * q + 1)
power = r
divisor = divisor / 2
stack[0] = power
stack[1] = divisor
stack[2] = acc
Would love to know team's opinion on the above approach.
Thank you! My opinion is that 44 cycles is probably too many, and in the long term I favor the binacc
approach. One question is whether we should do it in stages and first implement the pow2
procedure using the simpler but less efficient approach. And once this is done, we can update this to a more efficient approach. Specifically, the steps could be:
- Update
pow2
assembly parser to use the 44 cycle procedure. - Remove
pow2
VM operation and the associated co-processor. - Implement
binacc
(or similar) VM operation. - Update
pow2
assembly parser with the more efficient implementation.
Regarding the binacc
operation - I'm not sure it will work as described above. Specifically, how would we describe hashmap(divisor)
in AIR? Unless I'm missing something, it seems like this operation maps a value to a power of two, which is what we are trying to implement in the first place.
In my mind, the binacc
operation would probably need to first push the binary decomposition of the value on the stack onto the advice tape (we'll need a new decorator for this), and then read bits from the advice tape one by one performing the necessary aggregation.
Thanks for your feedback! I agree with you on using modified binacc
approach as a long term solution.
One question is whether we should do it in stages and first implement the pow2 procedure using the simpler but less efficient approach
I think it makes sense that we divide it into parts rather than contributing all of it in one go. Coding/Reviewing would be easier this way.
Regarding the
binacc
operation - I'm not sure it will work as described above. Specifically, how would we describe hashmap(divisor) in AIR?
We need pow2 on 2, 4, 8,16 & 32 when we are aggregating the bits. Given, that these values are deterministic, one solution that I can think of right now is to keep them in the stack and track which pow2 should be picked in the current cycle. This might though increase the total number of VM cycles as we have to push & pop more elements from the stacks.
In my mind, the binacc operation would probably need to first push the binary decomposition of the value on the stack onto the advice tape (we'll need a new decorator for this), and then read bits from the advice tape one by one performing the necessary aggregation.
Thanks for the suggestion. It was handled similarly in v0.1. I will follow it in my implementation.
Discussion this with @bobbinth, the updated implementation of pow2
opcode and modified binacc
instruction are as follows:
function op_binacc(stack, tape)
bit = tape.pop()
exponent = stack[0]
acc = stack[1]
next_power = stack[2]
b = stack[3]
acc = acc * ((exponent - 1) * bit + 1)
b = b + bit * next_power
exponent = exponent * exponent
next_power = next_power * 2
h0 = bit
stack[0] = exponent
stack[1] = acc
stack[2] = next_power
stack[3] = b
function pow2(spans)
spans.push(0) # b
spans.push(dup)
spans.push(Incr) # next_power(2^0)
spans.push(dup) # pow2_acc
spans.push(dup)
spans.push(Incr) # exponent(2^(2^0))
spans.push(binacc)
spans.push(binacc)
spans.push(binacc)
spans.push(binacc)
spans.push(binacc)
spans.push(binacc)
spans.push(drop)
spans.push(movdn3)
spans.push(drop)
spans.push(eq)
spans.push(assert)
The instruction can now, therefore, be used to check whether a value fits into a certain number of bits or not. Though, pow2_acc will be undefined after some iteration of the modified binaac
.
Thank you! A couple of thoughts:
- We could probably put
bit
values into one of the helper registers (e.g.,h0
), this way only the first 4 elements of the stack are affected bybinacc
. - I think the sequence of
push
operations upfront is reversed. With this sequenceb
would end up on top of the stack. Also,push
operations are relatively expensive - so, we should try to minimize their number. At the very least, instead of pushing1
twice, it could bepush(1)
and thendup
. - We probably need a new decorator which takes a binary decomposition of the value on top of the stack and pushes it onto the advice type. This won't affect cycle count - but we should still add it here.
As the next step, it would be good to write out AIR constraint required for binacc
. Most of them would flow directly from your pseudocode, but there are a few extra (e.g., making sure that bit value is 0 or 1).
Thanks @bobbinth for the review!
We could probably put bit values into one of the helper registers (e.g., h0), this way only the first 4 elements of the stack are affected by binacc.
I'll keep this in mind in the implementation.
I think the sequence of push operations upfront is reversed. With this sequence b would end up on top of the stack. Also, push operations are relatively expensive - so, we should try to minimize their number. At the very least, instead of pushing 1 twice, it could be push(1) and then dup.
Agreed! Apologies for the confusion. I've updated the pseudo code to incorporate this. Meanwhile, I've replaced the push(1)
with dup
as you suggested. Post this operation we had an operation where we were pushing 2. I've replaced it with dup
followed by Incr
. Let me know if we should revert to the original push.2
here.
We probably need a new decorator which takes a binary decomposition of the value on top of the stack and pushes it onto the advice type. This won't affect cycle count - but we should still add it here.
Understood. I will update the pseudo code with the new decoder implementation.
As the next step, it would be good to write out AIR constraint required for binacc. Most of them would flow directly from your pseudocode, but there are a few extra (e.g., making sure that bit value is 0 or 1).
Sure. I will come up with the AIR constraints for binaac
soon.
Thank you! It seems like the procedure is taking ~17 cycles. So, I guess the question is whether this is a good enough of an improvement over the procedure which uses ~40 cycles to justify having the binacc
operation.
Binacc Air Constraints
I've adopted the following notation to describe AIR constraints: for column $z$, we represent the value in the current row & next row as $z$ and $z'$ respectively. $s_0 ... s_{15}$ are the columns representing the top $16$ slots of the stack.
The helper column value should remain binary.
$$ f_{binacc} \cdot (h_{0}^2 - h_{0}) = 0 $$
The transition constraint for exp
or $s_{0}$ is as follows:
$$ f_{binacc} \cdot (s_{0}' - s_{0} * s_{0}) = 0 $$
The transition constraint for acc
or $s_{1}$ is as follows:
$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0}' + 1 )) = 0 $$
The transition constraint for next_pow
or $s_{2}$ is as follows:
$$ f_{binacc} \cdot (s_{2}' - s_{2} * 2) = 0 $$
The transition constraint for b
or $s_{3}$ is as follows:
$$ f_{binacc} \cdot (s_{3}' - s_{2} * h_{0}' - s_{3}) = 0 $$
The transition constraint for the rest of the columns whose values should remain the same are as follows:
$$ f_{binacc} \cdot (s_{i}' - s_{i}) = 0 \ \text{ for } i \in {4, .., 15} $$
The transition constraint for
acc
or $s_{1}$ is as follows:
$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0}' + 1 )) = 0 $$
The degree of this constraint is coming out to be 10. I think the max allowed degree was 9? If it is the case then we will have to break this down into two formulas which will increase the VM cycle count by at least 2(push
& drop
).
Another way around it is if $f_{binacc}$ has degree less than $7$. But we'll need to see if we actually have enough of such operations available.
Thank you! It seems like the procedure is taking ~17 cycles. So, I guess the question is whether this is a good enough of an improvement over the procedure which uses ~40 cycles to justify having the binacc operation.
I was thinking of taking the value we are trying to raise 2 to as an immediate value to pow2
opcode & planning to raise an error if the immediate value in the assembly is anything but 0 ... 63. Given it's costly and we should avoid it, the new VM cycles for both safe(to check if the top value is less than 64 or not) and unsafe option where the value is at the top of the stack are as follows:
Safe: 46 cycles
Unsafe: 40 cycles
Meanwhile the binacc
operation can also be used to determine if the value can be represented by n number of bits.
Thinking about it a bit more, I believe we can simplify things a bit by changing slightly how binacc
works.
Assume the stack is in this state:
[0, exp, acc, a]
Applying binacc
to this state could result in the follwing:
[b, exp * exp, acc * ((exp - 1) * bit + 1), a >> 1]
Where $b$ is the least significant bit of $a$. The constraint for a' = a >> 1
could be $a' \cdot 2 + b = a$ - so, we don't need to waste a division to check this.
If this approach works, we could save a couple of cycles as we only need to verify that the final value of $a$ is 0.
We could also put bit values in h0
- this way, we should save a couple of cycles on PUSH
/DROP
.
I guess we then wouldn't need to do a binary decomposition of the value on top of the stack?
An high level implementation & AIR constraints of the above approach are as follows:
function op_Binacc(stack)
exponent = stack[0]
acc = stack[1]
b = stack[2]
bit = b & 1
acc = acc * ((exponent - 1) * bit + 1)
b = b >> 1
exponent = exponent * exponent
h0 = bit
stack[0] = exponent
stack[1] = acc
stack[2] = b
function pow2(spans)
spans.push(Pad)
spans.push(Incr) # acc = 1
spans.push(Dup)
spans.push(Incr) # exp = 2
spans.push(Binacc)
spans.push(Binacc)
spans.push(Binacc)
spans.push(Binacc)
spans.push(Binacc)
spans.push(Binacc)
spans.push(Drop)
spans.push(Swap)
spans.push(Eqz)
spans.push(Assert)
It will now take 14 VM cycles with no push
operation.
Binacc Air Constraints
I've adopted the following notation to describe AIR constraints: for column $z$, we represent the value in the current row & next row as $z$ and $z'$ respectively. $s_0 ... s_{15}$ are the columns representing the top $16$ slots of the stack.
The helper column value should remain binary.
$$ f_{binacc} \cdot (h_{0}^2 - h_{0}) = 0 $$
The transition constraint for exp
or $s_{0}$ is as follows:
$$ f_{binacc} \cdot (s_{0}' - s_{0} * s_{0}) = 0 $$
The transition constraint for acc
or $s_{1}$ is as follows:
$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0} + 1 )) = 0 $$
The transition constraint for a
or $s_{2}$ is as follows:
$$ f_{binacc} \cdot (2 \cdot s_{2}' + h_{0} - s_{2}) = 0 $$
The transition constraint for the rest of the columns whose values should remain the same are as follows:
$$ f_{binacc} \cdot (s_{i}' - s_{i}) = 0 \ \text{ for } i \in {3, .., 15} $$
Looks good! By convention, instructions can access helper registers in the current cycle (rather than the next). So, we should just replace $h_0'$ with $h_0$ in the above.
The task can be broken down into the following sequential parts:
- [X] Update pow2 assembly parser to use the 38 cycle procedure.
- [x] Remove pow2 VM operation and the associated co-processor.
- [ ] Implement binacc (or similar) VM operation.
- [ ] Update pow2 assembly parser with the more efficient implementation.
- [ ] Update mdbook
@bobbinth or @0xKanekiKen what's the status of this issue? The binacc PR has been open for a while. Is this change on hold?
Yeah - for now I'm not sure if we need a dedicated binacc
operation.
Closed by #420.