miden-vm icon indicating copy to clipboard operation
miden-vm copied to clipboard

Improved design for power of two co-processor

Open grjte opened this issue 2 years ago • 17 comments

The power of two operation currently requires 24 constraints (23 after #199). As noted by @bobbinth:

24 constraints is a lot. Granted, these constraints are quite simple, but still for a single pow2 operation feels like a bit of a waste.

Improving this significantly requires coming up with a new construction for the power of two co-processor. The current design is described here and the implementation is located here.

grjte avatar May 13 '22 06:05 grjte

Another note from @bobbinth:

there might be other ways of computing a power of two which may be outside of the co-processor structure like it may be done at the expense of one extra column - but we should benchmark it to see if that's actually better

grjte avatar May 13 '22 06:05 grjte

One other option: if we can simulate power of two computation using already existing instructions in something like 10 - 20 cycles, I think we can just move the pow2 operation to the standard library and get rid of the co-processor entirely.

bobbinth avatar Jun 23 '22 06:06 bobbinth

After thinking about it, I've come up with a stdlib script which can be used for pow2 computation. The script takes 40 VM cycles.

export.pow2

    # decompose the number to it's binary representation. Eg. 59 will be represent by 
    # 1100111 (most significant bit on the right most side or deeper in the stack)
    
    push.32
    u32div.unsafe
    push.16
    u32div.unsafe
    push.8
    u32div.unsafe
    push.4
    u32div.unsafe
    push.2
    u32div.unsafe
    
    # aggregate the bits from the binary representation to form power of 2 of the original input. 
    # 2^59 can be aggregated using: ((2^0 - 1)*1 + 1) * ((2^1 - 1)*1 + 1) * ((2^2 - 1)*0 + 1) * 
    # ((2^3 - 1)*0 + 1) * ((2^4 - 1)*1 + 1) * ((2^5 - 1)*1 + 1)
    
    push.1
    swap
    add.1
    mul
    swap
    push.2*2 - 1
    mul
    add.1
    mul
    swap
    push.2*4 - 1
    mul
    add.1
    mul
    swap
    push.2*8 - 1
    mul
    add.1
    mul
    swap
    push.2**16 - 1
    mul
    add.1
    mul
    swap
    push.2**32 - 1
    mul
    add.1
    mul
end 

Decomposing a number to it's binary representation takes 10 cycles (not sure if it can be improved) and aggregation takes another 30 cycles.

Discussing this with @bobbinth, I determined that we can make use of a VM instruction which reduces the total operation to 9 VM cycles. The binacc operation in the previous version of miden VM does something close to what we need, and the design is a bit similar.

The pseudo code for both updated pow2 opcode and modified binacc instruction are as follows:

pow2 opcode: Eg. pow2.24

func parse_pow2(span_ops, op)
    span_ops.push(Operation::Push(Felt::new(1)))
    span_ops.push(Operation::Push(Felt::new(32)))
    span_ops.push(Operation::Push(imm)
    span_ops.push(Operation::Binacc)
    span_ops.push(Operation::Binacc)
    span_ops.push(Operation::Binacc)
    span_ops.push(Operation::Binacc)
    span_ops.push(Operation::Binacc)
    span_ops.push(Operation::drop)
    span_ops.push(Operation::drop)

binacc VM instruction pseudo code:

func op_binacc(stack):

    hashmap (stores power of 2 on 1, 2, 4, 8, 16, 32)
    
    power = stack[0]
    divisor = stack[1]
    acc = stack[2]
    
    q = power / divisor
    r = power % divisor
    
    if divisor is equal to 2:
        acc = acc * ((hashmap(divisor) - 1) * q + 1) * (r + 1) 
    else:
        acc = acc * ((hashmap(divisor) - 1) * q + 1)
    power = r
    divisor = divisor / 2
    
    stack[0] = power
    stack[1] = divisor
    stack[2] = acc

Would love to know team's opinion on the above approach.

0xkanekiken avatar Jun 27 '22 06:06 0xkanekiken

Thank you! My opinion is that 44 cycles is probably too many, and in the long term I favor the binacc approach. One question is whether we should do it in stages and first implement the pow2 procedure using the simpler but less efficient approach. And once this is done, we can update this to a more efficient approach. Specifically, the steps could be:

  1. Update pow2 assembly parser to use the 44 cycle procedure.
  2. Remove pow2 VM operation and the associated co-processor.
  3. Implement binacc (or similar) VM operation.
  4. Update pow2 assembly parser with the more efficient implementation.

Regarding the binacc operation - I'm not sure it will work as described above. Specifically, how would we describe hashmap(divisor) in AIR? Unless I'm missing something, it seems like this operation maps a value to a power of two, which is what we are trying to implement in the first place.

In my mind, the binacc operation would probably need to first push the binary decomposition of the value on the stack onto the advice tape (we'll need a new decorator for this), and then read bits from the advice tape one by one performing the necessary aggregation.

bobbinth avatar Jun 27 '22 19:06 bobbinth

Thanks for your feedback! I agree with you on using modified binacc approach as a long term solution.

One question is whether we should do it in stages and first implement the pow2 procedure using the simpler but less efficient approach

I think it makes sense that we divide it into parts rather than contributing all of it in one go. Coding/Reviewing would be easier this way.

Regarding the binacc operation - I'm not sure it will work as described above. Specifically, how would we describe hashmap(divisor) in AIR?

We need pow2 on 2, 4, 8,16 & 32 when we are aggregating the bits. Given, that these values are deterministic, one solution that I can think of right now is to keep them in the stack and track which pow2 should be picked in the current cycle. This might though increase the total number of VM cycles as we have to push & pop more elements from the stacks.

In my mind, the binacc operation would probably need to first push the binary decomposition of the value on the stack onto the advice tape (we'll need a new decorator for this), and then read bits from the advice tape one by one performing the necessary aggregation.

Thanks for the suggestion. It was handled similarly in v0.1. I will follow it in my implementation.

0xkanekiken avatar Jun 27 '22 22:06 0xkanekiken

Discussion this with @bobbinth, the updated implementation of pow2 opcode and modified binacc instruction are as follows:

function op_binacc(stack, tape)
    bit = tape.pop()
    exponent = stack[0]
    acc = stack[1]
    next_power = stack[2]
    b = stack[3]

    acc = acc * ((exponent - 1) * bit + 1)
    b = b + bit * next_power

    exponent = exponent * exponent
    next_power = next_power * 2
    
    h0 = bit
    stack[0] = exponent
    stack[1] = acc
    stack[2] = next_power
    stack[3] = b
    
function pow2(spans)
    spans.push(0) # b
    spans.push(dup)
    spans.push(Incr) # next_power(2^0)
    spans.push(dup) # pow2_acc 
    spans.push(dup)
    spans.push(Incr) # exponent(2^(2^0))
    spans.push(binacc)
    spans.push(binacc)
    spans.push(binacc)
    spans.push(binacc)
    spans.push(binacc)
    spans.push(binacc)
    spans.push(drop)
    spans.push(movdn3)
    spans.push(drop)
    spans.push(eq)
    spans.push(assert)

The instruction can now, therefore, be used to check whether a value fits into a certain number of bits or not. Though, pow2_acc will be undefined after some iteration of the modified binaac.

0xkanekiken avatar Jun 30 '22 11:06 0xkanekiken

Thank you! A couple of thoughts:

  1. We could probably put bit values into one of the helper registers (e.g., h0), this way only the first 4 elements of the stack are affected by binacc.
  2. I think the sequence of push operations upfront is reversed. With this sequence b would end up on top of the stack. Also, push operations are relatively expensive - so, we should try to minimize their number. At the very least, instead of pushing 1 twice, it could be push(1) and then dup.
  3. We probably need a new decorator which takes a binary decomposition of the value on top of the stack and pushes it onto the advice type. This won't affect cycle count - but we should still add it here.

As the next step, it would be good to write out AIR constraint required for binacc. Most of them would flow directly from your pseudocode, but there are a few extra (e.g., making sure that bit value is 0 or 1).

bobbinth avatar Jun 30 '22 15:06 bobbinth

Thanks @bobbinth for the review!

We could probably put bit values into one of the helper registers (e.g., h0), this way only the first 4 elements of the stack are affected by binacc.

I'll keep this in mind in the implementation.

I think the sequence of push operations upfront is reversed. With this sequence b would end up on top of the stack. Also, push operations are relatively expensive - so, we should try to minimize their number. At the very least, instead of pushing 1 twice, it could be push(1) and then dup.

Agreed! Apologies for the confusion. I've updated the pseudo code to incorporate this. Meanwhile, I've replaced the push(1) with dup as you suggested. Post this operation we had an operation where we were pushing 2. I've replaced it with dup followed by Incr. Let me know if we should revert to the original push.2 here.

We probably need a new decorator which takes a binary decomposition of the value on top of the stack and pushes it onto the advice type. This won't affect cycle count - but we should still add it here.

Understood. I will update the pseudo code with the new decoder implementation.

As the next step, it would be good to write out AIR constraint required for binacc. Most of them would flow directly from your pseudocode, but there are a few extra (e.g., making sure that bit value is 0 or 1).

Sure. I will come up with the AIR constraints for binaac soon.

0xkanekiken avatar Jun 30 '22 17:06 0xkanekiken

Thank you! It seems like the procedure is taking ~17 cycles. So, I guess the question is whether this is a good enough of an improvement over the procedure which uses ~40 cycles to justify having the binacc operation.

bobbinth avatar Jun 30 '22 21:06 bobbinth

Binacc Air Constraints

I've adopted the following notation to describe AIR constraints: for column $z$, we represent the value in the current row & next row as $z$ and $z'$ respectively. $s_0 ... s_{15}$ are the columns representing the top $16$ slots of the stack.

The helper column value should remain binary.

$$ f_{binacc} \cdot (h_{0}^2 - h_{0}) = 0 $$

The transition constraint for exp or $s_{0}$ is as follows:

$$ f_{binacc} \cdot (s_{0}' - s_{0} * s_{0}) = 0 $$

The transition constraint for acc or $s_{1}$ is as follows:

$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0}' + 1 )) = 0 $$

The transition constraint for next_pow or $s_{2}$ is as follows:

$$ f_{binacc} \cdot (s_{2}' - s_{2} * 2) = 0 $$

The transition constraint for b or $s_{3}$ is as follows:

$$ f_{binacc} \cdot (s_{3}' - s_{2} * h_{0}' - s_{3}) = 0 $$

The transition constraint for the rest of the columns whose values should remain the same are as follows:

$$ f_{binacc} \cdot (s_{i}' - s_{i}) = 0 \ \text{ for } i \in {4, .., 15} $$

0xkanekiken avatar Jun 30 '22 21:06 0xkanekiken

The transition constraint for acc or $s_{1}$ is as follows:

$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0}' + 1 )) = 0 $$

The degree of this constraint is coming out to be 10. I think the max allowed degree was 9? If it is the case then we will have to break this down into two formulas which will increase the VM cycle count by at least 2(push & drop).

0xkanekiken avatar Jun 30 '22 21:06 0xkanekiken

Another way around it is if $f_{binacc}$ has degree less than $7$. But we'll need to see if we actually have enough of such operations available.

bobbinth avatar Jun 30 '22 21:06 bobbinth

Thank you! It seems like the procedure is taking ~17 cycles. So, I guess the question is whether this is a good enough of an improvement over the procedure which uses ~40 cycles to justify having the binacc operation.

I was thinking of taking the value we are trying to raise 2 to as an immediate value to pow2 opcode & planning to raise an error if the immediate value in the assembly is anything but 0 ... 63. Given it's costly and we should avoid it, the new VM cycles for both safe(to check if the top value is less than 64 or not) and unsafe option where the value is at the top of the stack are as follows:

Safe: 46 cycles
Unsafe: 40 cycles

Meanwhile the binacc operation can also be used to determine if the value can be represented by n number of bits.

0xkanekiken avatar Jun 30 '22 21:06 0xkanekiken

Thinking about it a bit more, I believe we can simplify things a bit by changing slightly how binacc works.

Assume the stack is in this state:

[0, exp, acc, a]

Applying binacc to this state could result in the follwing:

[b, exp * exp, acc * ((exp - 1) * bit + 1), a >> 1]

Where $b$ is the least significant bit of $a$. The constraint for a' = a >> 1 could be $a' \cdot 2 + b = a$ - so, we don't need to waste a division to check this.

If this approach works, we could save a couple of cycles as we only need to verify that the final value of $a$ is 0.

We could also put bit values in h0 - this way, we should save a couple of cycles on PUSH/DROP.

bobbinth avatar Jul 01 '22 05:07 bobbinth

I guess we then wouldn't need to do a binary decomposition of the value on top of the stack?

An high level implementation & AIR constraints of the above approach are as follows:

function op_Binacc(stack)
    
    exponent = stack[0]
    acc = stack[1]
    b = stack[2]
	
    bit = b & 1
    acc = acc * ((exponent - 1) * bit + 1)
    b = b >> 1
    exponent = exponent * exponent
    
    h0 = bit
    stack[0] = exponent
    stack[1] = acc
    stack[2] = b

    
function pow2(spans)
    spans.push(Pad) 
    spans.push(Incr) # acc = 1
    spans.push(Dup) 
    spans.push(Incr) # exp = 2
    spans.push(Binacc)
    spans.push(Binacc)
    spans.push(Binacc)
    spans.push(Binacc)
    spans.push(Binacc)
    spans.push(Binacc)
    spans.push(Drop)
    spans.push(Swap)
    spans.push(Eqz)
    spans.push(Assert)

It will now take 14 VM cycles with no push operation.

Binacc Air Constraints

I've adopted the following notation to describe AIR constraints: for column $z$, we represent the value in the current row & next row as $z$ and $z'$ respectively. $s_0 ... s_{15}$ are the columns representing the top $16$ slots of the stack.

The helper column value should remain binary.

$$ f_{binacc} \cdot (h_{0}^2 - h_{0}) = 0 $$

The transition constraint for exp or $s_{0}$ is as follows:

$$ f_{binacc} \cdot (s_{0}' - s_{0} * s_{0}) = 0 $$

The transition constraint for acc or $s_{1}$ is as follows:

$$ f_{binacc} \cdot (s_{1}' - s_{1} * ((s_{0} - 1) * h_{0} + 1 )) = 0 $$

The transition constraint for a or $s_{2}$ is as follows:

$$ f_{binacc} \cdot (2 \cdot s_{2}' + h_{0} - s_{2}) = 0 $$

The transition constraint for the rest of the columns whose values should remain the same are as follows:

$$ f_{binacc} \cdot (s_{i}' - s_{i}) = 0 \ \text{ for } i \in {3, .., 15} $$

0xkanekiken avatar Jul 01 '22 09:07 0xkanekiken

Looks good! By convention, instructions can access helper registers in the current cycle (rather than the next). So, we should just replace $h_0'$ with $h_0$ in the above.

bobbinth avatar Jul 01 '22 14:07 bobbinth

The task can be broken down into the following sequential parts:

  • [X] Update pow2 assembly parser to use the 38 cycle procedure.
  • [x] Remove pow2 VM operation and the associated co-processor.
  • [ ] Implement binacc (or similar) VM operation.
  • [ ] Update pow2 assembly parser with the more efficient implementation.
  • [ ] Update mdbook

0xkanekiken avatar Jul 04 '22 09:07 0xkanekiken

@bobbinth or @0xKanekiKen what's the status of this issue? The binacc PR has been open for a while. Is this change on hold?

grjte avatar Sep 13 '22 08:09 grjte

Yeah - for now I'm not sure if we need a dedicated binacc operation.

bobbinth avatar Sep 13 '22 08:09 bobbinth

Closed by #420.

0xkanekiken avatar Oct 07 '22 08:10 0xkanekiken