# WIP: Wide int ergonomics
Moved from PR https://github.com/algorand/pyteal/pull/541. For issue https://github.com/algorand/pyteal/issues/246
## Context
- I tried to fix https://github.com/algorand/pyteal/pull/236 in https://github.com/algorand/pyteal/pull/541, but found the result very different from what I wanted, so I created a new branch instead of reverting all of its commits.
- This PR is based on https://github.com/algorand/pyteal/pull/543, after which `addw`, `expw`, `mulw`, `divmodw` should be implemented with `MultiValue`.
- This PR focuses on ergonomic features like operator overloading, `WideSummation` and `WideRatio` (names TBD).
- I'm trying to finish this ASAP, with tests, as our team needs it urgently.
## Tasks
- [x] Add a new class `WideInt` similar to `MultiValue` and use it in "wideexpr" to substitute `MultiValue`. For a variable `w: WideInt`, `w.high()` loads the high part of the wide integer from scratch space, `w.low()` the low part. `WideInt` has 3 constructor overloads (see next chapter)
- [x] Update `WideExpr` to return a `WideInt` (in scratch space)
- [x] Operator Overloading II: Use `+` (`__pos__`) to convert an `Int` to a `WideInt` (with high word = 0)
- [ ] Convert / cast a `WideInt` to `Int`. Fail when it has a non-zero high word. This is a shortcut for `w.low.load() if w.high.load() == 0 else ERROR`
- [ ] More operator overloading for `Int OP WideInt` and `WideInt OP WideInt`, where `OP` can be `+`, `-`, `*`, etc. This was done here and I'll update them (TBD) if allowed.
- [ ] Add ergonomics `ModW`, `MinusW`, `WideAdd`, `WideMul`, `WideSummation` and `WideRatio` to process calculations between `WideInt`s
- [ ] Docstrings with small examples for the docs (I think the docs are auto-generated from the docstrings)
- [ ] An example in "/examples"
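As a toy illustration of the `__pos__` widening and the checked narrowing cast proposed above (plain Python, not PyTeal; the `U64`/`U128` names and `to_u64` helper are made up for this sketch):

```python
# Toy model of the proposed widening/narrowing, in plain Python.
# U64 / U128 are hypothetical stand-ins for PyTeal's Int / WideInt.

class U128:
    def __init__(self, high: int, low: int):
        self.high, self.low = high, low

    def to_u64(self) -> "U64":
        # Cast back to 64 bits; fail on a non-zero high word,
        # mirroring the proposed WideInt -> Int conversion.
        if self.high != 0:
            raise OverflowError("high word is non-zero")
        return U64(self.low)

class U64:
    def __init__(self, value: int):
        self.value = value

    def __pos__(self) -> U128:
        # `+x` widens: high word = 0, low word = the original value.
        return U128(0, self.value)

w = +U64(7)  # widen with unary plus
assert (w.high, w.low) == (0, 7)
assert w.to_u64().value == 7
```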
## WideInt constructors
```python
WideInt(num: int) -> WideInt
WideInt(expr_high: Expr, expr_low: Expr) -> WideInt
WideInt(expr_high: ScratchSlot, expr_low: ScratchSlot) -> WideInt
```
Create a new `WideInt`.

- For an `int`: pass the int as the only argument. For example, `WideInt((1 << 64) + 1)`.
- For two `Expr` objects: pass two `Expr`s. They must be of type `TealType.uint64`. For example, `WideInt(Int(1), Int(1))`.
- For two `ScratchSlot` objects: pass two `ScratchSlot`s. For example, `WideInt(ScratchSlot(), ScratchSlot())`.
## To be decided
- Are the names `WideInt`, `WideExpr`, `WideAdd`, `WideMul` good? Alternatives are `WideUint64`, `MultiExpr`, `AddWW`, `MulWW`, respectively.
- Is using `+` (`__pos__`) to convert an `Int` to a `WideInt` (high word = 0) good practice?
- Too much constructor overloading? (explained below)
- Should I change `DivW` so it accepts `WideInt` params? (I think not; add a `WideDiv` instead)
- Should I touch the code here to perform the universal operator overloading?
  https://github.com/algorand/pyteal/blob/53c8e00cce2c84114333458fc4a6e002bb884696/pyteal/ast/expr.py#L72-L75
## Too much constructor overloading
In the last commit a11490a, I created a class `WideInt` which can be constructed from `Int` or from `Expr, Expr`.
However, when moving to the next task, where I want `mulw` to return a `WideInt`, I saw that I might need a new overload taking two scratch slots.
I think this is too much overloading and makes the code hard to understand, but it seems a must if we want to use `MultiValue`.
My alternative was to use `@classmethod`s like `fromInt`, `fromExpr`, `fromExprExpr`, etc. to construct a `WideInt`, but that makes the creation process harder. I don't have a satisfying solution... Please help 🥲
Should I use multiple dispatch (`@dispatch`) in this case?
I found an ugly solution with plum 🤪 which passes pytest but fails flake8. I'll try to move back to `@overload`.
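For reference, a stdlib-only sketch of the `@overload`-plus-`isinstance` pattern (the `Expr` stand-in is made up for illustration): the typed stubs document the call shapes for type checkers, while a single real `__init__` dispatches at runtime.

```python
from typing import overload

class Expr:  # hypothetical stand-in for pyteal.Expr
    pass

class WideInt:
    @overload
    def __init__(self, num: int) -> None: ...
    @overload
    def __init__(self, expr_high: Expr, expr_low: Expr) -> None: ...

    def __init__(self, *args):
        # Single runtime implementation; the @overload stubs above
        # exist only for type checkers and are never called.
        if len(args) == 1 and isinstance(args[0], int):
            self.high, self.low = args[0] >> 64, args[0] & ((1 << 64) - 1)
        elif len(args) == 2 and all(isinstance(a, Expr) for a in args):
            self.high, self.low = args
        else:
            raise TypeError("expected (int) or (Expr, Expr)")

w = WideInt((1 << 64) + 5)
assert (w.high, w.low) == (1, 5)
```

The downside, as noted above, is that each new shape (e.g. two `ScratchSlot`s) grows the `if`/`elif` chain, which is exactly the readability concern here.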
I agree on `__invert__` being used by `BitwiseNot`. I now use `__pos__` instead, to convert uint64 to uint128.
I'm sorry that this WIP PR is progressing slowly due to the inactive support. Our team has decided to pause here. The unfinished part is not that hard with the ideas below in mind. I hope someone can finish it, or I will come back to finish it (much) later. I can offer a VS Code Live Share session when help is needed.
- Assume that we always start with an empty stack and a bunch of `WideInt`s (scratched variables). To save "load / store" costs, we only load what's needed, so we should use more `TealSimpleBlock`s. I suggest creating a new class taking `__teal__` as a parameter of `__init__`.
- Notations: denote the list of `WideInt`s for summation/product as `W1, W2, ..., Wn`; their high words are `h1, h2, ..., hn` and their low words are `l1, l2, ..., ln`. Denote `S_k` as the partial sum of the first `k` terms. `.low` and `.high` access the corresponding part of a `WideInt`: `hk = Wk.high`, `lk = Wk.low`. Let `M = 2**64`; then `W1 = h1*M + l1, W2 = h2*M + l2, ..., Wn = hn*M + ln`.
- An `IntSum: *Ints => WideInt`. Denote `N` as the length of the input. Run in each iteration `addw` then `uncover {N-1}`. Starting with the stack `i1, i2, ..., iN`, after `k` iterations we have the stack `i_(k+2), i_(k+3), ..., i_N, carrier2, carrier3, ..., carrier_(k+1), S_(k+1).low`. After `(N-1)` iterations the stack becomes `carrier2, carrier3, ..., carrierN, S.low`. Store `S.low`, then run `add` `(N-2)` times to get `S.high`. Store `S.high` and we have both parts of the final sum `S`.
- In an N-ary `WideSum`, we can add the low words together and the high words together; it should have a different `__teal__` than a chain of binary `WideSum`s. The N-ary `S = WideSum(W1, W2, ..., Wn)` can be split into 2 parts:
  a. Load `l1, l2, ..., ln` onto the stack. Perform `L: WideInt = IntSum(l1, l2, ..., ln)`. Now we have the low word of the final result: `S.low = L.low`. Store it and leave `L.high` on the stack.
  b. Load `h1, h2, ..., hn` onto the stack. Perform `H: WideInt = IntSum(L.high, h1, h2, ..., hn)`. Overflow if `H.high != 0`. Now we have the high word of the final result: `S.high = H.low`.
- For products, denote `P_k` or `Pk` as the partial product of the first `k` factors, and `i_k` or `ik` as the `k`-th `Int` (factor).
- N-ary `IntProd`: use a loop of `multiplyFactors` from "widemath.py".
- Binary `WideProd` produces the product of two `WideInt`s, and only one of them can have a non-zero high word (otherwise the product overflows a 128-bit uint). So we can cast the one with high word == 0 to an `Int`, then use `multiplyFactors` from "widemath.py": `[..., A*C + highword(B*C), lowword(B*C)]`
- To perform an N-ary `WideProd`, at most 1 of the multiplicands can have a non-zero high word (overflow otherwise). Say it's `W1`. Then the product `P` is:
$$
\begin{aligned}
P &= W_1\cdot W_2\cdot \ldots \cdot W_n \\
  &= (h_1\cdot M+l_1) \cdot (0\cdot M+l_2) \cdot \ldots \cdot (0\cdot M+l_n) \\
  &= (h_1\cdot l_2\cdot l_3\cdot \ldots \cdot l_n) \cdot M + (l_1\cdot l_2\cdot l_3\cdot \ldots \cdot l_n) \\
  &= h_1\cdot (l_2\cdot l_3\cdot \ldots \cdot l_n) \cdot M + l_1\cdot (l_2\cdot l_3\cdot \ldots \cdot l_n)
\end{aligned}
$$
If we regard `(l2*l3*...*ln) == (W2*W3*...*Wn)` as `H`, and `W = W1`, the algorithm here is the same as for `W*H`.
So we only need `IntProd` and a binary `WideProd` to implement the N-ary `WideProd`.
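The carry bookkeeping in the `IntSum` description above can be sanity-checked with a plain-Python model (helper names are mine; `addw` is modeled with `divmod`, not actual TEAL):

```python
M = 1 << 64  # the M = 2**64 from the notation above

def addw(a: int, b: int) -> tuple[int, int]:
    """Model of TEAL's addw opcode: returns (carry, low) of a + b."""
    return divmod(a + b, M)

def int_sum(ints: list[int]) -> tuple[int, int]:
    """Model of the proposed IntSum: *Ints => WideInt.
    Keep a running low word and collect one carrier per addw; the
    sum of the carriers is the high word (each carrier is 0 or 1)."""
    low = ints[0]
    carriers = []
    for x in ints[1:]:
        carry, low = addw(low, x)
        carriers.append(carry)
    # Equivalent of running `add` (N-2) times on the stored carriers.
    return sum(carriers), low

vals = [M - 1, M - 2, 5]  # two carries occur along the way
high, low = int_sum(vals)
assert high * M + low == sum(vals)
```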
Just chiming in to say this is really important work; there are many scenarios in which you cannot safely multiply two integers on Algorand, since e.g. base units are capped at the entire 64 bits, so the lack of proper 128-bit support shoe-horns you into using byte arrays for everything (or writing raw TEAL), even when they're otherwise overkill. For my use case it was enough to create my own `Expr` subclasses, but it certainly is not an ideal developer experience to need to do such things.
@zejoeh Thanks for your appreciation, although this is not the priority of the team right now (see https://github.com/algorand/pyteal/pull/543#issuecomment-1260032243). I tried to achieve the purpose you describe with this PR (didn't finish it due to time limitations).
Doing it quick and dirty, I've reduced the decimals on everything so 128-bit uints are not so heavily required. But still, I had to do similar things, like writing some `Expr` or `Seq`, to get the functionality I needed.
Speaking of `Seq`, it occurs to me that more ergonomic features are needed, like the following `CachedSeq`. I'm thinking of merging them into pyteal-utils or into this repo after this intensive period.
```python
class CachedSeq:
    cached: bool

    @overload
    def __init__(self, *exprs: Expr) -> None:  # overload_1
        pass

    @overload
    def __init__(self, exprs: list[Expr]) -> None:  # overload_2
        pass

    def __init__(self, *exprs):  # type: ignore : type copied from seq.py
        if len(exprs) == 1 and isinstance(exprs[0], list):
            exprs = exprs[0]
        self.expr = Seq(*exprs)
        self.cache_scratch = ScratchVar(self.expr.type_of())
        self.cached = False

    def load(self):
        if not self.cached:
            self.cached = True
            return Seq(self.cache_scratch.store(self.expr), self.cache_scratch.load())
        return self.cache_scratch.load()
```
@PabloLION Since you are in control of decimals, maybe you have a similar situation to mine. I use the following much-less-general solution, but given the amount of work you've put into this, I'm sure you already have something similar or better for what you need:
```python
class MulwDivw(pt.Expr):
    """Wide multiplication of two 64 bit uints a and b followed by wide division by a 64 bit uint c."""

    def __init__(self, a: pt.Expr, b: pt.Expr, c: pt.Expr):
        super().__init__()
        self.a = a
        self.b = b
        self.c = c

    def __teal__(self, options: "CompileOptions"):
        # mulw of a and b
        mulw_start, mulw_end = pt.TealSimpleBlock.FromOp(options, pt.TealOp(self, pt.Op.mulw), self.a, self.b)
        # divw of result and c
        divw_start, divw_end = pt.TealSimpleBlock.FromOp(options, pt.TealOp(self, pt.Op.divw), self.c)
        mulw_end.setNextBlock(divw_start)
        return mulw_start, divw_end

    def __str__(self):
        return f"(MulwDivw ({self.a},{self.b},{self.c}))"

    def type_of(self):
        return pt.TealType.uint64

    def has_return(self) -> bool:
        return False
```
For me this was enough to replace a lot of byte array operations in intermediary computations, after verifying that the results all still fit in 64 bits, without having to handle the multiple-return-values problem. Using `c = 2**64 - 1`, it also lets you anticipate and handle overflow at runtime with some branching.
Of course, with only this I still need to make sure my domain is otherwise bounded by 64 bits (or use byte arrays where it can't be). It would be great to have native 16-byte (128-bit) and 32-byte (256-bit) arrays instead - a man can dream. 😄
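To pin down the semantics, here is a plain-Python model of what `MulwDivw` computes (the helper name is mine): `mulw` produces the full 128-bit product, and `divw` divides it by `c`, failing if `c == 0` or the quotient overflows 64 bits.

```python
M = 1 << 64

def mulw_divw(a: int, b: int, c: int) -> int:
    """Model of mulw followed by divw: (a * b) // c computed on the
    full 128-bit product, with divw's failure conditions."""
    if c == 0:
        raise ZeroDivisionError("divw fails if C == 0")
    q = (a * b) // c  # Python ints are arbitrary precision
    if q >= M:
        raise OverflowError("divw fails if the result overflows uint64")
    return q

# a * b would overflow uint64, but the final quotient fits:
assert mulw_divw(1 << 40, 1 << 40, 1 << 30) == 1 << 50
```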
@zejoeh I see this is correct and useful, because `divw` will conveniently "Fail if C == 0 or if result overflows". And actually I don't have a similar implementation yet, although this is my most-wanted functionality. Thanks for sharing. But on one occasion I still need 128-bit add, so I'm still planning to finish this later. The difficulty, to be honest, is allowing users to use a 128-bit uint variable at any time, so scratch slots seem inevitable to me.
I was implementing `WideAdd` (two 128-bit uints => one 128-bit uint) with a lot of scratch slots. After seeing your code, I realized my method may be far less performant than yours.
With my plan, the same calculation would use `WideMul` then `WideDiv`. The generated TEAL would first convert two 64-bit uints to two 128-bit ones with high word == 0 (opcodes: `int 0` ×2, `store` ×4), then load them (op cost 4) and run a 5-op-cost method between the two 128-bit values. That sums to 15 op cost, and the divide procedure should be another ~15 op cost, totalling about 30.
In comparison, your `MulwDivw` has an op cost of 5. Pessimistically speaking, seeing your code added more tasks to this PR 🤪
Also FYI, it's already possible to do 512-bit calculations with byteslice arithmetic (docs at https://pyteal.readthedocs.io/en/stable/arithmetic_expression.html#byteslice-arithmetic).