# WIP: Wide int ergonomics
Moved from PR https://github.com/algorand/pyteal/pull/541. For issue https://github.com/algorand/pyteal/issues/246
## Context
- I tried to fix https://github.com/algorand/pyteal/pull/236 in https://github.com/algorand/pyteal/pull/541, but found the result very different from what I wanted, so I created a new branch instead of reverting all of its commits.
- This PR is based on https://github.com/algorand/pyteal/pull/543, after which `addw`, `expw`, `mulw`, `divmodw` should be implemented with `MultiValue`.
- This PR focuses on ergonomic features like operator overloading, `WideSummation` and `WideRatio` (names TBD).
- I'm trying to finish this ASAP, with tests, as our team needs it urgently.
## Tasks
- [x] Add a new class `WideInt` similar to `MultiValue` and use it in "wideexpr" to substitute `MultiValue`. For a variable `w: WideInt`, `w.high()` loads the high part of the wide integer from scratch space, `w.low()` the low part. `WideInt` has 3 constructor overloads (see next chapter)
- [x] Update `WideExpr` to return a `WideInt` (in scratch space)
- [x] Operator Overloading II: Use `+` (`__pos__`) to convert an `Int` to a `WideInt` (with high word = 0)
- [ ] Convert / cast a `WideInt` to `Int`. Fail when it has a non-zero high word. This is a shortcut for `w.low.load() if w.high.load() == 0 else ERROR`
- [ ] More operator overloading for `Int OP WideInt` and `WideInt OP WideInt`, where `OP` can be `+`, `-`, `*`, etc. This was done here and I'll update them (TBD) if allowed.
- [ ] Add ergonomics `ModW`, `MinusW`, `WideAdd`, `WideMul`, `WideSummation` and `WideRatio` to process calculations between `WideInt`s
- [ ] Docstrings with small examples for the docs (I think the docs are auto-generated from the docstrings)
- [ ] An example in "/examples"
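As a toy illustration of the `__pos__` widening and the checked narrowing cast proposed above (plain Python, not PyTeal; the `U64`/`U128` names and `to_u64` helper are made up for this sketch):

```python
# Toy model of the proposed widening/narrowing, in plain Python.
# U64 / U128 are hypothetical stand-ins for PyTeal's Int / WideInt.

class U128:
    def __init__(self, high: int, low: int):
        self.high, self.low = high, low

    def to_u64(self) -> "U64":
        # Cast back to 64 bits; fail on a non-zero high word,
        # mirroring the proposed WideInt -> Int conversion.
        if self.high != 0:
            raise OverflowError("high word is non-zero")
        return U64(self.low)

class U64:
    def __init__(self, value: int):
        self.value = value

    def __pos__(self) -> U128:
        # `+x` widens: high word = 0, low word = the original value.
        return U128(0, self.value)

w = +U64(7)  # widen with unary plus
assert (w.high, w.low) == (0, 7)
assert w.to_u64().value == 7
```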
## WideInt constructors
```python
WideInt(num: int) -> WideInt
WideInt(expr_high: Expr, expr_low: Expr) -> WideInt
WideInt(expr_high: ScratchSlot, expr_low: ScratchSlot) -> WideInt
```
Create a new `WideInt`.

- For an `int`: pass the int as the only argument. For example, `WideInt((1 << 64) + 1)`.
- For two `Expr` objects: pass two `Expr`s. They must be of type `TealType.uint64`. For example, `WideInt(Int(1), Int(1))`.
- For two `ScratchSlot` objects: pass two `ScratchSlot`s. For example, `WideInt(ScratchSlot(), ScratchSlot())`.
## To be decided
- Are the names `WideInt`, `WideExpr`, `WideAdd`, `WideMul` good? Alternatives are `WideUint64`, `MultiExpr`, `AddWW`, `MulWW`, respectively.
- Is using `+` (`__pos__`) to convert an `Int` to a `WideInt` (high word = 0) good practice?
- Too much constructor overloading? (explained below)
- Should I change `DivW` so it accepts `WideInt` params? (I think not; add a `WideDiv` instead)
- Should I touch the code here to perform the universal operator overloading?
  https://github.com/algorand/pyteal/blob/53c8e00cce2c84114333458fc4a6e002bb884696/pyteal/ast/expr.py#L72-L75
## Too much constructor overloading
In the last commit a11490a, I created a class `WideInt` which can be constructed from `Int` or from `Expr, Expr`.
However, when moving to the next task, where I want `mulw` to return a `WideInt`, I saw that I might need a new overload taking two scratch slots.
I think this is too much overloading and makes the code hard to understand, but it seems a must if we want to use `MultiValue`.
My alternative was to use `@classmethod`s like `fromInt`, `fromExpr`, `fromExprExpr`, etc. to construct a `WideInt`, but that makes the creation process harder. I don't have a satisfying solution... Please help 🥲
Should I use multiple dispatch (`@dispatch`) in this case?
I found an ugly solution with plum 🤪 which passes pytest but fails flake8. I'll try to move back to `@overload`.
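For reference, a stdlib-only sketch of the `@overload`-plus-`isinstance` pattern (the `Expr` stand-in is made up for illustration): the typed stubs document the call shapes for type checkers, while a single real `__init__` dispatches at runtime.

```python
from typing import overload

class Expr:  # hypothetical stand-in for pyteal.Expr
    pass

class WideInt:
    @overload
    def __init__(self, num: int) -> None: ...
    @overload
    def __init__(self, expr_high: Expr, expr_low: Expr) -> None: ...

    def __init__(self, *args):
        # Single runtime implementation; the @overload stubs above
        # exist only for type checkers and are never called.
        if len(args) == 1 and isinstance(args[0], int):
            self.high, self.low = args[0] >> 64, args[0] & ((1 << 64) - 1)
        elif len(args) == 2 and all(isinstance(a, Expr) for a in args):
            self.high, self.low = args
        else:
            raise TypeError("expected (int) or (Expr, Expr)")

w = WideInt((1 << 64) + 5)
assert (w.high, w.low) == (1, 5)
```

The downside, as noted above, is that each new shape (e.g. two `ScratchSlot`s) grows the `if`/`elif` chain, which is exactly the readability concern here.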
I agree on `__invert__` being used by `BitwiseNot`. I now use `__pos__` instead, to convert uint64 to uint128.
I'm sorry that this WIP PR is progressing slowly due to the inactive support. Our team has decided to pause here. The unfinished part is not that hard with the ideas below in mind. I hope someone can finish it, or I will come back to finish it (much) later. I can offer a VS Code Live Share session when help is needed.
- Assume that we always start with an empty stack and a bunch of `WideInt`s (scratched variables). To save "load / store" costs, we only load what's needed, so we should use more `TealSimpleBlock`s. I suggest creating a new class taking `__teal__` as a parameter of `__init__`.
- Notations: denote the list of `WideInt`s for summation/product as `W1, W2, ..., Wn`; their high words are `h1, h2, ..., hn` and their low words are `l1, l2, ..., ln`. Denote `S_k` as the partial sum of the first `k` terms. `.low` and `.high` access the corresponding part of a `WideInt`: `hk = Wk.high`, `lk = Wk.low`. Let `M = 2**64`; then `W1 = h1*M + l1, W2 = h2*M + l2, ..., Wn = hn*M + ln`.
- An `IntSum: *Ints => WideInt`. Denote `N` as the length of the input. Run in each iteration `addw` then `uncover {N-1}`. Starting with the stack `i1, i2, ..., iN`, after `k` iterations we have the stack `i_(k+2), i_(k+3), ..., i_N, carrier2, carrier3, ..., carrier_(k+1), S_(k+1).low`. After `(N-1)` iterations the stack becomes `carrier2, carrier3, ..., carrierN, S.low`. Store `S.low`, then run `add` `(N-2)` times to get `S.high`. Store `S.high` and we have both parts of the final sum `S`.
- In an N-ary `WideSum`, we can add the low words together and the high words together; it should have a different `__teal__` than a chain of binary `WideSum`s. The N-ary `S = WideSum(W1, W2, ..., Wn)` can be split into 2 parts:
  a. Load `l1, l2, ..., ln` onto the stack. Perform `L: WideInt = IntSum(l1, l2, ..., ln)`. Now we have the low word of the final result: `S.low = L.low`. Store it and leave `L.high` on the stack.
  b. Load `h1, h2, ..., hn` onto the stack. Perform `H: WideInt = IntSum(L.high, h1, h2, ..., hn)`. Overflow if `H.high != 0`. Now we have the high word of the final result: `S.high = H.low`.
- For products, denote `P_k` or `Pk` as the partial product of the first `k` factors, and `i_k` or `ik` as the `k`-th `Int` (factor).
- N-ary `IntProd`: use a loop of `multiplyFactors` from "widemath.py".
- Binary `WideProd` produces the product of two `WideInt`s, and only one of them can have a non-zero high word (otherwise the product overflows a 128-bit uint). So we can cast the one with high word == 0 to an `Int`, then use `multiplyFactors` from "widemath.py": `[..., A*C + highword(B*C), lowword(B*C)]`
- To perform an N-ary `WideProd`, at most 1 of the multiplicands can have a non-zero high word (overflow otherwise). Say it's `W1`. Then the product `P` is:
$$
\begin{aligned}
P &= W_1\cdot W_2\cdot \ldots \cdot W_n \\
  &= (h_1\cdot M+l_1) \cdot (0\cdot M+l_2) \cdot \ldots \cdot (0\cdot M+l_n) \\
  &= (h_1\cdot l_2\cdot l_3\cdot \ldots \cdot l_n) \cdot M + (l_1\cdot l_2\cdot l_3\cdot \ldots \cdot l_n) \\
  &= h_1\cdot (l_2\cdot l_3\cdot \ldots \cdot l_n) \cdot M + l_1\cdot (l_2\cdot l_3\cdot \ldots \cdot l_n)
\end{aligned}
$$
If we regard `(l2*l3*...*ln) == (W2*W3*...*Wn)` as `H`, and `W = W1`, the algorithm here is the same as for `W*H`.
So we only need `IntProd` and a binary `WideProd` to implement the N-ary `WideProd`.
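The carry bookkeeping in the `IntSum` description above can be sanity-checked with a plain-Python model (helper names are mine; `addw` is modeled with `divmod`, not actual TEAL):

```python
M = 1 << 64  # the M = 2**64 from the notation above

def addw(a: int, b: int) -> tuple[int, int]:
    """Model of TEAL's addw opcode: returns (carry, low) of a + b."""
    return divmod(a + b, M)

def int_sum(ints: list[int]) -> tuple[int, int]:
    """Model of the proposed IntSum: *Ints => WideInt.
    Keep a running low word and collect one carrier per addw; the
    sum of the carriers is the high word (each carrier is 0 or 1)."""
    low = ints[0]
    carriers = []
    for x in ints[1:]:
        carry, low = addw(low, x)
        carriers.append(carry)
    # Equivalent of running `add` (N-2) times on the stored carriers.
    return sum(carriers), low

vals = [M - 1, M - 2, 5]  # two carries occur along the way
high, low = int_sum(vals)
assert high * M + low == sum(vals)
```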
Just chiming in to say this is really important work; there are many scenarios in which you cannot safely multiply two integers on Algorand, since e.g. base units are capped at the entire 64 bits, so the lack of proper 128-bit support shoe-horns you into using byte arrays for everything (or writing raw TEAL), even when they're otherwise overkill. For my use case it was enough to create my own `Expr` subclasses, but it certainly is not an ideal developer experience to need to do such things.
@zejoeh Thanks for your appreciation, although this is not the priority of the team right now (see https://github.com/algorand/pyteal/pull/543#issuecomment-1260032243). I tried to achieve the purpose you describe with this PR (didn't finish it due to time limitations).
Doing it quick and dirty, I've reduced the decimals on everything so 128-bit uints are not so heavily required. But still, I had to do similar things, like writing some `Expr` or `Seq`, to get the functionality I needed.
Speaking of `Seq`, it occurs to me that more ergonomic features are needed, like the following `CachedSeq`. I'm thinking of merging them into pyteal-utils or into this repo after this intensive period.
```python
class CachedSeq:
    cached: bool

    @overload
    def __init__(self, *exprs: Expr) -> None:  # overload_1
        pass

    @overload
    def __init__(self, exprs: list[Expr]) -> None:  # overload_2
        pass

    def __init__(self, *exprs):  # type: ignore : type copied from seq.py
        if len(exprs) == 1 and isinstance(exprs[0], list):
            exprs = exprs[0]
        self.expr = Seq(*exprs)
        self.cache_scratch = ScratchVar(self.expr.type_of())
        self.cached = False

    def load(self):
        if not self.cached:
            self.cached = True
            return Seq(self.cache_scratch.store(self.expr), self.cache_scratch.load())
        return self.cache_scratch.load()
```
@PabloLION Since you are in control of decimals, maybe you have a similar situation to mine. I use the following much-less-general solution, but given the amount of work you've put into this, I'm sure you already have something similar or better for what you need:
```python
class MulwDivw(pt.Expr):
    """Wide multiplication of two 64 bit uints a and b followed by wide division by a 64 bit uint c."""

    def __init__(self, a: pt.Expr, b: pt.Expr, c: pt.Expr):
        super().__init__()
        self.a = a
        self.b = b
        self.c = c

    def __teal__(self, options: "CompileOptions"):
        # mulw of a and b
        mulw_start, mulw_end = pt.TealSimpleBlock.FromOp(options, pt.TealOp(self, pt.Op.mulw), self.a, self.b)
        # divw of result and c
        divw_start, divw_end = pt.TealSimpleBlock.FromOp(options, pt.TealOp(self, pt.Op.divw), self.c)
        mulw_end.setNextBlock(divw_start)
        return mulw_start, divw_end

    def __str__(self):
        return f"(MulwDivw ({self.a},{self.b},{self.c}))"

    def type_of(self):
        return pt.TealType.uint64

    def has_return(self) -> bool:
        return False
```
For me this was enough to replace a lot of byte array operations in intermediary computations, after verifying that the results all still fit in 64 bits, without having to handle the multiple-return-values problem. Using `c = 2**64 - 1`, it also lets you anticipate and handle overflow at runtime with some branching.
Of course, with only this I still need to make sure my domain is otherwise bounded by 64 bits (or use byte arrays where it can't be). It would be great to have native 16-byte (128-bit) and 32-byte (256-bit) arrays instead - a man can dream. 😄
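To pin down the semantics, here is a plain-Python model of what `MulwDivw` computes (the helper name is mine): `mulw` produces the full 128-bit product, and `divw` divides it by `c`, failing if `c == 0` or the quotient overflows 64 bits.

```python
M = 1 << 64

def mulw_divw(a: int, b: int, c: int) -> int:
    """Model of mulw followed by divw: (a * b) // c computed on the
    full 128-bit product, with divw's failure conditions."""
    if c == 0:
        raise ZeroDivisionError("divw fails if C == 0")
    q = (a * b) // c  # Python ints are arbitrary precision
    if q >= M:
        raise OverflowError("divw fails if the result overflows uint64")
    return q

# a * b would overflow uint64, but the final quotient fits:
assert mulw_divw(1 << 40, 1 << 40, 1 << 30) == 1 << 50
```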
@zejoeh I see this is correct and useful, because `divw` will conveniently "Fail if C == 0 or if result overflows". And actually I don't have a similar implementation yet, although this is my most-wanted functionality. Thanks for sharing. But on one occasion I still need 128-bit add, so I'm still planning to finish this later. The difficulty, to be honest, is allowing users to use a 128-bit uint variable at any time, so scratch slots seem inevitable to me.
I was implementing `WideAdd` (two 128-bit uints => one 128-bit uint) with a lot of scratch slots. After seeing your code, I realized my method may be far less performant than yours.
With my plan, the same calculation would use `WideMul` then `WideDiv`. The generated TEAL would first convert two 64-bit uints to two 128-bit ones with high word == 0 (opcodes: `int 0` ×2, `store` ×4), then load them (op cost 4) and run a 5-op-cost method between the two 128-bit values. That sums to 15 op cost, and the divide procedure should be another ~15 op cost, totalling about 30.
In comparison, your `MulwDivw` has an op cost of 5. Pessimistically speaking, seeing your code added more tasks to this PR 🤪
Also FYI, it's already possible to do 512-bit calculations with byteslice arithmetic (docs at https://pyteal.readthedocs.io/en/stable/arithmetic_expression.html#byteslice-arithmetic).