Nim
Nim copied to clipboard
Function-based closure iterator implementation
Summary
The current implementation of closure iterators sees the implementation use a duff's device to implement the state machine between each yield.
A more efficient approach would be to put each state in its own function and change states by assigning a function pointer (instead of an index).
Description
A nim function like:
iterator test(): int {.closure.} =
yield 0
yield 1
yield 2
currently gets transformed into a giant switch like so:
N_LIB_PRIVATE N_CLOSURE(NI, test__testit_3)(void* ClE_0) {
NI result;
tyObject_Env_testitdotnim_test___LFfJfhA7fiXxmdXYuA4CYw* colonenvP_;
{ result = (NI)0;
colonenvP_ = (tyObject_Env_testitdotnim_test___LFfJfhA7fiXxmdXYuA4CYw*) ClE_0;
while (1) {
if (!1) goto LA1;
{
switch ((*colonenvP_).colonstate_) {
case -1:
goto BeforeRet_;
case 0: goto STATE0;
case 1: goto STATE1;
case 2: goto STATE2;
case 3: goto STATE3;
}
STATE0: ;
(*colonenvP_).colonstate_ = ((NI) 1);
result = ((NI) 0);
goto BeforeRet_;
STATE1: ;
(*colonenvP_).colonstate_ = ((NI) 2);
result = ((NI) 1);
goto BeforeRet_;
STATE2: ;
(*colonenvP_).colonstate_ = ((NI) 3);
result = ((NI) 2);
goto BeforeRet_;
STATE3: ;
(*colonenvP_).colonstate_ = ((NI) -1);
goto LA2;
} LA2: ;
} LA1: ;
}BeforeRet_: ;
return result;
}
A more efficient approach would be to put the code of each state into a separate function, roughly:
test__testit_3_STATE0(...) {
...
(*colonenvP_).colonstate_ = test__testit_3_STATE1;
}
test__testit_3_STATE1(...) {..}
...
Because each state is independent, this would allow the C compiler to better analyze the function and for example perform more aggressive inlining which is more difficult when all states share a single function.
This would also open up another feature, namely that of moving variables out of the heap-based environment and putting them in the "small" local functions, where applicable (ie where a variable does not cross state boundaries).
Such small functions are also candidates for dead-code-elimination - ie if it turns out that a particular state never is used due to inlining, constant propagation and other optimization, that state can be removed entirely instead of still taking up space - this happens commonly in generic code and/or macro-generated code.
In the example, there is of course little benefit, but closure iterators are the mechanism that power chronos and other async implementations, where code is often much more complex, leading to very large state transition functions.
Alternatives
No response
Examples
No response
Backwards Compatibility
No response
Links
No response
I wanted to do that a while ago, and still think I might, but such change will potentially break existing code in 2 ways:
- Closure env
:statefield will change from int to function pointer. It's a low level abi change, that anyone will hardly notice, and whoever relies on it (like me) will likely be ok with the breakage. - Optimizing locals out of the env will break existing code that takes addresses of such locals and relies on the fact that it will live through the yield. With this optimisation it might not. This is a more critical, "user-space" breakage. We could unoptimize vars of which address is taken, but it's a hacky heuristical complication.
Well, 1. is not defined behavior, is it? ie the internal fields are internal for a reason, no?
re 2., this feels like it falls into UB territory, and is dubious at best to begin with - is it documented or just an implementation detail that some have started using? basically, the address of a local is never guaranteed to be unique or not to be overwritten, why would one expect that in a closure environment? consider that if you have a local whose liveness is bounded, one is typically allowed to reuse that memory location for a different local as long as the liveness analysis shows no overlap - there is no expectation in general that memory addresses that are not on the heap will be stable - just like you wouldn't return the address of a local from a function. Also, the addr thing, why is it hacky? It simply expands the liveness to "undefined" which pessimistically means it can be moved to the env..
- Sure
- Thinking more about it, it should be fine for the most part. Reads-into-buffers imply that the buffer is used after yield, and thus survives.
However there are still cases like this, which is on the edge of memory safety.Pointers in async are sometimes the way to go, because async doesn't allow passing by var.addranalysis is hacky because it will be done only within the closure iter, whileaddrop itself can be in the callees, and againaddrdoesn't automatically mean that it is expected to live through the yield, but alladdrswill be unfairly unoptimized. But in the end I agree that probability of old code failing is pretty low, and I would gladly neglect it.
EDIT: Memory-unsafe example is safe. So now I'm not aware of any live code that will fail.
while addr op itself can be in the callees,
I think in nim, we generally assume that if something is passed by var into a function (and not by ptr), there is no lifetime expectation of the address of the var beyond that function call, ie storing addr inside a function of a var parameter is UB beyond that one call.
Correct.
When experimenting with manual CPS, I've looked into the storage problem, it's a raw but this works, except with JS backend: https://github.com/nim-works/cps/blob/7e6ed65/talk-talk/manual1_stack.nim
- Every yield section of a CPS-ed functions or closure iterator gets it's own env object
Env_foo_0 = object
n_gensymmed: int
i_gensymmed: int
Env_foo_1 = object
n_gensymmed: int
i_gensymmed: int
Env_foo_2 = object
n_gensymmed: int
i_gensymmed: int
j_gensymmed: int
- The full state is stored in an
{.union}to guarantee that space is the biggest state used. For compat with JS this can be changed to a case object, but the states and transitions are known at compile-time so it's not really needed.
# This is an object which is large enough to hold any of the above, and is
# used for the initial allocation.
Env_foo_storage {.union.} = object
stor_Env_foo_0: Env_foo_0
stor_Env_foo_1: Env_foo_1
stor_Env_foo_2: Env_foo_2
stor_Env_foo_3: Env_foo_3
stor_Env_foo_4: Env_foo_4
stor_Env_foo_5: Env_foo_5
stor_Env_foo_6: Env_foo_6
stor_Env_foo_7: Env_foo_7
- The continuation (or closure iterator), stores the function and the envs storage
C = object
fn: proc(c: var C) {.nimcall.}
storage: Env_foo_storage
- Each segment restores env variables, does processing, and swap the function call to the next one
# Original function
proc foo(n: int) =
var i = 0
while i < n:
# sleep(1)
var j = 0
while j < n:
echo i, " ", j
# sleep(1)
inc j
inc i
var j = "done"
# sleep()
echo j
template injectVar(T, id: untyped) =
template id(): untyped = (c.storage.`stor _ T`.`id gensymmed`)
proc foo_0(c: var C) =
injectVar(Env_foo_0, n)
injectVar(Env_foo_0, i)
i = 0
c.fn = foo_1
proc foo_1(c: var C) =
injectVar(Env_foo_1, n)
injectVar(Env_foo_1, i)
if i < n:
c.fn = foo_2
noop(c)
return
c.fn = foo_6
proc foo_2(c: var C) =
injectVar(Env_foo_2, n)
injectVar(Env_foo_2, i)
injectVar(Env_foo_2, j)
j = 0
c.fn = foo_3
And interesting part is that if the continuation/closure:
- does not use ref
- does not escape its scope
You can allocated it fully on the stack (called halo - heap allocation elision in Clang++). And if you allocate it fully on the stack, the C compiler can do constant propagation and folding:
- for closure iterator chains like used with C++ ranges, D-ranges or functional programming (with map / filter / sum / ...)
- for stream processing for example when reading from a socket