document futures size explosion pitfall and mitigations
As explained e.g. here, futures can easily turn into unpleasantly surprising memory hogs.
One easy way to investigate problematic code is to configure Clippy's `large_futures` lint with `future-size-threshold = 100` (or whichever threshold seems reasonable).
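For reference, a minimal sketch of that setup, assuming Clippy's `large_futures` lint (which reads `future-size-threshold` from `clippy.toml` and fires at `.await` sites of oversized futures):

```rust
// clippy.toml (next to Cargo.toml):
//
//     future-size-threshold = 100
//
// The lint is not in Clippy's default warn set, so enable it explicitly:
#![warn(clippy::large_futures)]

async fn big(_arg: [u8; 1024]) {}

async fn caller() {
    big([0u8; 1024]).await; // Clippy should warn here: future > 100 bytes
}

fn main() {
    let _ = caller(); // build (but don't poll) the future
}
```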
Further information: withoutboats: Futures and Segmented Stacks
We would welcome a PR explaining this. Please feel free to write it if you have the time.
I don't feel qualified enough to do a detailed writeup, so I would rather see somebody else spearhead this. If nobody else takes it up, I can try to come up with 1-2 paragraphs…
Some more information via a discussion on Matrix. Thanks to @fg-cfh for digging these out:
- An `async` block occupies twice the size of its captures (causing explosion in size).
- `async fn` doubles argument size.
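A small illustration of the doubling effect (a sketch; exact sizes vary by compiler version, but `size_of_val` on the unpolled future shows the shape):

```rust
async fn yield_once() {}

async fn doubles(buf: [u8; 1024]) {
    yield_once().await;
    let _ = buf; // live across the await: stored once as the captured
                 // argument and again as a body local
}

fn main() {
    // Prints roughly 2 * 1024 bytes, plus discriminant/state overhead.
    println!("{}", std::mem::size_of_val(&doubles([0u8; 1024])));
}
```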
They (@fg-cfh) looked into this more deeply and discovered:
Looking at the MIR reveals the problem: the optimizer gets no chance, as unneeded state is kept live on purpose due to conservative assumptions that are required to preserve soundness. Basically the desugaring assumes that you're generating and escaping pointers to each and every local variable and argument, so they cannot be optimized away. Getting rid of that assumption would require the compiler to acquire a much deeper understanding of the poll functions than it is currently able to.
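A contrived sketch of the kind of code behind that assumption: once a local's address has been taken, the desugaring pessimistically treats it as live across later await points.

```rust
async fn send(_c: u8) {}

async fn example() {
    let scratch = [0u8; 256];
    // Taking a raw pointer makes the analysis assume `scratch` may escape...
    let first = unsafe { *scratch.as_ptr() };
    send(first).await; // ...so its 256 bytes may be kept in the future's
                       // layout even though `scratch` is never touched again.
}

fn main() {
    // The printed size may include the whole `scratch` buffer.
    println!("{}", std::mem::size_of_val(&example()));
}
```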
More from that discussion, quoting and paraphrasing @fg-cfh:
- Useful for figuring out the biggest offenders:
  `cargo +nightly rustc --release -p ... --target ... --bin ... -- -Zprint-type-sizes`
- Potential mitigations: hand-optimize futures as much as possible (see the sketch after this list):
  - remove arguments
  - make state static
  - turn async blocks into futures built from combinators.
- "users should also be warned that async blocks will blow up code size as they usually land in .text rather than .bss. As they contain so much (uninitialized) bloat that should actually be in .bss, this will occupy much more flash space than a comparable hand-coded state machine."
This blog post, linked from somewhere in the issues, seems like a good description of the problem, I think.
This seems like a fundamentally serious and hard problem with async/await. To be honest, I'm not sure why I've never heard of it, as it seems like a really important caveat not just in embedded, but even for big-computer programs. Call stacks can easily get quite deep, and if the future roughly quadruples at each level (2x from the async block's captures, 2x from the arguments, per the issues above), ten levels of nesting gives $4^{10}$ bytes, i.e. 1 MB: a big number even when you have GB of memory lying around. It seems like memcopying this data around internally would make the program run like molasses and completely destroy the cache.
Apparently, most of the time the compiler fixes this by inlining `.await` and optimizing the result. I don't know whether this happens if you compile with the debug profile, though, which is pretty common in embedded when trying to debug something. There is also a way to get rid of this in many cases by replacing `async fn` with a function that explicitly returns a future, but this seems like a pretty awful thing to put on program developers.
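A sketch of that workaround, under the assumption that the argument is moved straight into an `async move` block (so it is captured once rather than stored as both upvar and body local):

```rust
use std::future::Future;

// Instead of `async fn process(buf: [u8; 512]) -> u8 { ... }`:
fn process(buf: [u8; 512]) -> impl Future<Output = u8> {
    async move {
        // Only this block's captures determine the future's size.
        buf[0]
    }
}

fn main() {
    println!("{}", std::mem::size_of_val(&process([0u8; 512])));
}
```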
There are a bunch of good-looking proposals to fix this in the compiler, but for some reason I don't understand, it doesn't seem to be a priority. The issue has been known since before async/await was stabilized, so it's not great that it can still occur now. If nothing else, it would make a good GSoC project for next summer if no one has fixed it by then. ([Narrator]: It had not been fixed by then.)
I think a single paragraph might be needed here somewhere, describing the problem and referencing a bunch of the links for the details. It is indeed a hard paragraph to write, and I don't feel confident enough with async/await to do it either.
We discussed this issue in the Rust Embedded Working Group meeting today, and are trying to figure out how to proceed. It's a tough one.
I have caught up with the discussion in Matrix. I think what @adamgreig suggested makes sense: provide a brief note for now, and expand on it later. My suggestion would be to open a separate issue for the "brief for now" approach, because the current issue contains a collection of information relevant to a longer writeup.
I guess I don't quite know what to put in the brief note. "On rare occasions you may find that your embedded program runs out of memory because of a misfeature of the current implementation of Rust async/await that can cause ridiculously large structures to be created as the result of [???]. You can quickly recognize this situation by [???], and deal with it by [???]." If I could fill in the missing pieces here, it would make sense to put that in. As it is…
> I'm not sure why I've never heard of it.
You're not the only one. A warning in the embedded book would have saved me some headache.
> I guess I don't quite know what to put in the brief note.
I suggest that we mention the tooling that allows developers to observe the impact of using async/await. That's what would have helped me (typical invocations are sketched after this list):
- `cargo size`
- `cargo bloat`
- `-Zprint-type-sizes` together with `top-type-sizes`.
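Typical invocations, sketched under the assumption that `cargo-binutils`, `cargo-bloat`, and `top-type-sizes` are installed (check each tool's documentation for the authoritative flags):

```console
$ cargo size --release -- -A        # section sizes, via cargo-binutils
$ cargo bloat --release -n 10       # biggest functions, via cargo-bloat
$ RUSTFLAGS=-Zprint-type-sizes cargo +nightly build --release > sizes.txt
$ top-type-sizes < sizes.txt        # sort/filter rustc's type-size report
```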
Maybe this writeup merits a link from the embedded book (plain futures):
https://rust-lang.github.io/wg-async/vision/submitted_stories/status_quo/barbara_carefully_dismisses_embedded_future.html
Among other observations, it points out that Aaron Turon's "futures are zero-cost" claim relates only to the time impact of hand-written futures and futures-rs combinators, not to their space impact or to async/await blocks. I admit that I misunderstood his blog post, too.
The blog post you cited might also be linked (async/await).
Finally, I suggest we link to the following issue search:
https://github.com/rust-lang/rust/issues?q=state%3Aopen%20label%3A%22A-async-await%22%20label%3AI-heavy
> Apparently most of the time the compiler fixes this by inlining `.await` and optimizing the result.
I cannot confirm this from my own experiments.
In our case, desugaring an async call hierarchy results in nested enums (state machine state) + nested function calls (nested match-blocks on those enums + pre/post sync code).
LLVM is able to inline the nested match blocks but not nested enums. IIUC, from LLVM's viewpoint those enums are just arbitrary data structures with a fixed memory layout that it is not entitled to tamper with.
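A hand-sketched approximation of that shape (illustrative field names, not the compiler's actual output):

```rust
#![allow(dead_code)]

// Each await of an inner future embeds that future's entire state machine
// as a field of the outer one, discriminants and all.
enum InnerState {
    Start { arg: [u8; 64] },
    AwaitingIo { arg: [u8; 64], buf: [u8; 64] }, // locals live across awaits
    Done,
}

enum OuterState {
    Start { arg: [u8; 64] },
    AwaitingInner { arg: [u8; 64], inner: InnerState }, // nests the inner enum
    Done,
}

fn main() {}
```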
I gained the following insights from `-Zprint-type-sizes`:
- Even in `--release` mode, each async nesting level seems to add an additional enum nesting level, capturing a copy of each local upvar at the MIR level (no matter whether that upvar is later inlined by LLVM, required across the first await point, already captured further up in the call stack, or even used at all, including `[&[mut]] self`).
- Structures (including `self` passed by value) seem to be captured wholly, not at field level or as a reference. Typical optimizations like instantiating a result struct one level up don't seem to apply to async code, AFAICT.
- Upvars and local variables actually used in the function body across await points will then be captured again, according to the usual closure capturing rules.
- Exponential blowup stems from the fact that the discriminant + upvars (= args) + captured vars from the next lower level are recursively added as a sub-field to the enum one level up, even if only a single `.await` point exists one level down.
- For a generic library like ours, hand-written futures are not an option: you lose all the benefits of `impl Future` and end up with as many generic arguments as you have futures.
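To illustrate the last point with a hypothetical trait: with `impl Future` the future types stay anonymous, while hand-written futures force a named type per async method that every implementor and every bound must spell out.

```rust
use core::future::Future;

// With return-position impl Trait in traits (stable since Rust 1.75),
// the future type stays anonymous:
trait Radio {
    fn send(&mut self, frame: &[u8]) -> impl Future<Output = ()>;
}

// Hand-written futures instead surface as one named (associated) type per
// method, which leaks into the whole API:
trait RadioHandRolled {
    type SendFuture<'a>: Future<Output = ()>
    where
        Self: 'a;

    fn send<'a>(&'a mut self, frame: &'a [u8]) -> Self::SendFuture<'a>;
}

fn main() {}
```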
Example: In our case, passing a seemingly harmless radio task (<30 bytes) down to an inner async function means keeping a copy of it at each nesting level. This, plus discriminants and a few local variables derived from it, adds up to hundreds of redundant bytes in the resulting future, although the base object (`self`) is a ZST.
Those bytes are then multiplied by six (due to an intended 2x3 generic cross-product of types) and as such fully weigh into .data and .text rather than .bss. This is the result after applying several rounds of optimization (which already gave us a 70% reduction in stack usage). Those optimizations hurt maintainability quite a bit, though.
IIRC, the officially recommended Tokio workaround is to systematically `Box` large futures along the call stack. That obviously isn't an option in a no-alloc crate. One commenter went so far as to recommend not nesting async functions/blocks at all.
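For completeness, a sketch of that boxing pattern (it requires an allocator, hence it's unusable in no-alloc crates):

```rust
async fn yield_once() {}

async fn inner() {
    let buf = [0u8; 1024];
    yield_once().await;
    let _ = buf;
}

async fn outer() {
    // Boxing moves the inner state machine to the heap, so the outer
    // future stores only a pointer instead of embedding `inner`'s state.
    Box::pin(inner()).await;
}

fn main() {
    // `outer`'s future is now small regardless of `inner`'s size.
    println!("{}", std::mem::size_of_val(&outer()));
}
```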
I haven't made up my mind yet, but we're now re-implementing everything without async so we can benchmark the two solutions against each other. Then we'll make a decision.
Update: I want to say this clearly: most of this hurts applications much less than a generic library like ours. As the author of a highly space- and time-sensitive library, I want zero-cost abstractions as much as possible. As an application author, I'd be satisfied with "good enough". So this doesn't invalidate most async embedded use cases, of course.
Thanks much for the huge and detailed analysis! I will think on this and talk to some key people — there's a lot going on here.