xous-core
General instability with Rust 1.72 + optimizations turned on
In addition to the problems seen in #416, I am finding that the system will hard-hang after being up for over a minute.
I think the problem may be related to the ticktimer implementation and/or networking: when the RTC update triggers about a minute after boot, the system hangs with the default release optimizations.
The work-around is to use opt-level = "s" in Cargo.toml. This seems to turn off whatever aggressive optimization is causing our troubles; #416 is no longer seen, and the system runs stably for longer than a minute. So 0d7601ddb84254c71b311d7e39ff4fa0f9ab72dd has been added to the current main, which will allow progress to continue with other development until the Rust issues are resolved.
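For readers who have not seen the commit, here is a minimal sketch of what such a setting could look like. It assumes the override goes in the release profile of the workspace-level Cargo.toml; the issue does not say which profile or section the commit actually touches.

```toml
# Assumed placement: the [profile.release] section of the workspace Cargo.toml.
[profile.release]
# "s" asks the compiler to optimize for binary size instead of speed.
opt-level = "s"
```

Cargo's default for release builds is opt-level = 3, so this trades the speed-oriented pipeline for the size-oriented one rather than disabling optimization entirely.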
I am personally not optimistic that this will be resolved anytime soon, because the problem is likely in the llvm + RV32 combo, which I have heard is absolutely not a priority for the LLVM team. The rumor is that LLVM is mostly funded by organizations that care about squeezing a half percent more performance out of x86-64 server code. That other projects can benefit from LLVM is accidental, and thus any bugs found on other architectures tend to not get priority.
Other notes
This might be a red herring, but the system seems to crash right around the time when this message would be printed:
ERR :xous_ticktimer: requested to wake 1 entries, which is more than the current 9 waiting entries (services\xous-ticktimer\src\main.rs:429)
INFO:dns::time: utc_time: 1695105515 (services\dns\src\time.rs:430)
INFO:dns::time: rtc_secs: 182363340 (services\dns\src\time.rs:431)
INFO:dns::time: start_tt_ms: 141755 (services\dns\src\time.rs:432)
I'm not sure what the ticktimer error is all about; maybe @xobs can shed some light on this? I have seen this before, but I thought it was mostly a harmless warning. But I could also believe that something in Rust 1.72 + optimizations has unearthed some unsound behavior, and that turning off optimizations is just hiding a bug in our code.
In fact, that would be my default assumption (our bug), except for the smoking gun in #416, where a ct_eq() call is clearly being optimized out when it should not be. At this point I really need to get on with working through the backlog of PRs instead of trying to trace down regressions due to Rust 1.72, so I'm going to let this "fix" of turning off optimizations sit.
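To illustrate why an optimized-out ct_eq() matters, here is a rough sketch of a constant-time byte comparison. This is not the code from #416: ct_eq_sketch is a hypothetical helper, and the use of core::hint::black_box is an assumption about one way to discourage the optimizer from collapsing the loop. black_box is documented as best effort only, so it is not a guarantee.

```rust
use core::hint::black_box;

/// Hypothetical helper for illustration only; not the implementation from #416.
/// Compares two byte slices without exiting early on the first mismatch.
fn ct_eq_sketch(a: &[u8], b: &[u8]) -> bool {
    // Lengths are treated as public information in this sketch.
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // OR together the XOR of every byte pair so that every byte is visited.
        // black_box is a best-effort hint that the running value should not be
        // reasoned about, which discourages a short-circuiting rewrite.
        diff = black_box(diff | (x ^ y));
    }
    // diff is zero only if every byte pair matched.
    black_box(diff) == 0
}
```

Calling ct_eq_sketch(b"secret", b"secret") returns true, and any mismatch or length difference returns false. The point of the pattern is that the result is correct either way; what an over-eager optimizer can break is the timing behavior, or in the worst case the call itself.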
Meta-reflection
This is starting to look more and more like an argument to hop off the Rust release train, because the latest releases are more and more often breaking our code (this is not the first time an llvm "upgrade" has broken something, see #285) and are adding asymptotically less quality of life to our applications. The goal of this project is quality and stability, not chasing shiny language features, and I feel like the latest Rust language trajectory might be at the point of "chasing the shiny things" more than I'm comfortable with. Furthermore, if it turns out that the LLVM team really has no incentive to ensure quality on the RV32 target, one could reasonably expect things to only get worse from here.