Long Impulse Chain Causes Adding/Removing ProtoFlux Nodes to Hang
Describe the bug?
When creating a long chain of impulse nodes with an inter-connected graph structure, Resonite hangs for an extremely long time (sometimes up to a minute) when adding/removing nodes, as well as when spawning the object that contains such a graph.
This makes semi-complex algorithms untenable, since it becomes physically impossible to work on them.
To Reproduce
Spawn this reproduction item (84 nodes):
resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D
Under Case A
- Notice the immediate lag on spawn
- Try to add an output node
- Try to add an input node
- Try to break connections
The above example is more of a pathological case, meant to show the issue in an isolated form.
Here is the original real-world example of the same issue (harder to parse):
resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D
Under Case B
Expected behavior
The length of an impulse chain or the structure of a ProtoFlux graph should not make a significant difference in how long it takes to add/remove nodes, nor in saving/loading. Possibly this is due to some process that must complete within a single frame and takes significantly longer when the ProtoFlux graph is more complex.
In any case, building semi-complex algorithms should not freeze Resonite while being constructed.
Screenshots
No response
Resonite Version Number
2024.3.1.1178
What Platforms does this occur on?
Windows
What headset if any do you use?
Valve Index, Desktop
Log Files
Reproduction A: Long Impulse Compile Hang A2 - 2024.3.1.1178 - 2024-03-10 09_41_37.log
Reproduction B: Long Impulse Compile Hang B - 2024.3.1.1178 - 2024-03-10 09_44_43.log
Additional Context
No response
Reporters
LuxKitty
This is about ProtoFlux node group rebuild time, and as shown in the logs it takes up to 15 seconds to rebuild 1 node group in one case. This is possibly exposing an inefficiency in the node group rebuild algorithm.
I guess this is a more specific version of my #989 report 🤔
Can you provide a screenshot of what the node group looks like please?
I took screenshots of the items in the public folder for illustration purposes.
Node group A
Node group B
Thanks!
Thank you for posting those; I was just about to when I saw this in the feed. Sorry, I didn't think to include them since they were pretty long.
Also thank you for looking into this Froox!
An easy way to get this as well is by putting both outputs of a node into the next node in a chain like this:
After enough nodes any change that gets made to the chain will lag quite a lot.
This example is available here:
resrec:///U-marsmaantje/R-a41acd26-4114-4940-9947-2e1cc8d01c98
I agree, this is likely the same underlying issue as #989, @Banane9.
> An easy way to get this as well is by putting both outputs of a node into the next node in a chain like this:
Ohh, I actually have a good chunk of those in the setup from my issue too.
Perhaps that causes it to evaluate everything multiple times, despite leading to the same node? Or get stuck in some sort of loop until a depth limit? 🤔
Going off of what @marsmaantje posted, I've added a third much more reduced example to the folder resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D under Case C:
This immediately causes a ~5+ second hang with only 10 nodes.
Then a fourth even laggier one in Case D with 7 nodes which hangs for ~20 seconds:
As a temporary workaround to this issue, brecert and I found that by adding either a CallRelay or an AsyncCallRelay to your node group and having it loop back to the start of the flux (exact placement will probably vary by design and complexity), you can reduce or even completely negate the lag from adding additional nodes. This node group took 1.3006 ms to rebuild.
Be very mindful when doing this to make sure that the flux will -never- actually loop back on itself, as that will cause you to crash and lose work!
Edit: Also, upon further investigation, I discovered that if you're using async flux, you will need to connect the loop before the last StartAsyncTask used, as those seem to invalidate the call relay's effect. To expand on that a little more: basically anything similar to an AsyncCall will probably throw a wrench in things, so if you have an AsyncDynamicImpulseReceiver, make sure to add a StartAsyncTask to intercept it and connect your loop to that.
I've hidden this as outdated because I think it was more of an edge case to the problem rather than any actual solution
I looked into this by using ProtoFlux.Core directly.
The original code
The offending code is ProtoFlux.Core.ImpulseValidator.ValidateImpulseFlow(). The recursion it does to check for continuation chains is extremely suboptimal, and is in fact asymptotically exponential in the number of nodes and the number of outputs of each node connected to the next node in the chain.
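To illustrate the blowup, here's a minimal hypothetical sketch (not the actual decompiled code): if each output is followed independently with no memoization, any node reachable along multiple paths gets re-validated once per path.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in node: each continuation is an outgoing impulse wire.
class Node
{
    public List<Node> Continuations = new();
}

static class NaiveValidation
{
    // Follows every output independently with no visited set, the way the
    // current validator appears to: a shared downstream node is re-visited
    // once per distinct path leading to it, so a chain of n nodes with k
    // outputs each into the next node costs O(k^n) instead of O(n * k).
    public static long CountVisits(Node node)
    {
        long visits = 1;
        foreach (var next in node.Continuations)
            visits += CountVisits(next);
        return visits;
    }
}
```

If, say, each of the 84 nodes in the Case A item fed two outputs into the next node, that would already be on the order of 2^84 visits.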
Additionally, here's a speedscope of running the node validation function against four separate sequences of nodes with loops and without loops (you can view this in speedscope.app). Most of the time is spent in the "base" function, indicating rampant and out-of-control recursion.
The oddity?
It's a little difficult to go through the decompiled code for this function, due to the control flow getting tossed and turned (and the half-implemented nested-node code inducing more clutter), but in general, I noticed one major oddity: it tries to detect both continuation loops and sync -> async connections in the same pass. In theory, that isn't necessarily an inherently bad thing, but I find that doing it like this leads to tangled code that makes it easier to mess up the recursion. There's a lot of state kept during the recursive part.
I can't point to any specific call or part and say "this is the cause", though, because I didn't read it too hard.
My fix
I decided to separate the checking for continuation loops from the checking for sync->async connections, to focus on one thing at a time. I think it led to a more understandable approach that was also easy to implement in linear time.
The testing grounds for my re-implementation of the validation are here. You can run it with dotnet run -c release validator. It goes through a simulated sequence of 1-8 sequence nodes, each with 1-8 outputs going into the next node in the sequence.
Observe the stark runtime difference at the higher end of this log output. My approach runs in linear time and barely breaks 1 ms for a long while. In one case, I observed that 500 sequence nodes, each with 500 outputs all plugged into the next node, took only 120 ms to validate; a memoized DFS visits each of the roughly 500 × 500 = 250,000 edges once, whereas naive path enumeration would visit on the order of 500^500 partial paths. That would be several heat-death-of-the-universe levels of time in the current ProtoFlux.Core implementation.
I'm not going to go through the step-by-step algorithm here (I'm sure you can read code), but I will give a high-level overview of my solution; a rough sketch follows the list:
- Think of the node group as a directed graph where IOperations are vertices and the possible continuations from said IOperations are the edges.
- Do a DFS of the graph to check for continuation loops.
- During this DFS, add IOperations to a set if they "continue" towards an async input, and add all non-null Call or SyncResumption impulses to another set denoting "entry points", to check if they connect to an async input later on.
- After the DFS, do a pass over all the aforementioned "entry points" and see if any of the continuations out of them are part of the continues-towards-async-input set. If so, then a sync start is connected to an async node and is thus invalid.
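Here's a minimal sketch of that two-pass shape against simplified stand-in types; the real code operates on ProtoFlux.Core's IOperations and their impulses, so treat this as an outline of the idea rather than my actual implementation:

```csharp
using System.Collections.Generic;

// Simplified stand-in for an IOperation vertex in the continuation graph.
class Op
{
    public List<Op> Continuations = new();  // outgoing edges
    public bool HasAsyncInput;              // this operation's input is async
    public bool IsSyncEntryPoint;           // non-null Call/SyncResumption impulse
}

static class SketchValidator
{
    public static bool Validate(IEnumerable<Op> allOps)
    {
        var state = new Dictionary<Op, bool>(); // false = on stack, true = done
        var reachesAsync = new HashSet<Op>();   // continues towards an async input
        var entryPoints = new List<Op>();

        // DFS that detects loops and fills reachesAsync in one linear walk.
        bool Dfs(Op op)
        {
            if (state.TryGetValue(op, out var done))
                return done; // still on the stack means a continuation loop
            state[op] = false;
            bool async = op.HasAsyncInput;
            foreach (var next in op.Continuations)
            {
                if (!Dfs(next)) return false;
                if (reachesAsync.Contains(next)) async = true;
            }
            if (async) reachesAsync.Add(op);
            state[op] = true;
            return true;
        }

        foreach (var op in allOps)
        {
            if (op.IsSyncEntryPoint) entryPoints.Add(op);
            if (!Dfs(op)) return false; // continuation loop: invalid
        }

        // Second pass: no sync entry point may continue towards an async input.
        foreach (var entry in entryPoints)
            foreach (var next in entry.Continuations)
                if (reachesAsync.Contains(next)) return false;

        return true;
    }
}
```

Every vertex and edge is visited at most once, which is where the linear-time behavior comes from.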
My solution does not take into account impulse imports/exports, since those are part of the half-implemented nested-nodes code and not currently useful to Resonite, but they could probably be implemented in a similar manner to how the async reachability is kept track of.
Additionally, I made a mod that implements the same algorithm and can actually be viewed in-game. Here's a video of it in action:
https://github.com/user-attachments/assets/687fe4bb-0293-454e-b504-a8f6c4213cdb
The first 20 seconds show the vanilla behavior before switching to my version, then show off that there's no more hitching. The rest of the video shows a few minor things that indicate behavior consistent with the current validator. Not shown is me also testing some PossibleContinuations nodes (Stopwatch, etc.) and getting the expected behavior matching vanilla. It also shows that my implementation indirectly fixes #2750.
To my knowledge, my current implementation (as of v1.0.1) matches the vanilla validator almost exactly, with the only inconsistency being that a chain of continuations going into an async node will not get flagged as invalid unless a sync "entry point" is actually connected to the chain. I feel this is better than the current behavior, where any continuation chain of more than one node feeding into an async input will be flagged as invalid no matter what, even when there is no entry point to the continuations.
Let me know if you have any questions/concerns/etc. with my approach.
Amazing, great work yosh :D
Seeing long chains of flow nodes, like DynVar assignments, is very common, and I think that reworking the validation code will reduce hitching even further when spawning objects or when players join a session. Thank you for investing your time in this!
I wonder if it would be possible to formally prove the validator algorithm, and how difficult this would be.
> I wonder if it would be possible to formally prove the validator algorithm, and how difficult this would be.
fundamentally, it boils down to a pretty simple graph problem, but what the graph actually is isn't what you'd expect at first
certain input IOperations to some nodes (e.g. Stopwatch) can have the PossibleContinuations attribute applied to them, which basically lists the continuation paths that are possible with the given input. as such, the input impulses to each node are the vertices of the graph, rather than the nodes themselves.
I didn't concretely realize this until waking up this morning, and now that I did, I was able to heavily simplify the algorithm with the help of two helper methods:
- GetImpulses(this NodeRuntime) enumerates over all the IOperations in a given NodeRuntime (a.k.a. the vertices of the graph)
- GetPossibleContinuations(this IOperation) enumerates all the possible continuation IOperations from the given input IOperation (a.k.a. the edges of a given vertex)
this makes it effectively a bog-standard graph loop detection problem that also keeps track of which operation paths continue into an async input, which is then used to check whether any sync entry points are plugged into something that continues to an async input.
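with those two helpers, the loop check collapses to textbook three-color DFS. this is just a sketch against stand-in interfaces (in my implementation the two enumerators are extension methods, and the real types live in ProtoFlux.Core):

```csharp
using System.Collections.Generic;

// stand-in shapes for the sketch; the real GetImpulses/GetPossibleContinuations
// are extension methods rather than interface members.
interface IOperation { IEnumerable<IOperation> GetPossibleContinuations(); }
interface NodeRuntime { IEnumerable<IOperation> GetImpulses(); }

static class LoopCheck
{
    enum Color { Unvisited, OnStack, Done }

    public static bool HasContinuationLoop(NodeRuntime runtime)
    {
        var color = new Dictionary<IOperation, Color>();

        bool Visit(IOperation op)
        {
            color.TryGetValue(op, out var c);    // missing entry = Unvisited
            if (c == Color.OnStack) return true; // back edge: loop found
            if (c == Color.Done) return false;   // already fully explored
            color[op] = Color.OnStack;
            foreach (var next in op.GetPossibleContinuations())
                if (Visit(next)) return true;
            color[op] = Color.Done;
            return false;
        }

        // every input impulse is a vertex; each possible continuation is an edge
        foreach (var op in runtime.GetImpulses())
            if (Visit(op)) return true;
        return false;
    }
}
```

the async-reachability bookkeeping from the earlier sketch slots into this same walk.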
if we assume that all the ProtoFlux API calls return their expected results, it can most likely be proven fairly easily that the algorithm works
Simplifying the complex problem to a known standard programming problem is very nice and replaces a lot of custom code with a well-known algorithm, which is a lot more maintainable. Nice.