
Long Impulse Chain Causes Adding/Removing ProtoFlux Nodes to Hang

Open • lxw404 opened this issue Mar 10 '24 • 16 comments

Describe the bug?

When creating a long chain of impulse nodes which have inter-connected graph structure, Resonite hangs for an extremely long time (up to a minute sometimes) when adding/removing additional nodes, as well as spawning the object that contains such a graph.

This makes semi-complex algorithms untenable, since it becomes physically impossible to work on them.

To Reproduce

Spawn this reproduction item (84 nodes), found under Case A: resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D

  • Notice the immediate lag on spawn
  • Try to add an output node
  • Try to add an input node
  • Try to break connections

The above example is more of a pathological case to try to show the issue in an isolated form.

Here is the original real-world example of the same issue (harder to parse), found under Case B: resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D

Expected behavior

The length of an impulse chain or the structure of a ProtoFlux graph should not make a significant difference in how long it takes to add nodes to or remove nodes from the graph, nor in how long saving/loading takes. Possibly this is due to some process that must complete within a single frame and takes significantly longer when the ProtoFlux graph is more complex.

In any case, building semi-complex algorithms should not freeze Resonite while being constructed.

Screenshots

No response

Resonite Version Number

2024.3.1.1178

What Platforms does this occur on?

Windows

What headset if any do you use?

Valve Index, Desktop

Log Files

Reproduction A: Long Impulse Compile Hang A2 - 2024.3.1.1178 - 2024-03-10 09_41_37.log

Reproduction B: Long Impulse Compile Hang B - 2024.3.1.1178 - 2024-03-10 09_44_43.log

Additional Context

No response

Reporters

LuxKitty

lxw404 • Mar 10 '24

This is about ProtoFlux node group rebuild time, and as shown in the logs it takes up to 15 seconds to rebuild 1 node group in one case. This is possibly exposing an inefficiency in the node group rebuild algorithm.

Nytra • Mar 10 '24

I guess this is a more specific version of my #989 report 🤔

Banane9 • Mar 10 '24

Can you provide a screenshot of what the node group looks like please?

Frooxius • Mar 11 '24

I took screenshots of the items in the public folder for illustration purposes.

Node group A: (image)

Node group B: (image)

DoubleStyx • Mar 11 '24

Thanks!

Frooxius • Mar 11 '24

Thank you for posting those, I was just about to when I saw this in the feed. Sorry I didn't think to include them since they were pretty long.

Also thank you for looking into this Froox!

lxw404 • Mar 11 '24

An easy way to get this as well is by putting both outputs of a node into the next node in the chain, like this: (image) After enough nodes, any change made to the chain will lag quite a lot.

This example is available here: resrec:///U-marsmaantje/R-a41acd26-4114-4940-9947-2e1cc8d01c98

marsmaantje • Mar 11 '24

I agree, this is likely the same underlying issue as #989, @Banane9.

LexiBasilisk • Mar 12 '24

An easy way to get this as well is by putting both outputs of a node into the next in a chain like this:

Ohh, I actually have a good chunk of those in the setup from my issue too.

Perhaps that causes it to evaluate everything multiple times, despite leading to the same node? Or get stuck in some sort of loop until a depth limit? 🤔

Banane9 • Mar 12 '24

Going off of what @marsmaantje posted, I've added a third much more reduced example to the folder resrec:///U-LuxKitty/R-08D6E662E744592499BFF88483CB3A6E931230638A07A3BFA9AE5DE4A745A25D under Case C:

(image: Resonite_hE54CxnUvj)

This immediately causes a ~5+ second hang with only 10 nodes.

Then a fourth even laggier one in Case D with 7 nodes which hangs for ~20 seconds:

(image: Resonite_c5s5edxLZN)

lxw404 • Mar 12 '24

As a temporary workaround to this issue, brecert and I found that by adding either a CallRelay or AsyncCallRelay to your node group and having it loop back to the start of the flux (exact placement will probably vary by design and complexity), you can reduce or even completely negate the lag from adding additional nodes. This node group took 1.3006 ms to rebuild:

(image)

Be very mindful when doing this to make sure that the flux will -never- actually loop back on itself, as that will cause you to crash and lose work!

Edit: Upon further investigation, I also discovered that if you are using async flux, you will need to connect the loop before the last StartAsyncTask used, as those seem to invalidate the call relay's effect. To expand on that a little more: basically anything similar to an AsyncCall will probably throw a wrench in things, so if you have an AsyncDynamicImpulseReceiver, make sure to add a StartAsyncTask to intercept it and connect your loop to that.

I've hidden this as outdated because I think it was more of an edge case of the problem rather than an actual solution.

Dusty-Sprinkles • Mar 11 '25

I looked into this by using ProtoFlux.Core directly.

The original code

The offending code is ProtoFlux.Core.ImpulseValidator.ValidateImpulseFlow(). The recursion that's done to check for continuation chains is extremely suboptimal, and is in fact asymptotically exponential in the number of nodes and the number of outputs each node has connected to the next node in the chain.
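To make the shape of the blowup concrete, here is a hedged toy model in C# (the Node type and the CountVisits walk below are illustrative stand-ins, not the actual ProtoFlux.Core code). It shows how a walk that recurses into every continuation, without remembering what it has already validated, re-traverses the shared tail of the chain over and over:

```csharp
// Hypothetical toy model (not the actual ProtoFlux.Core code) of why unmemoized
// recursion over every continuation explodes. With a chain of N nodes that each
// wire K outputs into the next node, the walk re-validates the shared tail of
// the chain repeatedly, visiting roughly K^N operations in total.
using System;
using System.Collections.Generic;

class Node
{
    public List<Node> Continuations { get; } = new List<Node>();
}

static class Demo
{
    // Counts how many node visits the naive walk performs.
    static long CountVisits(Node node)
    {
        long visits = 1;
        foreach (var next in node.Continuations)   // K edges, all pointing at the same next node
            visits += CountVisits(next);           // re-walks the entire shared sub-chain
        return visits;
    }

    static void Main()
    {
        const int outputsPerNode = 6;
        for (int length = 1; length <= 9; length++)
        {
            // Build a chain where every node plugs all of its outputs into the next node.
            Node head = new Node(), current = head;
            for (int i = 1; i < length; i++)
            {
                var next = new Node();
                for (int k = 0; k < outputsPerNode; k++)
                    current.Continuations.Add(next);
                current = next;
            }
            Console.WriteLine($"{length} nodes -> {CountVisits(head)} visits");
        }
    }
}
```

The printed visit counts grow by a factor of outputsPerNode for every node added to the chain, which mirrors the exponential fits below.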

Here's an exponential fit of the validation time of a sequence chain of 9 nodes as the number of outputs of each node increases

And here's an exponential fit of the validation time where each sequence node has 6 outputs and the number of nodes increases

Additionally, here's a speedscope profile of running the node validation function against four separate sequences of nodes, with and without loops (you can view it in speedscope.app). Most of the time is spent in the "base" function, indicating rampant, out-of-control recursion.

The oddity?

It's a little difficult to go through the decompiled code for this function, since the control flow gets tossed and turned (and the half-implemented nested-node code adds more clutter), but in general I noticed one major oddity: it tries to detect both continuation loops and sync -> async connections in the same pass. In theory that isn't necessarily an inherently bad thing, but in practice it leads to tangled code and recursion that is easier to get wrong. There's a lot of state kept around during the recursive part.

I can't point to any specific call or part and say "this is the cause", though, because I didn't read it too closely.

My fix

I decided to separate the checking for continuation loops from the checking for sync -> async connections, to focus on one thing at a time. I think it led to a more understandable approach that was also easy to implement in linear time.

The testing grounds for my re-implementation of validation is here. You can run it with dotnet run -c release validator. It will go through a simulated sequence of 1-8 sequence nodes each with 1-8 outputs per node going into the next chain in the sequence.

Observe the stark runtime difference at the higher end of this log output. My approach runs in linear time and barely breaks 1 ms for a long time. In one case, I observed that 500 sequence nodes, each with 500 outputs all plugged into the next node, took only 120 ms to validate. This would take several heat-death-of-the-universe levels of time in the current ProtoFlux.Core implementation.

I'm not going to go through the algorithm step by step here (I'm sure you can read code), but I will give a high-level overview of my solution; a rough sketch in code follows the list:

  1. Think of the node group as a directed graph where IOperations are vertices and the possible continuations from said IOperation are the edges.
  2. Do a DFS of the graph to check for continuation loops
    • During this DFS, add IOperations to a set if they "continue" towards an async input, and add all non-null Call or SyncResumption impulses to another set denoting "entry points" to check if they connect to an async input later on
  3. After the DFS, do a pass of all the aforementioned "entry points" and see if any of the continuations out of them are part of the continues-towards-async-input set. If so, then a sync start is connected to an async node and is thus invalid.
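Here is a minimal, self-contained sketch of that high-level approach, assuming simplified stand-in types: Op, IsAsyncInput, IsSyncEntryPoint, and Continuations are illustrative placeholders rather than the actual ProtoFlux.Core types, and the real validator in the linked repository differs in detail.

```csharp
// Minimal sketch of the approach above, over stand-in types. "Op", "IsAsyncInput",
// "IsSyncEntryPoint", and "Continuations" are illustrative placeholders, not the
// actual ProtoFlux.Core types; the real implementation differs in detail.
using System;
using System.Collections.Generic;

class Op
{
    public bool IsAsyncInput;                        // the operation sits behind an async input
    public bool IsSyncEntryPoint;                    // a non-null Call / SyncResumption impulse
    public List<Op> Continuations = new List<Op>();  // possible continuations out of this op
}

static class ValidatorSketch
{
    enum Mark { Unvisited, InProgress, Done }

    // Returns null when the group is valid, otherwise a description of the problem.
    public static string Validate(IEnumerable<Op> allOps)
    {
        var mark = new Dictionary<Op, Mark>();
        var reachesAsync = new HashSet<Op>();   // ops whose continuations lead to an async input
        var syncEntries = new List<Op>();

        bool Dfs(Op op)
        {
            mark.TryGetValue(op, out var state);
            if (state == Mark.InProgress) return false;  // back edge => continuation loop
            if (state == Mark.Done) return true;         // already processed; keeps the pass linear

            mark[op] = Mark.InProgress;
            bool leadsToAsync = op.IsAsyncInput;
            foreach (var next in op.Continuations)
            {
                if (!Dfs(next)) return false;
                leadsToAsync |= reachesAsync.Contains(next);
            }
            if (leadsToAsync) reachesAsync.Add(op);
            mark[op] = Mark.Done;
            return true;
        }

        foreach (var op in allOps)
        {
            if (op.IsSyncEntryPoint) syncEntries.Add(op);
            if (!Dfs(op)) return "continuation loop detected";
        }

        // Second pass: no sync entry point may continue into an async input.
        foreach (var entry in syncEntries)
            foreach (var next in entry.Continuations)
                if (reachesAsync.Contains(next))
                    return "sync entry point continues into an async input";

        return null;
    }

    static void Main()
    {
        // Two sync ops feeding into an async op: the sync entry point makes this invalid.
        var asyncOp = new Op { IsAsyncInput = true };
        var second  = new Op { Continuations = { asyncOp } };
        var first   = new Op { IsSyncEntryPoint = true, Continuations = { second } };
        Console.WriteLine(Validate(new[] { first, second, asyncOp }) ?? "valid");
    }
}
```

Because every operation is marked Done after its first visit, each vertex and edge is processed once, which is what keeps the whole validation linear in the size of the graph.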

My solution does not take impulse imports/exports into account, since those are part of the half-implemented nested-node code and not currently useful to Resonite, but they could probably be handled in a similar manner to how the async reachability is tracked.

Additionally, I made a mod that implements the same algorithm and can actually be viewed in-game. Here's a video of it in action:

https://github.com/user-attachments/assets/687fe4bb-0293-454e-b504-a8f6c4213cdb

The first 20 seconds show the vanilla behavior before switching to my version, which no longer hitches. The rest of the video shows a few minor things that indicate behavior consistent with the current validator. Not shown is me also testing some PossibleContinuations nodes (stopwatch, etc.) and getting the expected behavior matching vanilla. The video also shows that my implementation indirectly fixes #2750.

To my knowledge, my current implementation (as of v1.0.1) matches the vanilla validator almost exactly, with the only inconsistency being that a chain of continuations going into an async node will not get flagged as invalid unless a sync "entry point" is specifically added to the chain. I feel this is better behavior than the current one, where any continuation chain of more than one node feeding into an async input is flagged as invalid no matter what, even when there is no entry point connected to the continuations.

Let me know if you have any questions/concerns/etc. with my approach.

yoshiyoshyosh • Nov 26 '25

Amazing, great work yosh :D

Banane9 • Nov 26 '25

Seeing long chains of flow nodes, like DynVar assignments, is very common, and I think that reworking the validation code will reduce hitching even further when spawning objects or when players join a session. Thank you for investing your time in this!

I wonder if it would be possible to formally prove the validator algorithm, and how difficult this would be.

ColinTimBarndt • Nov 26 '25

I wonder if it would be possible to formally prove the validator algorithm, and how difficult this would be.

fundamentally, it boils down to a pretty simple graph problem, but what the graph actually is isn't what you'd expect at first

certain IOperations tied to some node inputs (e.g. Stopwatch) can have the PossibleContinuations attribute applied to them, which basically lists the continuation paths that are possible from the given input. as such, the input impulses to each node are the vertices of the graph, rather than the nodes themselves.

I didn't concretely realize this until waking up this morning, and now that I did, I was able to heavily simplify the algorithm with the help of two helper methods:

  • GetImpulses(this NodeRuntime) enumerates over all the IOperations in a given NodeRuntime (a.k.a. vertices of the graph)
  • GetPossibleContinuations(this IOperation) enumerates all the possible continuation IOperations from the given input IOperation (a.k.a. the edges of a given vertex)

this makes it effectively a bog-standard graph loop-detection problem that also keeps track of which operation paths continue into an async input, which is then used to check whether any sync entry points are plugged into something that continues into an async input.
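to illustrate just the graph abstraction (not the full validator), here's a rough sketch with simplified stand-in interfaces; the real ProtoFlux.Core NodeRuntime/IOperation types and the two extension methods described above have different shapes:

```csharp
// hypothetical, simplified interfaces (the real ProtoFlux.Core NodeRuntime /
// IOperation types and the helper extension methods have different shapes)
using System.Collections.Generic;

interface IOperation
{
    // edges of the graph: every continuation that PossibleContinuations allows out of this input
    IEnumerable<IOperation> GetPossibleContinuations();
}

interface INodeRuntime
{
    // vertices of the graph: the input impulses (IOperations) of this node, not the node itself
    IEnumerable<IOperation> GetImpulses();
}

static class GroupGraph
{
    // with the two enumerators above, a node group is just a plain directed graph,
    // so an off-the-shelf DFS loop check can walk it directly
    public static IEnumerable<IOperation> Vertices(IEnumerable<INodeRuntime> group)
    {
        foreach (var runtime in group)
            foreach (var op in runtime.GetImpulses())
                yield return op;
    }

    public static IEnumerable<IOperation> Edges(IOperation vertex) =>
        vertex.GetPossibleContinuations();
}
```

with vertices and edges exposed like that, the loop check itself is just the textbook DFS from the earlier sketch, plus the extra reaches-async bookkeeping.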

if we assume that all the ProtoFlux API calls return their expected results, it can most likely easily be proven that the algorithm works

yoshiyoshyosh • Nov 27 '25

Simplifying the complex problem down to a standard, well-known programming problem replaces a lot of custom code with a well-known algorithm, which is a lot more maintainable. Nice.

ColinTimBarndt • Nov 27 '25