nex-go icon indicating copy to clipboard operation
nex-go copied to clipboard

Revert connection state checks

Open jonbarrow opened this issue 1 year ago • 5 comments
trafficstars

Draft for now since there are still unsolved issues.

Reverts the ConnectionState checks on the server. This seems to only be validated on the client. This was a flawed check and introduced some connection issues. The server would update the the clients connection state at times when it could not verify that the action happened successfully. Namely in the SYN and CONNECT packet handlers. It's possible for the SYN ACK or CONNECT ACK to get dropped and never reach the client, resulting in the client trying to resend those packets. However the server has already updated the clients connection state to go to the next step, and would drop any incoming packets resent from the previous state, dropping an otherwise good connection.

This issue occured several times for me during local testing, and last night @ashquarky confirmed that removing these checks seemed to clear up some connection issues for them and many others when done in prod.

This is also not accurate emulation. As seen by this screenshot from the Nintendo Network Friends server, the server seems to always respond to packets regardless of the clients connection state. Which seems to imply it doesn't track it at all. In this case the SYN packet was resent and resulted in 2 SYN ACK packets from the server.

Screenshot_from_2024-04-23_15-35-42

However removing these checks alone has introduced a regression it seems. I have hot-patched these changes onto the live friends server and the logs now show many RMC Message has unexpected size errors. I believe this is due to the clients RMC ciphers not being reset correctly, but I'm unsure?

In the SYN packet handler the PRUDPConnection.reset function is called, which runs pc.slidingWindows.Clear. This SHOULD be removing all the clients RMC ciphers, but clearly something is still going wrong? Due to this regression, this has been marked as a draft for now.

jonbarrow avatar Apr 27 '24 14:04 jonbarrow

I'm not keen with removing all the connection checks. Considering this only an issue on the SYN stage, can't we just only skip the checks there?

DaniElectra avatar Apr 27 '24 15:04 DaniElectra

can't we just only skip the checks there?

The issue isn't limited to just the SYN. The server also sets the clients connection state to StateConnected after it sends the CONNECT ACK, but this can also fail. The CONNECT ACK can fail to reach the client, making it resend the CONNECT, but the server then rejects it due to the state being out of sync. And if we removed the check on both the SYN and CONNECT, then why bother checking at all?

Also if I remember correctly the point of the connection state checks in the first place was to remove an old hack related to resetting the clients data on the SYN packet, but we still do this anyway. The old hack reset the client on SYN and forced there to be 1 substream to prevent a panic, but the new implementation uses the SlidingWindow and those are created on the fly if they don't exist (removing the need for the "always create a substream" hack)

Also even if we just removed the SYN check we'd still run into the RMC Message has unexpected size error, since the cipher state seems to not be getting reset correctly. Looking at the code again it looks like this is because the clients StreamSettings is not being reset (which means that the StreamSettings.EncryptionAlgorithm.Decrypt and StreamSettings.EncryptionAlgorithm.Encrypt functions have out-of-date state)

It also just doesn't seem to be accurate emulation on our part

jonbarrow avatar Apr 27 '24 15:04 jonbarrow

Also it's worth noting that the hot-patch on the friends server only removes the state check from:

  • handleAcknowledgment
  • handleSyn
  • handleConnect

handleData still has the check, and the RMC Message has unexpected size error still occurs.

I also hot-patched in pc.StreamSettings = pc.endpoint.DefaultStreamSettings.Copy() in PRUDPConnection.reset, as I assumed the error was happening due to bad stream settings state. But this also did not seem to help, the error still occurs.

I cannot replicate the RMC Message has unexpected size error locally, but these changes clearly cause it. Before the hot patched changes that error only occurred ~200 times, and after applying the changes it has occurred over 15,000 times

jonbarrow avatar Apr 27 '24 19:04 jonbarrow

Might I suggest hotpatching the unexpected size error (here and here) to actually print the size, then letting the server run a bit? Something like:

if stream.Remaining() != uint64(length) {
	return fmt.Errorf("RMC Message has unexpected size (%v instead of %v)", length, stream.Remaining())
}

This would confirm whether the issue is cipher-related as suspected, or if it's actually some other root cause (e.g. the length may be off by a few bytes, or the stream may have been unexpectedly consumed/truncated).

ashquarky avatar May 05 '24 06:05 ashquarky

This would confirm whether the issue is cipher-related as suspected, or if it's actually some other root cause (e.g. the length may be off by a few bytes, or the stream may have been unexpectedly consumed/truncated).

If it was an issue like that then we would have expected to see this error in logs already at the same frequency

Before this revert that error WAS in the server logs, but only ~200 times. After making these changes, it jumped to ~15,000

So it's definitely related to these changes, and the only real time that would happen is if the payload didn't decrypt properly

Also jsut letting these payloads pass through would almost certainly be a much bigger headache to deal with. If the sizes do not match then the RMC payload is bad, and trying to pass it through any further logic will almost definitely have it run into more issues (ranging from much more cluttered logs, as the expected data would not be found, to possible panics as the data itself is invalid)

jonbarrow avatar May 05 '24 12:05 jonbarrow

@DaniElectra As a compromise for removing the state checks, rather than checking "is your state equal", what if we checked "should you be able to do this at all"?

Meaning, rather than rejecting a SYN packet once connected, we just reject packets we know would be impossible in this stage of the connection.

So that means:

  • If StateNotConnected, only allow SYN
  • If StateConnecting, allow both SYN and CONNECT
  • If StateConnected, allow all packet types

This would still allow us to drop DATA packets from connections we've timed out on our end, while still allowing for the client to resend any state-changing packets it may need to

jonbarrow avatar May 17 '24 17:05 jonbarrow

That sounds good to me 👍

DaniElectra avatar May 17 '24 17:05 DaniElectra

Closing in favor of #65

jonbarrow avatar Jun 07 '24 21:06 jonbarrow