nex-go
Revert connection state checks
Draft for now since there are still unsolved issues.
Reverts the ConnectionState checks on the server. This state seems to only be validated on the client. The check was flawed and introduced connection issues: the server would update the client's connection state at times when it could not verify that the action actually succeeded, namely in the SYN and CONNECT packet handlers. It's possible for the SYN ACK or CONNECT ACK to get dropped and never reach the client, causing the client to resend those packets. However, the server has already advanced the client's connection state to the next step and would drop any packets resent from the previous state, killing an otherwise good connection.
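To make the failure mode concrete, here is a rough sketch of the pattern being reverted. This is NOT the actual nex-go handler; `sendSynAck` and the exact signatures are placeholders, only the `ConnectionState` values come from this discussion:

```go
// Sketch of the reverted pattern, not the real nex-go code.
func handleSyn(conn *PRUDPConnection, packet PRUDPPacket) {
	// The reverted check: if the SYN ACK was lost, the client resends its SYN,
	// but the server has already moved on to StateConnecting and drops the
	// resend here, stalling an otherwise good connection.
	if conn.ConnectionState != StateNotConnected {
		return
	}

	conn.reset()                           // wipe per-connection state
	conn.ConnectionState = StateConnecting // advanced with no proof the client saw the ACK
	sendSynAck(conn, packet)               // placeholder; this ACK may never arrive
}
```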
This issue occurred several times for me during local testing, and last night @ashquarky confirmed that removing these checks seemed to clear up connection issues for them and many others when applied in prod.
This is also not accurate emulation. As seen in this screenshot from the Nintendo Network Friends server, the official server seems to always respond to packets regardless of the client's connection state, which implies it doesn't track it at all. In this case the SYN packet was resent and resulted in 2 SYN ACK packets from the server.
However, removing these checks alone seems to have introduced a regression. I have hot-patched these changes onto the live friends server and the logs now show many `RMC Message has unexpected size` errors. I believe this is due to the client's RMC ciphers not being reset correctly, but I'm unsure.
In the SYN packet handler the `PRUDPConnection.reset` function is called, which runs `pc.slidingWindows.Clear`. This SHOULD be removing all of the client's RMC ciphers, but clearly something is still going wrong. Due to this regression, this has been marked as a draft for now.
I'm not keen on removing all the connection state checks. Considering this is only an issue at the SYN stage, can't we just skip the checks there?
> can't we just skip the checks there?
The issue isn't limited to just the SYN. The server also sets the client's connection state to `StateConnected` after it sends the CONNECT ACK, but this can also fail: the CONNECT ACK can fail to reach the client, making it resend the CONNECT, which the server then rejects because the state is out of sync. And if we removed the check on both the SYN and the CONNECT, why bother checking at all?
Also, if I remember correctly, the point of the connection state checks in the first place was to remove an old hack related to resetting the client's data on the SYN packet, but we still do that anyway. The old hack reset the client on SYN and forced there to be 1 substream to prevent a panic, but the new implementation uses the `SlidingWindow`, and those are created on the fly if they don't exist (removing the need for the "always create a substream" hack).
Also, even if we only removed the SYN check we'd still run into the `RMC Message has unexpected size` error, since the cipher state isn't being reset correctly. Looking at the code again, it looks like this is because the client's `StreamSettings` is not being reset (which means that the `StreamSettings.EncryptionAlgorithm.Decrypt` and `StreamSettings.EncryptionAlgorithm.Encrypt` functions have out-of-date state).
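To illustrate why stale cipher state produces exactly this symptom, here's a small standalone Go example. It is purely illustrative (standard library `crypto/rc4`, not nex-go's actual `EncryptionAlgorithm`, and the key is a placeholder): a stateful stream cipher tracks a keystream position, so a server-side cipher carried over from before the reconnect no longer lines up with the client's fresh cipher, and the decrypted RMC data (including its size field) comes out as garbage.

```go
package main

import (
	"crypto/rc4"
	"fmt"
)

func main() {
	key := []byte("CD&E") // placeholder key for the example

	// Client encrypts the first payload of its *new* session with a fresh cipher.
	clientCipher, _ := rc4.NewCipher(key)
	payload := []byte("first RMC message of the new session")
	encrypted := make([]byte, len(payload))
	clientCipher.XORKeyStream(encrypted, payload)

	// Server with a fresh cipher (what SHOULD happen after reset) decrypts correctly.
	freshServer, _ := rc4.NewCipher(key)
	good := make([]byte, len(encrypted))
	freshServer.XORKeyStream(good, encrypted)

	// Server with a stale cipher that already consumed keystream from the old
	// session (what happens if the encryption state is not reset) gets garbage.
	staleServer, _ := rc4.NewCipher(key)
	staleServer.XORKeyStream(make([]byte, 10), make([]byte, 10)) // "old" traffic
	bad := make([]byte, len(encrypted))
	staleServer.XORKeyStream(bad, encrypted)

	fmt.Printf("fresh cipher: %q\n", good) // matches the payload
	fmt.Printf("stale cipher: %q\n", bad)  // garbled, so any parsed size is wrong
}
```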
It also just doesn't seem to be accurate emulation on our part.
Also it's worth noting that the hot-patch on the friends server only removes the state check from:
- `handleAcknowledgment`
- `handleSyn`
- `handleConnect`
`handleData` still has the check, and the `RMC Message has unexpected size` error still occurs.
I also hot-patched `pc.StreamSettings = pc.endpoint.DefaultStreamSettings.Copy()` into `PRUDPConnection.reset`, as I assumed the error was caused by bad stream settings state. But this did not seem to help either; the error still occurs.
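For reference, this is roughly the shape of the hot-patched `reset` described above. It is reconstructed from this comment, not copied from the repo, so the surrounding fields are assumptions about nex-go internals:

```go
// Reconstructed sketch of the hot-patched PRUDPConnection.reset, not the exact code.
func (pc *PRUDPConnection) reset() {
	// Drop every sliding window, which should also drop their RMC ciphers.
	pc.slidingWindows.Clear()

	// Hot-patched line: start over from the endpoint's defaults so no
	// per-connection encryption state is carried across the new handshake.
	pc.StreamSettings = pc.endpoint.DefaultStreamSettings.Copy()
}
```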
I cannot replicate the `RMC Message has unexpected size` error locally, but these changes clearly cause it. Before the hot-patched changes that error had only occurred ~200 times; after applying them it has occurred over 15,000 times.
Might I suggest hot-patching the unexpected size error (here and here) to actually print the sizes, then letting the server run for a bit? Something like:
```go
if stream.Remaining() != uint64(length) {
	return fmt.Errorf("RMC Message has unexpected size (%v instead of %v)", length, stream.Remaining())
}
```
This would confirm whether the issue is cipher-related as suspected, or if it's actually some other root cause (e.g. the length may be off by a few bytes, or the stream may have been unexpectedly consumed/truncated).
If it were an issue like that, we would have expected to see this error in the logs at the same frequency already.
Before this revert that error WAS in the server logs, but only ~200 times. After making these changes, it jumped to ~15,000.
So it's definitely related to these changes, and the only realistic cause of that is the payload not decrypting properly.
Also, just letting these payloads pass through would almost certainly be a much bigger headache to deal with. If the sizes do not match then the RMC payload is bad, and passing it through any further logic will almost definitely run into more issues (ranging from much more cluttered logs, as the expected data would not be found, to possible panics, as the data itself is invalid).
@DaniElectra As a compromise for removing the state checks, rather than checking "is your state equal", what if we checked "should you be able to do this at all"?
Meaning, rather than rejecting a SYN packet once connected, we just reject packets we know would be impossible in this stage of the connection.
So that means:
- If `StateNotConnected`, only allow `SYN`
- If `StateConnecting`, allow both `SYN` and `CONNECT`
- If `StateConnected`, allow all packet types
This would still let us drop DATA packets from connections we've timed out on our end, while still allowing the client to resend any state-changing packets it may need to.
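A minimal sketch of that gate, assuming packet type constants along the lines of `SynPacket` and `ConnectPacket` (placeholders; the actual nex-go names may differ):

```go
// Sketch of the proposed "is this packet even possible yet?" gate.
// SynPacket and ConnectPacket stand in for whatever packet type constants exist.
func packetAllowed(state ConnectionState, packetType uint16) bool {
	switch state {
	case StateNotConnected:
		// Nothing has happened yet, so only a SYN makes sense.
		return packetType == SynPacket
	case StateConnecting:
		// A lost SYN ACK or CONNECT ACK means the client may legitimately
		// resend either handshake packet, so accept both.
		return packetType == SynPacket || packetType == ConnectPacket
	default: // StateConnected
		// Fully connected: allow all packet types.
		return true
	}
}
```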
That sounds good to me 👍
Closing in favor of #65