
[Enhancement]: Refactor client/state/socket management

Open · jonbarrow opened this issue 5 months ago · 1 comment

Checked Existing

  • [x] I have checked the repository for duplicate issues.

What enhancement would you like to see?

Right now our client/state/socket management is somewhat sloppy and incorrectly implemented in some areas. I'm starting this issue to point out some key areas I've noticed, and to act as a place of discussion for how to refactor the way we do things. Not everything discussed here needs to be done in a single PR, but we should use this as a place to create a work map.

For context, according to the friends server we have had 1,534,812 connections in the past 20 days. At an absolute bare minimum of 1 packet per connection (the real number is obviously much higher), that's at least 1,534,812 packets we've processed.

Some key points off the top of my head:

  1. The way we handle the transport layer is very verbose and hacky. We have 3 different functions for starting a server (listenDatagram, ListenWebSocket and ListenWebSocketSecure), and we pass a webSocketConnection variable between multiple functions which may or may not be nil depending on whether we're dealing with a WebSocket. This should likely be changed to work like our packet compression and encryption: define the methods we need in a Transport interface, then implement the UDP, WS and WSS transports as structs which satisfy that interface (see the Transport sketch after this list)
  2. We create goroutines very liberally, which introduces performance and state management issues. Goroutines are great, easy, and cheap, but they are not free. We create a goroutine for every CPU core on the system to accept packets concurrently, then each time a packet arrives we spawn another goroutine, then for each (reliable) packet we send back we spawn yet another goroutine for the packet timeout system, etc. Under heavy load this can create a LOT of goroutines, which eats through memory and causes slowdowns as the Go runtime switches between them. We should look into ways to limit the number of goroutines, such as a worker pool (either global or per socket/VPort) fed by channels (see the worker pool sketch after this list). There is a Go idiom which says "Do not communicate by sharing memory; instead, share memory by communicating", which encourages the use of channels. See https://go.dev/blog/codelab-share and https://go.dev/doc/effective_go#sharing for more info
  3. We make heavy use of mutexes for locking access, due to the aforementioned goroutines. We should explore ways to handle data access/modification without them where possible, such as using channels for synchronization rather than locks, or updating data inside a single goroutine rather than from multiple (i.e., one goroutine per connection, and only that goroutine may modify its data). We should also use atomic operations in more places, such as atomic.AddUint32 rather than locking a mutex just to increment a value (see the atomics sketch after this list)
  4. We create a new SocketConnection for every packet even if we've already seen the socket before. At our current numbers this means we've allocated this struct at minimum 1,534,812 times, resulting in millions of wasted allocations, CPU cycles, etc. If I remember correctly this was done to remove the VPort tracking on sockets, since it was broken anyway (https://github.com/PretendoNetwork/nex-go/issues/52), but it has introduced new issues. We should look into reintroducing socket tracking so that we avoid creating a new SocketConnection on every packet (see the get-or-create sketch after this list)
  5. For some reason we still seem to be having issues with "stale" connections. Checking the logs for the friends server and Splatoon recently, there are LOTS of errors/warnings such as Tried to delete connection REDACTED (ID REDACTED) but it doesn't exist! and [func SendNotificationEvent() github.com/PretendoNetwork/nex-protocols-common-go/v2/globals/utils.go:164] : Client not found, amongst others. I believe this is almost certainly related to stale connections, and I think it has something to do with our timeout/retransmission/heartbeat system. We could probably simplify this a lot by using channels as signals in some areas
  6. Our packet timeout/retransmission system does not always free resources when it should. When a packet is acknowledged, the timeout's cancel function is not called. Similarly, the timeout manager creates new contexts when the old ones expire without calling the old context's cancel function. Both of these can prevent the contexts' memory from being freed right away, and can leave a goroutine parked indefinitely until things clear up (since we use goroutines for everything). See the cancel-handling sketch after this list
  7. In the past we have had issues with long-running requests hogging resources, because we don't have a proper request timeout system. We should definitely start creating a context when we receive a packet and pass it down the chain, so that game servers can use it to properly time out long-running tasks like database queries (see the request context sketch after this list)
  8. Client state is stored in a MutexMap indefinitely, and is only ever removed inside the timeout mechanism. But that mechanism can fail, resulting in data that is never removed
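
For point 1, here's a rough sketch of what a Transport interface could look like (every name here is hypothetical, not an existing nex-go API); the UDP, WS and WSS structs would each satisfy it, mirroring how compression and encryption already work:

```go
// Package transport is a sketch; none of these names exist in nex-go today.
package transport

import "net"

// Transport abstracts the underlying socket so the PRUDP layer never
// has to care whether packets arrive over UDP, WS, or WSS.
type Transport interface {
	// Listen starts accepting packets on the given address.
	Listen(address string) error
	// Read blocks until the next datagram/message arrives, returning
	// the payload and the remote address it came from.
	Read() (payload []byte, from net.Addr, err error)
	// Write sends a payload to the given remote address.
	Write(payload []byte, to net.Addr) error
	// Close shuts the transport down and releases its resources.
	Close() error
}
```

The server would then hold a single Transport and never need to pass a possibly-nil webSocketConnection around.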
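
For point 2, a minimal worker pool sketch, assuming a hypothetical rawPacket type: a fixed set of goroutines drain one shared channel instead of spawning a fresh goroutine per packet:

```go
// Package workers is a sketch; rawPacket and the pool shape are hypothetical.
package workers

import (
	"runtime"
	"sync"
)

// rawPacket is a placeholder for whatever we read off the wire.
type rawPacket struct {
	payload []byte
}

// StartWorkers launches one worker per CPU core; all of them pull from
// the same channel, so the goroutine count stays bounded no matter how
// many packets arrive.
func StartWorkers(packets <-chan rawPacket, handle func(rawPacket)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range packets { // exits when packets is closed
				handle(p)
			}
		}()
	}
	return &wg
}
```

The same pattern works per socket/VPort if per-connection ordering matters.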
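
For point 3, the atomics part is the easiest win; a runnable sketch of replacing a mutex-guarded counter with sync/atomic:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var counter uint32 // e.g. an outgoing sequence ID

	// Instead of mu.Lock(); counter++; mu.Unlock(), the increment is
	// a single lock-free operation:
	next := atomic.AddUint32(&counter, 1)
	fmt.Println(next) // 1

	// Reads from other goroutines should also go through sync/atomic:
	fmt.Println(atomic.LoadUint32(&counter))
}
```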
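
For point 4, a get-or-create sketch (the SocketConnection fields here are made up) that allocates a connection only the first time an address is seen:

```go
// Package pool is a sketch; the real SocketConnection fields differ.
package pool

import (
	"net"
	"sync"
)

type SocketConnection struct {
	Address net.Addr
	// ... per-socket state (VPorts, streams, etc.)
}

// sockets maps remote address string -> *SocketConnection.
var sockets sync.Map

// SocketFor returns the existing connection for addr, allocating a new
// one only the first time that address is seen.
func SocketFor(addr net.Addr) *SocketConnection {
	if existing, ok := sockets.Load(addr.String()); ok {
		return existing.(*SocketConnection)
	}
	created := &SocketConnection{Address: addr}
	actual, _ := sockets.LoadOrStore(addr.String(), created)
	return actual.(*SocketConnection)
}
```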
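
For point 6, a sketch of the discipline we'd need around cancel functions (PendingPacket and resend are hypothetical): the ACK path calls cancel immediately instead of letting the context expire on its own:

```go
// Package retransmit is a sketch; PendingPacket and resend are hypothetical.
package retransmit

import (
	"context"
	"time"
)

type PendingPacket struct {
	cancel context.CancelFunc // stored so the ACK path can release it
}

// Schedule arms a retransmission deadline for one reliable packet.
func Schedule(parent context.Context, resend func()) *PendingPacket {
	ctx, cancel := context.WithTimeout(parent, 3*time.Second)
	p := &PendingPacket{cancel: cancel}

	go func() {
		<-ctx.Done()
		if ctx.Err() == context.DeadlineExceeded {
			// If we re-arm with a fresh context here, the old cancel
			// must still be called so nothing lingers.
			resend()
		}
	}()
	return p
}

// OnAck releases the context as soon as the packet is acknowledged,
// unparking the goroutine immediately instead of waiting for expiry.
func OnAck(p *PendingPacket) {
	p.cancel()
}
```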
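
For point 7, a sketch of threading a per-request deadline down to game-server code; the handler shape and query are invented, but QueryRowContext is the real database/sql API:

```go
// Package handlers is a sketch; HandlePacket and the query are made up.
package handlers

import (
	"context"
	"database/sql"
	"time"
)

// HandlePacket shows how a context created at the PRUDP layer can flow
// down so slow work aborts automatically.
func HandlePacket(db *sql.DB, pid uint32) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel() // always release the deadline's timer

	// Anything downstream that accepts a context (database/sql here)
	// gets aborted once the deadline passes.
	row := db.QueryRowContext(ctx, "SELECT name FROM friends WHERE pid = $1", pid)

	var name string
	return row.Scan(&name)
}
```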

I'm sure there are other issues as well with the way we handle client/state/socket management, but these are just the few areas I personally saw and wanted to touch on.

TLDR:

  • We should make a Transport interface rather than the hacky manual transport layer implementation we currently have
  • We should start passing contexts in packets so game servers can signal cancellations at the PRUDP layer for long-running tasks
  • We should look into using channels more, as well as better worker pool designs, to try to reduce the number of goroutines we create
  • We should look into ways to reduce the number of mutexes and locks we have, again through smarter use of channels/goroutines/atomics
  • The packet timeout/retransmission system leaks contexts and parked goroutines by never calling cancel functions

I'm going to CC @DaniElectra and @ashquarky on this one specifically to get input and to encourage discussion here. The above are just some areas I personally saw; if you guys know of other areas that need work or have other solutions to the above problems, I'm happy to hear those as well :+1:

Any other details to share? (OPTIONAL)

No response

jonbarrow · Jun 16 '25 23:06

About stale connections, I know that some of them are created when a client only sends a SYN packet and nothing else. That leaves the connection registered on the server forever, because the timeout only gets armed when the CONNECT packet is received. That still doesn't explain the mentioned issue, so there must be something else happening as well.
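
One possible fix would be to arm a handshake deadline as soon as the SYN arrives and disarm it on CONNECT, so half-open connections get cleaned up; a sketch with hypothetical names:

```go
// Package handshake is a sketch; the real connection type differs.
package handshake

import "time"

type conn struct {
	handshakeTimer *time.Timer
}

// OnSYN arms a deadline when the handshake starts; remove is whatever
// cleanup path the server already uses to drop a connection.
func OnSYN(c *conn, remove func()) {
	c.handshakeTimer = time.AfterFunc(5*time.Second, remove)
}

// OnCONNECT disarms the deadline once the handshake completes.
func OnCONNECT(c *conn) {
	c.handshakeTimer.Stop()
}
```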

DaniElectra · Jul 01 '25 14:07