SquadJS icon indicating copy to clipboard operation
SquadJS copied to clipboard

SquadRcon.onData Crashing Recently

Open ect0s opened this issue 3 years ago • 1 comments

Description of Issue

Crashed twice in the past two days on the following callstack; One immediately after a server crash, so I assumed it was related to a missing packet from the crash.

However, this morning we crashed shortly after hitting live during Fallujah Skimish v2

Errors or Screenshots of Issue

file:///opt/SquadJS/core/rcon.js:80 this.responseCallbackQueue.shift()( ^

TypeError: this.responseCallbackQueue.shift(...) is not a function at SquadRcon.onData (file:///opt/SquadJS/core/rcon.js:80:49) at Socket.emit (events.js:315:20) at addChunk (internal/streams/readable.js:309:12) at readableAddChunk (internal/streams/readable.js:284:9) at Socket.Readable.push (internal/streams/readable.js:223:10) at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

Squad Information

  • SquadJS Version: v3
  • Squad Version: 2.9

If potentially relevant, please provide regarding the state of the Squad server at the time of error, e.g. the current layer.

System Information

  • Operating System: Linux
  • NodeJS Version: Node 14
  • Yarn Version:

ect0s avatar Jul 04 '21 16:07 ect0s

I'm having this same issue, however off of Windows. Did you ever solve this yourself?

image

enderneuer avatar Sep 11 '22 06:09 enderneuer

Is this still an issue?

werewolfboy13 avatar Dec 29 '22 00:12 werewolfboy13

Is this still an issue?

Yes, though its rare.

ect0s avatar Jan 07 '23 17:01 ect0s

Resolved on #258

werewolfboy13 avatar Jan 13 '23 21:01 werewolfboy13

I've root caused this, and the issue mentioned in

https://github.com/Team-Silver-Sphere/SquadJS/issues/249.

The essentials are this:

We are using the 'endpackets' to fire off the callbacks pending for each execute request,

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L80

These endpackets are generated here:

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L323

As described in:

https://developer.valvesoftware.com/wiki/Source_RCON_Protocol#Multiple-packet_Responses

The issue lies in part in here:

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L119

We expect to receive a series of two packets,

One length 10 with no extra data, one with extra data (0x0000 0001 0000 0000).

We attempt to discard this empty data here: https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L135

However, getting to that check requires that we pass multiple conditions we can sometimes fail

We DO not receive packets from nodejs, we receive chunks of a stream. This at times means that the packets can span two or more onData events.

https://nodejs.org/api/net.html#class-netsocket

Given a sequence such as the following:

0a 00 00 00 02 00 00 00 00 00 00 00 00 00 0a 00 00 00 02 00 00 00 00 00 00 00 00 00 | 00 01

We will split out two "endpackets" and leave the junk data 00 01 in the buffer.

In fact, this can happen in several cases where we do not actually receive the rest of the 2nd packet in the current chunk we are processing.

In the above case, we split off packet one:

0a 00 00 00 02 00 00 00 00 00 00 00 00 00 0a 00 00 00 02 00 00 00 00 00 00 00 00 00 | 00 01

And fulfill its callback,

then

0a 00 00 00 02 00 00 00 00 00 00 00 00 00 | 00 01

And attempt to fulfill a second callback

Followed by prepending the left over data onto the next packet 00 01

This leads to the issue in https://github.com/Team-Silver-Sphere/SquadJS/issues/249, where we have a junk packet. And from this point on, all our packets are likely to be junk because we are now splitting the data incorrectly.

This was obvious when all of our junk packets had a 'size' of 256, or 00 01 00 00

As for the issue posted here,

If the callBackQueue size is 1, and we attempt to fulfill a second callback, we get the issues as raised here

TypeError: this.responseCallbackQueue.shift(...) is not a function.

We shift an empty array, get back undefined (not a function), and cannot invoke it.

If the callback queue is is 2, we instead send an empty response to the waiting callback

"[SquadServer][1] Failed to update layer information. TypeError: Cannot read property '1' of null
    at SquadRcon.getNextMap 

From this point forward, again, we are now out of sync.

We will fail to match data on all incoming requests when we process the bodies back into results.

We can also deadlock if the Squad restarts while we have pending callbacks, as again, we do not clear these and are blindly assuming the incomingData buffer and incomingResponse buffers are correct.

I've checked multiple logs since finding a way to reproduce this issue and have never seen anything related to

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L285

As for the window for these errors, and why they are rare:

We need to fail the check here:

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L129

If we have say, 15, or 16 bytes, we will pass through to

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L149

Where we cut the packet, and leave the extra 1-2 bytes in the buffer.

We then fail the "while" check here:

https://github.com/Team-Silver-Sphere/SquadJS/blob/164bad3e4cc8e5d2097079bcb741778053b1b338/core/rcon.js#L119

ect0s avatar Jan 13 '23 22:01 ect0s

I have a fix in progress for this issue, however the wider case is it still won't prevent all deadlocks, and may actually cause more when squad restarts while we have pending requests.

IE, we have sent some requests, squad crashes or gets restarted in the seconds before we have gotten back a full response.

ect0s avatar Jan 13 '23 22:01 ect0s

Fixed

werewolfboy13 avatar Dec 29 '23 00:12 werewolfboy13