
Websocket internally errors "/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out"

Open • Mehgugs opened this issue 6 years ago • 33 comments

This happens when receiving payloads from the websocket.

Shard-0 has stopped receiving:
/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out
stack traceback:
	[C]: in function 'assert'
	/usr/local/share/lua/5.3/http/websocket.lua:282: in function 'http.websocket.read_frame'
	/usr/local/share/lua/5.3/http/websocket.lua:394: in function </usr/local/share/lua/5.3/http/websocket.lua:386>
	[C]: in function 'xpcall'
	...s/marcus/documents/github/novus/lua/novus/shard/init.lua:183: in function 'novus.shard.messages'
	[C]: in function 'xpcall'
	.../marcus/documents/github/novus/lua/novus/client/init.lua:42: in function <.../marcus/documents/github/novus/lua/novus/client/init.lua:41>

This happens regardless of whether a timeout is passed to the socket.

The weird thing is that when I request deflated messages (per-message inflate/deflate inside my handling loop), I don't get this issue at all. It only happens when I request raw JSON payloads. I'm using scm-0 built from master (8ab5c30198446bbd8522acef200c1c4b8274328b) to fix the timeout issues in the release rock.

This might be due to the payload size: heuristically, I know it's the biggest payload I'm likely to receive that causes this error. However, this is not an issue with other clients for the API I'm connecting to.
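For context, the compressed path in my handling loop is roughly the following (just a sketch; the lua-zlib and lua-cjson bindings here are stand-ins, not necessarily what I actually use):

local zlib = require "zlib"
local cjson = require "cjson"

-- sketch: inflate "deflated" gateway messages before decoding them
local function decode_payload(data, opcode)
	if opcode == "binary" then
		-- each compressed payload is a complete zlib stream,
		-- so a fresh inflate stream is created per message
		data = zlib.inflate()(data)
	end
	return cjson.decode(data)
end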

Mehgugs avatar Jan 31 '19 18:01 Mehgugs

Could you run with the environment variable SOCKET_DEBUG=1 set? I'm wondering if your server is actually sending the data.

daurnimator avatar Feb 01 '19 03:02 daurnimator

With SOCKET_DEBUG=1 it works correctly and logs the payloads I expect to see; it only errors without the debug variable set.

Mehgugs avatar Feb 02 '19 04:02 Mehgugs

it only errors without the debug variable set.

That's extremely odd. The debug variable should only add logging, not change any behaviour. Maybe there is a race condition somewhere, and the extra time spent logging is changing the winner?

daurnimator avatar Feb 02 '19 04:02 daurnimator

That seems likely. Is there a way for me to collect the debug data into a file?

Mehgugs avatar Feb 02 '19 04:02 Mehgugs

is there a way for me to collect the debug data into a file?

Just pipe/redirect it?

daurnimator avatar Feb 02 '19 04:02 daurnimator

https://ghostbin.com/paste/a6czk You can see it cut off reading here. On further inspection, it seems to be crashing consistently; I think I jumped the gun a little when it previously managed to reconnect.

Mehgugs avatar Feb 02 '19 05:02 Mehgugs

I added an xpcall handler to inspect the internals; here is an example of a frame which errors:

{
    "RSV2": false,
    "RSV3": false,
    "RSV1": false,
    "MASK": false,
    "length": 12964,
    "FIN": true,
    "opcode": 1
}

Mehgugs avatar Feb 02 '19 19:02 Mehgugs

Any idea what could cause this? When this happens my receive loop breaks down and I don't know if I can handle it without reconnecting.

Mehgugs avatar Feb 21 '19 16:02 Mehgugs

Your pastebin link has now expired. Could you repost it? (consider using a gist?)

daurnimator avatar Mar 08 '19 00:03 daurnimator

I'll try to get some more data (I've lost the old log). What seems to happen is that the websocket cannot read the frame with a 0 timeout (https://github.com/daurnimator/lua-http/blob/master/http/websocket.lua#L282). I've made this line retry after a cqueues.sleep() and it can take up to 5 attempts to read.

I should also note that I've not found any further clues on how to reproduce it aside from connecting without requesting compression. I think it's possibly some kind of race condition?

Mehgugs avatar Mar 08 '19 13:03 Mehgugs

Why should these reads use a 0 timeout exactly? In practice this data isn't readable without waiting.

if frame.MASK then
	local key = assert(sock:xread(4, "b", 0))
	frame.key = { key:byte(1, 4) }
end

do
	local data = assert(sock:xread(frame.length, "b", 0))
	if frame.MASK then
		frame.data = apply_mask(data, frame.key)
	else
		frame.data = data
	end
end

EDIT:

This is a very naive solution I have been using to try and get a handle on what happens:

-- roughly the upvalues in scope:
local cqueues = require "cqueues"
local promise = require "cqueues.promise"
local ce = require "cqueues.errno"
local monotime, sleep, max = cqueues.monotime, cqueues.sleep, math.max

local used_tries = 0

-- a deadline timer counts as expired once its promise has resolved
local function should_retry(timer)
    if timer then return timer:status() == "pending" else return true end
end

local function read_again(deadline, socket, ...)
    local tries = 0
    local data, msg, code = socket:xread(...)
    local timeout = deadline and deadline - monotime()
    local timer = timeout and promise.new(sleep, timeout)
    -- keep retrying until the read succeeds or the deadline passes
    while should_retry(timer) and (not data or code == ce.ETIMEDOUT) do
        tries = tries + 1
        sleep() -- yield before retrying
        data, msg, code = socket:xread(...)
    end
    used_tries = max(used_tries, tries)
    return data, msg, code
end
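For example, the body read at websocket.lua:282 then becomes something like this (hypothetical call site, assuming sock, frame and deadline are the locals in read_frame):

local data = assert(read_again(deadline, sock, frame.length, "b", 0))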

Calling this function on the socket in place of the xreads, and logging the used_tries:

websocket.read_again used 5324 retries.

This makes me think read_frame is not consuming the deadline properly somewhere?

Mehgugs avatar Mar 10 '19 00:03 Mehgugs

Why should these reads be 0 timeout exactly? In practice this isn't readable without waiting.

There is a preceding :fill that will ensure that there are enough bytes waiting in the socket buffer.

When the :fill fails, we poll for readability, and then tail call ourselves to try again: https://github.com/daurnimator/lua-http/blob/1f308460f3b981633541424abe0adaf4f0474d88/http/websocket.lua#L229
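Roughly, the pattern is the following (an illustrative sketch only, not the library's actual code; the real implementation differs in its fill size and error handling):

local cqueues = require "cqueues"
local ce = require "cqueues.errno"
local monotime = cqueues.monotime

-- sketch: buffer the whole frame body first, then read it with a 0 timeout
local function fill_then_read(sock, length, deadline)
	local ok, err, errno = sock:fill(length, 0)
	if not ok then
		if errno ~= ce.ETIMEDOUT or (deadline and deadline <= monotime()) then
			return nil, err, errno
		end
		-- wait until the socket is readable (or the deadline passes), then retry
		cqueues.poll(sock, deadline and deadline - monotime())
		return fill_then_read(sock, length, deadline)
	end
	-- the body is now buffered, so a 0-timeout xread never needs to wait
	return sock:xread(length, "b", 0)
end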

daurnimator avatar Mar 10 '19 06:03 daurnimator

Are you still having problems? If I'm unable to replicate your issue I'll have to close it.

daurnimator avatar Apr 28 '19 16:04 daurnimator

Hello, I'm so sorry I've not been more proactive with this. I'll try to get a reproducible example working, and as far as I'm aware, yes, there's still something strange happening (it is mostly mitigated right now by requesting compressed messages from the websocket, but it is by no means fixed).

Mehgugs avatar May 08 '19 17:05 Mehgugs

I think being sent a large enough message over the websocket is enough to trigger this. EDIT:

Mon May 13 14:32:22 2019 INF  uncompressed payload size 6675863 bytes

It is usually this payload which causes this error when it is received by my client without compression.

Mehgugs avatar May 13 '19 14:05 Mehgugs

Could you please come up with a test case that I can run?

daurnimator avatar May 14 '19 00:05 daurnimator

I also have this issue. With litcord, run the sample: it will say * BOT NAME ready. and then error with /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

deltanedas avatar Aug 02 '19 22:08 deltanedas

With litcord, run the sample: it will say * BOT NAME ready. and then /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

Could you write a reduced test case? Trying to run that example I just get "Authentication failed".

daurnimator avatar Aug 03 '19 04:08 daurnimator

You need to get a bot token by creating a Discord application.

deltanedas avatar Aug 03 '19 10:08 deltanedas

Very odd. I tried it many times today and it usually times out, but sometimes works. It seems to be Discord having a very low timeout.

deltanedas avatar Aug 03 '19 12:08 deltanedas

Someone more proficient with lua-http websockets might try setting up a simple server which sends a payload as big as the one the remote server (Discord in both cases here) sends.
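Something along these lines might do it (an untested sketch using lua-http's own server; the host, port and payload size are arbitrary choices):

local http_server = require "http.server"
local websocket = require "http.websocket"

-- roughly the size of the 6.6 MB payload mentioned above
local payload = ("x"):rep(6 * 1024 * 1024)

local server = assert(http_server.listen {
	host = "127.0.0.1";
	port = 8765;
	onstream = function(_, stream)
		local headers = assert(stream:get_headers())
		local ws = assert(websocket.new_from_stream(stream, headers))
		assert(ws:accept())
		-- send one large uncompressed text frame, then close
		assert(ws:send(payload, "text"))
		ws:close()
	end;
})
assert(server:loop())

A client would then connect to ws://127.0.0.1:8765, call :receive(), and check whether reading the large frame hits the timeout at websocket.lua:282.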

Mehgugs avatar Aug 05 '19 14:08 Mehgugs

I'm getting the same error. It seems to happen when there's a relatively large amount of data being received (in this case, frame.length was around 17 KB). Changing the timeout from 0 to 1 on line 282 (local data = assert(sock:xread(frame.length, "b", 0)) -> local data = assert(sock:xread(frame.length, "b", 1))) seems to fix it (but I'm not sure whether that works all the time, and maybe it causes other problems).

12Me21 avatar Dec 19 '19 08:12 12Me21

Got the same error. Frame:

{ FIN = true, MASK = true, RSV1 = false, RSV2 = false, RSV3 = false, length = 1265, opcode = 1 }

I got the error here: assert(sock:xread(frame.length, "b", 0))

The client is using RFC 7692 (permessage-deflate); is that relevant?

Kagamihara avatar Feb 20 '20 10:02 Kagamihara

I have the same problem at websocket.lua:282: read: Connection timed out. I run the receive loop inside a cqueues container.

cleoold avatar May 24 '20 22:05 cleoold

Hi there, I ran into this issue; here's a small test case to reproduce it. I use a public echo server to send a large-ish string of data (about 22 KB) and check that I receive it back.

local http_websocket = require'http.websocket'
local large_str = 'some text! '

-- doubling 11 times gives a payload of roughly 22 KB
for i=1,11 do
  large_str = large_str .. large_str
end

local client = http_websocket.new_from_uri('wss://echo.websocket.org:443')
assert(client:connect())
assert(client:send(large_str,'text'))
local data, typ = client:receive()
assert(string.len(data) == string.len(large_str))

jprjr avatar Dec 19 '20 21:12 jprjr

I have gained an interest in fixing this.

sharpobject avatar Jan 03 '21 19:01 sharpobject

I've also been getting that exact same error. I haven't checked very thoroughly yet, but average response sizes vary around 70 KB and are sometimes larger. Not sure if it is just size here, as I haven't yet tested whether I can consistently reproduce the error with a specific response, but I thought I'd leave this here just for the stats.

S-S-X avatar May 10 '21 08:05 S-S-X

Any updates on this? I get the same issue. The server is in fact sending the packets, but the websocket interface simply times out. However, if I step through this slowly while debugging, I actually get the expected, correct result which leads me to think that this is some sort of race condition.

This issue, like the others above, only happens with large-ish payloads.

Thanks

luainin avatar Sep 21 '21 19:09 luainin

Would you be able to write some code that launches a websocket server and reproduces this issue using a sample packet? It's really unreliable for me -- sometimes it happens, sometimes it doesn't.


Mehgugs avatar Sep 22 '21 19:09 Mehgugs

Sorry, just got back to this -- been pretty busy. Yeah, I can try; I think I can make something pretty simple (a C# console application) to demonstrate this.

luainin avatar Sep 27 '21 18:09 luainin