XChange icon indicating copy to clipboard operation
XChange copied to clipboard

[Streaming-Core] frequent websocket disconnects

Open earce opened this issue 4 years ago • 17 comments

Hi all,

I have spent some time tracking (and attempting to improve) websocket connection reliability in XChange and wanted to detail what I have so far for anyone who may be experiencing the same thing (and anyone that may have time to also work on improving this).

TLDR; I am seeing websocket disconnects anywhere from hourly to every 3 hours to up to about 12 hours. This is a lot, a conversation with Kroitor @ CCXT made me aware of the fact that they can go several days without experiencing a disconnect (for CoinbasePro/Kraken). I suspect these disconnects are self inflicted.

We have an IdleReadTimeout (how long we go without reading anything over the websocket) which will trigger us to disconnect if we don't see anything for 15 seconds). I had added https://github.com/knowm/XChange/commit/3010143ce0d487756943d6178aefbe865ff91ef4 to send a ping because channels which weren't busy would disconnect every few minutes.

This improved the disconnect rate for me from every few minutes to every few hours but is still not what it should be. True websocket disconnects tend to be a little tricky to figure out correctly because if a server process dies without notifying the client of a shutdown/disconnect hook the client has to figure on its own if the channel is dead and this is a bit empirical, things like a long GC pause on the server size can make the client think the channel is dead but may not actually be, so read timeouts tend to be the best efforts.

In any case out of the disconnect's I see very rarely it is the server sending a close channel message which would print this log line LOG.info("WebSocket Client received closing! {}", ctx.channel());. Most disconnects just tend to be what appears to be an inactive channel.

In the current stage I have modified the IdleStateHandler with additional logging to see what the stacktrace is before a channel has inactive methods called. I have attached said stack_trace.log. What is logged as read message is basically us reading a message over the websocket channel, you see there is a unusual gap before we start triggering idle code. However this gap is much smaller then the idle timeout. TempIdleStateHandler is what I've replaced IdleStateHandler in the NettyStreamingService class with.

I still have not figured out what the source of this issue is but hopefully this can be useful to anyone who has noticed this.

earce avatar Nov 19 '20 17:11 earce

Small update I have tracked down what is triggering the channel closing in the Netty code. AbstractNioByteChannel.java set to close which triggers closeOnRead(pipeline). This appears like the websocket is entering a half open state but will update when I have more.

earce avatar Nov 30 '20 16:11 earce

Another update, after opening an issue with the Netty folks it seems like this is getting triggered cause the websocket is receiving an EOF https://github.com/netty/netty/issues/10830.

earce avatar Dec 05 '20 19:12 earce

This is very interesting. i am currently having similiar issues which make Xchange-Streaming almost unusuable to me ....

From 10 Exchanges tested, about half of them fail or have very bad issues.

Working as expected: Bitfinex Poloniex

Having this issue from time to time: Kraken
Hitbtc Gemini

Complaining a lot, or being unusable all:

Okcoin Coinmate

other issues: Cexio - complains about missing Apikeys, when they aren't actually missing LGO - ClassCastExcpetion when instantiating from StreamingExchangeFactory

i was wondering it was time to introduce some actuall deployment/automated tests for Xchange? As a new user it is rather frustrating to pull a fresh release and then see 50% of the Lib fail miserably :D

Currently nobody really knows which modules are having problems and why, as it is not tested or listed anywhere.

i guess using the Issues-System much more would help: e.g. tag things more detailed... most of the Issues here are not even tagged. E.g. this here is clearly a Bug-Report.

gewure avatar Dec 14 '20 15:12 gewure

I don't use most of the exchanges you've mentioned there so I can't opine but feel free to submit some PRs improving this!

earce avatar Dec 14 '20 17:12 earce

LGO sold to Voyager: https://www.prnewswire.com/news-releases/voyager-completes-merger-with-lgo-positioning-company-with-fully-regulated-european-platform-301190929.html

This may explain problems with API

mdvx avatar Dec 15 '20 00:12 mdvx

I have fixed, or configured around problems here Kraken Hitbtc Gemini

mdvx avatar Dec 15 '20 00:12 mdvx

@mdvx if these fixes don't exist in this codebase can you submit PRs?

earce avatar Dec 15 '20 00:12 earce

@earce, sometimes my fixes are radical, but if you want we can spend a little time on a joint screen session comparing notes (weekends is best for this). Where the fix is simple, I have always sent it in

mdvx avatar Dec 15 '20 00:12 mdvx

@mdvx I think it would be great to do that, I'd be curious what the more invasive changes look like. Do you have discord? I setup a discord channel and it lends itself well to screen sharing/collab?

earce avatar Dec 15 '20 00:12 earce

yes, hang on, new PC need to install again

mdvx avatar Dec 15 '20 00:12 mdvx

may i join in that discord, gentleman

gewure avatar Dec 15 '20 19:12 gewure

@gewure yea the discord is setup as a badge on the project you can just click into it.

earce avatar Dec 15 '20 20:12 earce

@earce

@gewure yea the discord is setup as a badge on the project you can just click into it.

Your going to need to help me here, I just dont see it anywhere?

mdvx avatar Dec 16 '20 01:12 mdvx

@earce

@gewure yea the discord is setup as a badge on the project you can just click into it.

Your going to need to help me here, I just dont see it anywhere?

I had to search too..hidden very small in the Readme it is :)

https://discord.gg/n27zjVTbDz

gewure avatar Dec 16 '20 03:12 gewure

@mdvx were you able to hop on? also would you like to spend some time this weekend ?

earce avatar Dec 16 '20 18:12 earce

2020-12-17 02:14:10.536 ERROR 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService        : Reconnection failed: connection timed out: api.huobi.pro/104.16.231.188:443
2020-12-17 02:14:10.536  INFO 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService        : Scheduling reconnection
2020-12-17 02:14:10.536  WARN 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService        : Problem with connection

io.netty.channel.ConnectTimeoutException: connection timed out: api.huobi.pro/104.16.231.188:443
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.54.Final.jar:4.1.54.Final]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.54.Final.jar:4.1.54.Final]

gewure avatar Dec 17 '20 01:12 gewure

@earce Was there ever any resolution on this? I read through the Netty issue and it seemed like it was fixed. ?

nibarulabs avatar Jul 05 '22 21:07 nibarulabs