XChange
XChange copied to clipboard
[Streaming-Core] frequent websocket disconnects
Hi all,
I have spent some time tracking (and attempting to improve) websocket connection reliability in XChange and wanted to detail what I have so far for anyone who may be experiencing the same thing (and anyone that may have time to also work on improving this).
TLDR; I am seeing websocket disconnects anywhere from hourly to every 3 hours to up to about 12 hours. This is a lot, a conversation with Kroitor @ CCXT made me aware of the fact that they can go several days without experiencing a disconnect (for CoinbasePro/Kraken). I suspect these disconnects are self inflicted.
We have an IdleReadTimeout (how long we go without reading anything over the websocket) which will trigger us to disconnect if we don't see anything for 15 seconds). I had added https://github.com/knowm/XChange/commit/3010143ce0d487756943d6178aefbe865ff91ef4 to send a ping because channels which weren't busy would disconnect every few minutes.
This improved the disconnect rate for me from every few minutes to every few hours but is still not what it should be. True websocket disconnects tend to be a little tricky to figure out correctly because if a server process dies without notifying the client of a shutdown/disconnect hook the client has to figure on its own if the channel is dead and this is a bit empirical, things like a long GC pause on the server size can make the client think the channel is dead but may not actually be, so read timeouts tend to be the best efforts.
In any case out of the disconnect's I see very rarely it is the server sending a close channel message which would print this log line LOG.info("WebSocket Client received closing! {}", ctx.channel());
. Most disconnects just tend to be what appears to be an inactive channel.
In the current stage I have modified the IdleStateHandler with additional logging to see what the stacktrace is before a channel has inactive methods called. I have attached said stack_trace.log. What is logged as read message is basically us reading a message over the websocket channel, you see there is a unusual gap before we start triggering idle code. However this gap is much smaller then the idle timeout. TempIdleStateHandler is what I've replaced IdleStateHandler in the NettyStreamingService class with.
I still have not figured out what the source of this issue is but hopefully this can be useful to anyone who has noticed this.
Small update I have tracked down what is triggering the channel closing in the Netty code. AbstractNioByteChannel.java set to close which triggers closeOnRead(pipeline)
. This appears like the websocket is entering a half open state but will update when I have more.
Another update, after opening an issue with the Netty folks it seems like this is getting triggered cause the websocket is receiving an EOF https://github.com/netty/netty/issues/10830.
This is very interesting. i am currently having similiar issues which make Xchange-Streaming almost unusuable to me ....
From 10 Exchanges tested, about half of them fail or have very bad issues.
Working as expected: Bitfinex Poloniex
Having this issue from time to time:
Kraken
Hitbtc
Gemini
Complaining a lot, or being unusable all:
Okcoin Coinmate
other issues: Cexio - complains about missing Apikeys, when they aren't actually missing LGO - ClassCastExcpetion when instantiating from StreamingExchangeFactory
i was wondering it was time to introduce some actuall deployment/automated tests for Xchange? As a new user it is rather frustrating to pull a fresh release and then see 50% of the Lib fail miserably :D
Currently nobody really knows which modules are having problems and why, as it is not tested or listed anywhere.
i guess using the Issues-System much more would help: e.g. tag things more detailed... most of the Issues here are not even tagged. E.g. this here is clearly a Bug-Report.
I don't use most of the exchanges you've mentioned there so I can't opine but feel free to submit some PRs improving this!
LGO sold to Voyager: https://www.prnewswire.com/news-releases/voyager-completes-merger-with-lgo-positioning-company-with-fully-regulated-european-platform-301190929.html
This may explain problems with API
I have fixed, or configured around problems here Kraken Hitbtc Gemini
@mdvx if these fixes don't exist in this codebase can you submit PRs?
@earce, sometimes my fixes are radical, but if you want we can spend a little time on a joint screen session comparing notes (weekends is best for this). Where the fix is simple, I have always sent it in
@mdvx I think it would be great to do that, I'd be curious what the more invasive changes look like. Do you have discord? I setup a discord channel and it lends itself well to screen sharing/collab?
yes, hang on, new PC need to install again
may i join in that discord, gentleman
@gewure yea the discord is setup as a badge on the project you can just click into it.
@earce
@gewure yea the discord is setup as a badge on the project you can just click into it.
Your going to need to help me here, I just dont see it anywhere?
@earce
@gewure yea the discord is setup as a badge on the project you can just click into it.
Your going to need to help me here, I just dont see it anywhere?
I had to search too..hidden very small in the Readme it is :)
https://discord.gg/n27zjVTbDz
@mdvx were you able to hop on? also would you like to spend some time this weekend ?
2020-12-17 02:14:10.536 ERROR 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService : Reconnection failed: connection timed out: api.huobi.pro/104.16.231.188:443
2020-12-17 02:14:10.536 INFO 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService : Scheduling reconnection
2020-12-17 02:14:10.536 WARN 2635139 --- [tLoopGroup-71-2] i.b.x.huobi.HuobiStreamingService : Problem with connection
io.netty.channel.ConnectTimeoutException: connection timed out: api.huobi.pro/104.16.231.188:443
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.54.Final.jar:4.1.54.Final]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.54.Final.jar:4.1.54.Final]
@earce Was there ever any resolution on this? I read through the Netty issue and it seemed like it was fixed. ?