
TcpTransport `writeTimeout` appears to be ignored

Open benfleis opened this issue 9 years ago • 8 comments

Both from my own testing and from looking through the code, it appears that the writeTimeout AtomicBoolean is ignored. It looks like it should be ferried through to the bootstrap, and then possibly have some exception handling added to make it do the right thing. (I dunno, I've never used Netty myself.) In any case, I see deployment cases where it takes multiple minutes for the socket to recognize that it's dead (seeing TCP PUSH over and over); if the default of 5s were effective, this slow stream-failure detection would be avoided by the high-level timeout.

benfleis avatar Jan 16 '15 16:01 benfleis

Soooo, this is kind of an interesting question. Right now, you can pick arbitrary timeouts for a write simply by choosing the timescale for deref(), but it may also make sense to force a timeout for all clients. That imposes an upper bound on deref timeouts, so I want to make sure the default behavior is sane and documented. How would you expect deref to interact with a transport timeout setting? And what do you think sane defaults are? 10 seconds?
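The interaction being asked about can be made concrete with a small stdlib-only sketch. This is not the riemann-java-client API — `CompletableFuture` just stands in for the client's promise type, and `deref` here is an illustrative helper: a per-call deref timeout is a bounded wait on a promise, and a transport-level write timeout would simply put a ceiling on how long any such wait can block before the write is failed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only (not the actual client API): deref with a per-call timeout is
// essentially a bounded get() on a promise. A transport write timeout of,
// say, 10s would cap how long any deref can block, regardless of the
// timescale the caller picks.
public class DerefSketch {
    public static String deref(CompletableFuture<String> promise, long timeoutMs) {
        try {
            return promise.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timeout"; // caller-chosen bound hit before the write completed
        } catch (Exception e) {
            return "error";   // interrupted, or the write itself failed
        }
    }
}
```

Under that model, a transport timeout shorter than the deref timescale would surface as a failed promise rather than an indefinitely blocked caller.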

aphyr avatar Jan 16 '15 17:01 aphyr

I am inclined to agree, practically speaking. In my particular case, I'm using a RiemannReporter -- in a perfect world, I just give it the host/port of the riemann service, and trust that it will "live well", and survive this situation. The DNS cache disabling patch I sent earlier made it at least workable, even if with a long delay after the riemann service restart.

The pedant in me asks: is this variable being honored in any way? If not, maybe at least kill the fake variable.

Sane defaults -- 60s or less. 10s is not aggressive, per se, but it's definitely not passive :) Does anybody ever want a timeout >60s when deploying the sort of system that uses riemann?

benfleis avatar Jan 16 '15 18:01 benfleis

Yeah, I tend to instrument on 1- or 5-second intervals, so 10s is already stale. ;-) I'll have to dig into the source and see whether WriteTimeout does anything right now. For a while I was aiming to have partial backpressure on async sends but wound up backing off that; I couldn't get stable behavior under load. My guess is it's unused, but we should hook it up to the Netty channel options.
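For reference, hooking the setting up would presumably mean adding Netty's `WriteTimeoutHandler` to the channel pipeline when the bootstrap is built. A rough, Java-flavored pseudocode sketch — the handler and pipeline names come from Netty, everything else here is illustrative:

```
// Pseudocode: somewhere in TcpTransport's bootstrap/pipeline setup.
// Netty 3 has org.jboss.netty.handler.timeout.WriteTimeoutHandler(timer, seconds);
// Netty 4's io.netty.handler.timeout.WriteTimeoutHandler(seconds) is similar.
pipeline.addLast("write-timeout",
    new WriteTimeoutHandler(timer, writeTimeout.get()));

// A write that hasn't completed within writeTimeout seconds then fails the
// channel with a WriteTimeoutException, which the transport's exception
// handler could translate into closing the channel and reconnecting.
```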

aphyr avatar Jan 16 '15 18:01 aphyr

I can create a trivial diff to pass the param through, but anything more is beyond my current java/netty + riemann-client scope, at least to do with any competence. Let me know what's useful.

benfleis avatar Jan 19 '15 15:01 benfleis

Yeah, could you shoot me a PR for this? I'm, like, incredibly backlogged right now and haven't had good internet, but that'll make sure I get around to reviewing/merging. :)

aphyr avatar Jan 20 '15 19:01 aphyr

With the current state of the library, what is the correct behavior when a deref times out? Should I be calling .close() on the client and then .connect() again?

eric avatar May 03 '16 00:05 eric

Reconnects have been automatic for 2 years or so, IIRC.


aphyr avatar May 03 '16 02:05 aphyr

It does look like there is reconnection logic, but I couldn't see where a reconnection would happen if a deref timed out. Without a write timeout, it seems you can get into a pathological situation (which I've seen many times) where the process doesn't notice that the connection is dead, but the write buffer is full, so nothing else can be written to the socket.

I guess I'll continue to close the connection on timeout.
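That workaround can be sketched against a minimal hypothetical client interface — this is not the riemann-java-client API, and `sendWithDeref`, `sendOrRecycle`, and `Client` are all illustrative names: on a deref timeout, tear the connection down and reconnect so that a full write buffer on a dead socket can't wedge every subsequent send.

```java
// Sketch of the close-on-timeout workaround; NOT the actual client API.
public class ReconnectOnTimeout {
    /** Minimal stand-in for a client with deref-style sends. */
    interface Client {
        boolean sendWithDeref(String event, long timeoutMs); // true if acked in time
        void close();
        void connect();
    }

    /** Returns true if the event was acked; on timeout, recycles the
     *  connection so later sends go out on a fresh socket. */
    public static boolean sendOrRecycle(Client c, String event, long timeoutMs) {
        boolean acked = c.sendWithDeref(event, timeoutMs);
        if (!acked) {
            c.close();   // drop the (possibly dead) socket and its full buffer
            c.connect(); // subsequent sends use a fresh connection
        }
        return acked;
    }
}
```

The design trade-off is that a transient slowdown now costs a reconnect, but the failure mode described above — a wedged socket silently swallowing events for minutes — is bounded by the deref timeout instead.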

eric avatar May 03 '16 02:05 eric