r2d2 icon indicating copy to clipboard operation
r2d2 copied to clipboard

`get_timeout` exceeds timeout

Open ghost opened this issue 4 years ago • 11 comments

First, thanks for providing the community with multiple database tools.

Sometimes get_timeout(timeout) exceeds timeout. Consider an application that connects to a distributed database with multiple nodes, any of which can serve requests. Suppose the application calls get_timeout, and the receiving database node dies. If the application configured tests on checkout, get_timeout awaits the return of the underlying is_valid call before timing out. is_valid likely executes a trivial SQL query, which encounters TCP errors and retries until the TCP retry limit of the operating system, often 15 minutes, expires. Meanwhile, the application could have obtained a connection to a different database node.

One solution involves modification of the Rust-Postgres and Tokio-Postgres crates. It would expose a function, perhaps is_valid_timeout(timeout), that executes a trivial SQL query or times out. Do you recommend a different approach? If not, would you be open to a pull request in the rust-postgres repository that introduces is_valid_timeout?

ghost avatar Dec 03 '20 21:12 ghost

How does something like Hikari handle validity checks?

sfackler avatar Dec 03 '20 21:12 sfackler

During validation Hikari temporarily changes the network timeout of the connection here.

ghost avatar Dec 03 '20 22:12 ghost

Similarly, during validation a PostgresConnectionManager from r2d2-postgres could temporarily change the network timeout of its tls_connector. I will check whether the underlying types expose the necessary functionality.

ghost avatar Dec 03 '20 22:12 ghost

I will check whether the underlying types expose the necessary functionality.

Unfortunately, they do not. In r2d2-postgres the MakeTlsConnect wraps a tokio_postgres::Socket, which wraps a tokio::net::TcpStream, and neither has configurable timeouts.

ghost avatar Dec 04 '20 20:12 ghost

The Javadoc for Connection::setNetworkTimeout is kind of interesting in that it discusses the timeout only applying to reads and not writes. I guess in practice the outbound TCP buffer doesn't fill for a single query?

postgres could definitely grow support for read/write timeouts via tokio-postgres, though it is interesting to note that libpq does not offer that at all from what I can tell. I imagine they might just advise cranking the TCP keepalive interval down and relying on that.

sfackler avatar Dec 04 '20 20:12 sfackler

Thinking about this a bit more, it actually seems to me like TCP keepalive may be the better approach here compared to some kind of read/query timeout. It doesn't make assumptions about how long a query will legitimately take to run, and can detect a disconnect even while the connection is totally idle rather than only at the time of checkout.

sfackler avatar Dec 04 '20 20:12 sfackler

My understanding is TCP keepalives depend on three parameters:

  • keepalive time, the amount of idle time before a keepalive
  • keepalive interval, the amount of time after an unacknowledged keepalive before the next one
  • keepalive probes, the number of unacknowledged keepalives before connection closure

tokio::net::TcpStream exposes set_keepalive(duration), which sets the connection’s keepalive time to duration. I think it does not allow configuration of keepalive interval and keepalive probes, whose UNIX defaults are 75 seconds and 9, which would still allow long timeouts.

An alternative is configuring the parameters at the machine level, but doing so could affect other services running on the machine, or the machine may be inaccessible.

I appreciate general read and write timeouts might introduce complexity and agree with your point about query time. Instead, what about limiting timeouts to validity checks that only make trivial queries? The Java interface exposes that functionality through isValid(timeout).

ghost avatar Dec 07 '20 20:12 ghost

We could definitely add a variant of is_valid that takes a timeout, but I'd want to be pretty careful about what timeouts are actually passed to it. For example, if get_timeout is called with a 10 second timeout and we're 9.99995 seconds in when a connection becomes available, we probably wouldn't want to try calling is_valid with a 0.00005 second timeout, since that would almost assuredly fail.

sfackler avatar Dec 08 '20 01:12 sfackler

Sounds good! I think is_valid(timeout) would ultimately require access to the Tokio runtime, so I will start with a pull request in the rust-postgres repository.

ghost avatar Dec 08 '20 21:12 ghost

Now that Rust-Postgres exposes is_valid(timeout), I can think of two options that would maintain backward compatibility:

  • Let ManageConnection implementations bound is_valid() calls if they can or want. The bound would be independent of timeout in get_timeout(timeout).
  • Add a method is_valid_timeout to the ManageConnection trait that calls is_valid by default. Then update get_timeout accordingly.

What do you think?

ghost avatar Jan 06 '21 22:01 ghost

I think adding ManageTimeout::is_valid_timeout with a default impl seems best.

sfackler avatar Jan 06 '21 22:01 sfackler