ozo icon indicating copy to clipboard operation
ozo copied to clipboard

async_connect does not handle socket from libpq properly, fails to connect in some environments

Open msherman13 opened this issue 2 years ago • 3 comments

I believe that the async_connect code violates the libpq documentation by assuming that the underlying socket file-descriptor will not change over the course of the connection sequence. The relevant section of the documentation is below:

https://www.postgresql.org/docs/current/libpq-connect.html

If PQconnectStart or PQconnectStartParams succeeds, the next stage is to poll libpq so that it can proceed with the connection sequence. Use PQsocket(conn) to obtain the descriptor of the socket underlying the database connection. (Caution: do not assume that the socket remains the same across PQconnectPoll calls.) Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or similar system function). Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready to write, then call PQconnectPoll(conn) again. On the first iteration, i.e., if you have yet to call PQconnectPoll, behave as if it last returned PGRES_POLLING_WRITING. Continue this loop until PQconnectPoll(conn) returns PGRES_POLLING_FAILED, indicating the connection procedure has failed, or PGRES_POLLING_OK, indicating the connection has been successfully made.

In my environment (running local postgres 12 server in a docker container), the connection always times out. After debugging, I found that the below code fixes the issue, although it is probably a huge hack. I am not familiar enough with the ozo codebase to feel confident in this fix:

diff --git a/include/ozo/impl/connection.h b/include/ozo/impl/connection.h
index fc92b5e..d458d43 100644
--- a/include/ozo/impl/connection.h
+++ b/include/ozo/impl/connection.h
@@ -62,12 +62,14 @@ ozo::pg::conn connection<OidMap, Statistics>::release() {
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_write(WaitHandler&& h) {
+    assign(release());
     socket_.async_write_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }
 
 template <typename OidMap, typename Statistics>
 template <typename WaitHandler>
 void connection<OidMap, Statistics>::async_wait_read(WaitHandler&& h) {
+    assign(release());
     socket_.async_read_some(asio::null_buffers(), std::forward<WaitHandler>(h));
 }

Unrelated, but it is also worth noting that the null_buffers method of waiting for the socket to become ready is deprecated. The preferred method is to use socket_.async_wait

msherman13 avatar May 25 '22 17:05 msherman13

Hi! Thanks for reaching us. Well, yes, it definitely violates documentation in cases of multi-host connection string. Do you use a multi-host in your connection string, or have the issue with a single host in the connection string?

thed636 avatar May 26 '22 16:05 thed636

I am having the issue with a single-host connection string

msherman13 avatar May 27 '22 15:05 msherman13

the original code seems to work on some client hosts, but not others. I'm not sure the difference between the two hosts I tested on, both centos 7.5

msherman13 avatar May 27 '22 15:05 msherman13