Query hangs when using async_exec
I've been debugging an issue where we have intermittent hangs of our Ruby processes when running queries on CockroachDB using the ruby-pg driver. This is with ruby 3.3.4, libpq-dev 15.8-0+deb12u1, ruby-pg 1.5.7, and Linux 6.8.0-1012-gcp; however, I have been able to reproduce the issue on other versions as well.
We are able to reproduce the issue ~80% of the time with a simple query like this:
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
This will hang forever, tying up a Ruby thread in our Sidekiq workers until the worker is shut down, even days later.
If I disable async_exec, the same query always works, i.e. this works 100% of the time:
PG::Connection.async_api = false
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
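(As an aside, the synchronous path can also be taken per call instead of via the global flag, assuming a ruby-pg 1.x where PG::Connection#sync_exec is public; a minimal sketch:)

conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.sync_exec(<THE QUERY>)  # uses the blocking libpq call for this query only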
Triggering this is related to the content returned by the query: if I leave out certain columns, or remove the ORDER BY, the issue doesn't show up.
Here's the gdb trace when the query is hung:
(gdb) where
#0 0x000070e98a319256 in __ppoll (fds=fds@entry=0x7fff46646eb8, nfds=nfds@entry=1, timeout=<optimized out>,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1 0x000070e98a878390 in rb_thread_wait_for_single_fd (fd=6, events=1, timeout=timeout@entry=0x0)
at /usr/src/ruby/hrtime.h:158
#2 0x000070e98a71dcd9 in rb_io_wait (timeout=4, events=3, io=<optimized out>)
at ./include/ruby/internal/arithmetic/int.h:172
#3 rb_io_wait (io=<optimized out>, events=events@entry=3, timeout=timeout@entry=4) at io.c:1422
#4 0x000070e96ebab84a in wait_socket_readable (ptimeout=0x0, is_readable=<optimized out>, self=124147919560760)
at pg_connection.c:2441
#5 pgconn_async_get_last_result (self=self@entry=124147919560760) at pg_connection.c:3139
#6 0x000070e96ebac939 in pgconn_async_exec (argc=1, argv=0x70e98a09c368, self=124147919560760) at pg_connection.c:3288
#7 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=1,
calling=<optimized out>, reg_cfp=0x70e98a19bbb0, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#8 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bbb0, calling=<optimized out>)
at /usr/src/ruby/vm_insnhelper.c:3530
#9 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#10 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#11 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#12 0x000070e98a8c6e94 in eval_string_with_scope (line=9, file=<optimized out>, src=124147872936320,
scope=124147909484040) at /usr/src/ruby/vm_eval.c:1761
#13 rb_f_eval (argc=<optimized out>, argv=<optimized out>, self=<optimized out>) at /usr/src/ruby/vm_eval.c:1804
#14 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=4,
calling=<optimized out>, reg_cfp=0x70e98a19bc20, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#15 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bc20, calling=<optimized out>)
at /usr/src/ruby/vm_insnhelper.c:3530
#16 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#17 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#18 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#19 0x000070e98a8d7320 in invoke_block (captured=0x70e98a19bed8, opt_pc=<optimized out>, type=<optimized out>, cref=0x0
It seems the issue is that wait_socket_readable gets called when there is nothing more to read from the socket, and since pgconn_async_get_last_result passes NULL as the timeout value ( https://github.com/ged/ruby-pg/blob/master/ext/pg_connection.c#L3139 ), the process just waits forever. I compiled ruby-pg with the change below to hardcode a wait timeout, and afterwards the issue showed up as a hang of a few seconds rather than an infinite one:
diff --git a/ext/pg_connection.c b/ext/pg_connection.c
index e4194a3..1d5e276 100644
--- a/ext/pg_connection.c
+++ b/ext/pg_connection.c
@@ -2438,7 +2438,7 @@ wait_socket_readable( VALUE self, struct timeval *ptimeout, void *(*is_readable)
 		socket_io = pgconn_socket_io(self);
 
 		/* Wait for the socket to become readable before checking again */
-		ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), wait_timeout);
+		ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), 5);
 	} else {
 		ret = Qfalse;
 	}
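As a further diagnostic, the same round trip can be driven by hand with the non-blocking primitives ruby-pg exposes (send_query, consume_input, is_busy, get_result, socket_io), with a hard client-side deadline so a stall raises instead of blocking forever. A minimal sketch, where sql stands in for the problematic query and the 5-second budget is arbitrary:

require 'pg'

conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
conn.send_query(sql)                      # sql = the query that triggers the hang

deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 5
results = []
catch(:done) do
  loop do
    conn.consume_input                    # PQconsumeInput: read whatever has arrived
    until conn.is_busy                    # drain every result that is already complete
      result = conn.get_result
      throw :done if result.nil?          # nil terminates the result stream
      results << result
    end
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise 'query stalled waiting on the socket' if remaining <= 0
    IO.select([conn.socket_io], nil, nil, remaining)
  end
end

If the analysis above is right, this version surfaces the hang as the raise at the deadline, with libpq still reporting is_busy even though nothing more arrives on the socket.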
This issue seems related to https://github.com/ged/ruby-pg/issues/325#issuecomment-737561270 , and some investigation there suggested it was a result of the frame sizes.
Is it possible to create a Dockerfile with a repro?
The Ruby side is easily replicable with Docker, but triggering this requires a specific state on the CockroachDB side which I haven't been able to recreate so far; I'll try to do that.
Is it the expected behaviour that pgconn_async_get_last_result would ever block for a significant period? From the comment here it sounds like it shouldn't block: https://github.com/ged/ruby-pg/blob/beafa09c83b3ef0ac245f9aed0d378bdc82145d0/ext/pg_connection.c#L3138-L3139 However, since it calls wait_socket_readable without specifying a timeout, it can end up permanently blocked waiting for the socket to become readable.
> Is it the expected behaviour that pgconn_async_get_last_result would ever block for a significant period? From the comment here it sounds like it shouldn't block:
Yes, it is intended that it blocks with no timeout. I improved the comment in 7ced09224fcb0351cfe79fd05a0e346e84cd443f. I could imagine that there's a bad interaction between OpenSSL and libpq as described in https://github.com/ged/ruby-pg/issues/325#issuecomment-737561270 . Is it possible for you to test it without SSL?
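(For anyone trying that: assuming the URL is in postgres:// URI form and the server accepts plaintext connections, SSL can be ruled out by appending sslmode=disable, which libpq understands natively. A minimal sketch:)

url = ENV['COCKROACH_DATABASE_URL']
sep = url.include?('?') ? '&' : '?'
conn = PG.connect("#{url}#{sep}sslmode=disable")  # force a non-SSL connection
result = conn.exec(<THE QUERY>)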
I'm also experiencing this issue, on Ruby 3.4.1 with ruby-pg 1.5.9 and libpq-devel 13.20 on Rocky Linux 9.
I first saw this issue after upgrading to CockroachDB 24.1.13; I haven't seen it on CockroachDB 23.2.20. Using the data below I have reproduced it on 24.1.1+, but haven't been able to reproduce it on 24.1.0. (I'm reporting this issue to CockroachDB too, as it's unclear to me where the root problem lies.)
I can reproduce it reliably with the schema and test data in the attached zip file (repro_data.sql). I've also included several queries, numbered 112 to 117, in the zip file, each with predictable behaviour. Query 112 lists 112 IDs in its WHERE clause, and each subsequent numbered query adds one more value to that list without modifying any of the values already there.
Using a fresh connection created with conn = PG.connect(<connection parameters>), and using conn.async_exec to run the query (i.e. create the connection, call conn.async_exec as many times as needed to reproduce, then start over with a new connection for the next query; a sketch of the procedure follows the list below):
- Query 112 is always successful regardless of the number of times it's run
- Queries 113, 114, and 115 run successfully once, then hang on the second execution
- Query 116 runs successfully twice, then hangs on the third execution
- Query 117 is always successful regardless of the number of times it's run
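For concreteness, the procedure looks roughly like this (query_115.sql is my own shorthand for one of the numbered query files in the zip):

sql = File.read('query_115.sql')            # hypothetical file name for a numbered query
conn = PG.connect(<connection parameters>)  # fresh connection per run
1.upto(5) do |attempt|
  puts "attempt #{attempt}"
  conn.async_exec(sql)                      # query 115 hangs here on the second attempt
end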
It doesn't matter which order I select the columns in - it hangs regardless. As @jordan-thoms described, it doesn't hang if I select a different number of columns or different columns in general, but if I replace accessed_on with created_on (which is the same type/shape) it hangs in the same way. It also didn't hang when I replaced the values in the data column with random values of varying lengths.
(Hanging also happens on a non-fresh connection, but it's easier to provide exact repro behaviour with a fresh one.)
I used query 115 to experiment with PG::Connection.async_api set to true and false. If the second execution is run with PG::Connection.async_api = false, then subsequent executions don't hang even with PG::Connection.async_api = true; but if the second execution is run with PG::Connection.async_api = true, then it always hangs. Each run below starts on a fresh connection:
# Run 1: second execution async -> hangs
PG::Connection.async_api = true
conn.async_exec(<query 115>) # works
PG::Connection.async_api = true
conn.async_exec(<query 115>) # hangs

# Run 2: second execution sync -> never hangs again
PG::Connection.async_api = true
conn.async_exec(<query 115>) # works
PG::Connection.async_api = false
conn.async_exec(<query 115>) # works
PG::Connection.async_api = true
conn.async_exec(<query 115>) # works any number of times

# Run 3: second execution async -> hangs
PG::Connection.async_api = false
conn.async_exec(<query 115>) # works
PG::Connection.async_api = true
conn.async_exec(<query 115>) # hangs

# Run 4: second execution sync -> never hangs again
PG::Connection.async_api = false
conn.async_exec(<query 115>) # works
PG::Connection.async_api = false
conn.async_exec(<query 115>) # works
PG::Connection.async_api = true
conn.async_exec(<query 115>) # works any number of times