
AWS RDS Aurora Serverless connection hangs

Open joelvh opened this issue 5 years ago • 39 comments

I've been using the pg gem in an AWS Lambda function, connecting to AWS RDS Aurora Serverless PostgreSQL. The connection would hang periodically when executing a query. It was not always the same query, but it would eventually start to hang consistently.

I wanted to see if anyone else has seen this behavior or if it might be something specific to my environment. I posted a stacktrace using Sequel here: https://github.com/jeremyevans/sequel/issues/1660#issuecomment-549086384

At first I thought it might be that the gems were compiled with the wrong version of the PostgreSQL client, but I don't think that is the issue. Maybe this is considered out-of-scope for the pg project, but for what it's worth, I'm using the lambci/lambda image to replicate the AWS Lambda environment.

My Dockerfile installs PostgreSQL and gems like this:

FROM lambci/lambda:build-ruby2.5

# RUN yum install -y postgresql postgresql-devel
RUN yum install -y wget
# See: https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-6-x86_64/
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-libs-10.7-1PGDG.rhel6.x86_64.rpm
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-10.7-1PGDG.rhel6.x86_64.rpm
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-devel-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-libs-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-devel-10.7-1PGDG.rhel6.x86_64.rpm
RUN gem update bundler
# See: https://github.com/stevenringo/lambda-ruby-pg-nokogiri/blob/master/Makefile#L51
RUN bundle config --local build.pg --with-pg-config=/usr/pgsql-10/bin/pg_config
RUN bundle config --local silence_root_warning true
ENV PATH "$PATH:/usr/pgsql-10/bin"
# ENV POSTGRES_LIB=/usr/pgsql-10/lib
# RUN ls -la /usr/lib64

ADD Gemfile /var/task/Gemfile
ADD Gemfile.lock /var/task/Gemfile.lock

RUN bundle install --standalone --path /var/task/vendor/bundle --clean
RUN mkdir -p /var/task/lib
RUN cp -a /usr/pgsql-10/lib/*.so.* /var/task/lib/

Could this be a case where the pg gem is having a hard time with an unusual response from AWS RDS Aurora Serverless?

Thanks in advance.

joelvh avatar Jan 21 '20 00:01 joelvh

Both Lambda and Aurora Serverless sound like cases where connections should be handled carefully. In the Sequel issue, I see you are using a connection pool.

Questions that come to mind:

  1. Is the traffic to this Lambda function steady or sporadic?
  2. Is the traffic to this Aurora cluster steady or sporadic?
  3. What is happening at the Aurora cluster when the query hangs?
  4. How many connections are free/idle versus used in the connection pool?
  5. How many connections are idle versus active in the Aurora cluster?
  6. Are you using the Sequel connection validator? (A minimal sketch of enabling it follows this list.)
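
For anyone unfamiliar with the connection validator, enabling it in Sequel looks roughly like the sketch below. This is only an illustration, not part of the original comment; the DATABASE_URL value is a placeholder, and a validation timeout of -1 (check on every checkout) is the most aggressive setting.

require 'sequel'

# Sketch only: enable Sequel's connection_validator extension so stale pooled
# connections are detected before a query is sent on them.
DB = Sequel.connect(ENV['DATABASE_URL'])   # placeholder connection string

DB.extension(:connection_validator)

# The default only validates connections idle for roughly an hour; -1 validates
# on every checkout, trading a little overhead for earlier detection.
DB.pool.connection_validation_timeout = -1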

cbandy avatar Jan 21 '20 05:01 cbandy

Hi @cbandy, thanks for the feedback. The traffic to the Lambda function was steady, and so, by extension, was the traffic to the Aurora cluster. However, I wasn't able to get much insight into the Aurora cluster and what was happening from its point of view while the connection hung.

I'll review our setup to double-check the areas you've mentioned. I hadn't heard of the Sequel connection validator. It does remind me that we tried several settings, such as pool size and number of threads. We also set the timeout to around 5 seconds or so, yet (as Jeremy mentioned) no error was raised by pg that Sequel could handle, and the request timed out at 30 seconds (the Lambda max).

Do you think that using dead pooled connections would cause those symptoms where pg hangs and doesn't throw some kind of error?

joelvh avatar Jan 21 '20 07:01 joelvh

However, I wasn't able to get much insight into the Aurora cluster and what was happening from that point of view as the connection hung.

You'll need to go to Aurora separately to see its view of the world. E.g. is it scaling at that moment? Does it report fewer than max connections? Is the connection in question visible there? If so, is it idle, active, or waiting?

Do you think that using dead pooled connections would cause those symptoms where pg hangs and doesn't throw some kind of error?

I think so, but I don't have experience with this exact setup. Narrowing it down depends on what the other end of the connection sees/believes.

cbandy avatar Jan 21 '20 14:01 cbandy

@cbandy I'll work on testing things out a bit more and report back. Thanks for the help!

joelvh avatar Jan 21 '20 19:01 joelvh

I don't have any more details to add yet, but I can confirm we're seeing this same thing on two different projects. Both are Rails apps running on EC2 instances, not Lambda.

spikex avatar Jan 24 '20 00:01 spikex

I work with @spike. I was able to get a somewhat repeatable example for a few hours on one of our customers' systems (ActiveRecord 6, pg 1.1.4), and it was a select that returned 3 columns from all rows in the table. Say I had a table t with exactly 50,000 rows:

  • if the AR query resolved to select a,b,c from t, it would hang
  • if the AR query resolved to select a,b,c from t limit 49999, it would return fine
  • if the AR query resolved to select a,b,c from t limit 50000, it would hang

I did all of these via AR in irb, and then did the resulting query via the raw ruby PG connection and got the same results.

Doing the queries via psql from the same client machine as the rails machine resulted in successes on all fronts.

Unfortunately, a couple of hours later, while attempting to get trace data, the hanging behavior stopped on my test table. It didn't happen for a few days, and now it is coming back on a different table. From my perspective, this is turning into a heisenbug. I'm waiting for it to happen consistently again, and then I'll do a full trace and possibly a tcpdump.

From the Aurora Serverless perspective, this is an over-allocated cluster with very low load, and currently I'm not allowing it to change size so that we can rule out a scaling event as the cause.

I know this doesn't really help all that much at the moment; I'll have more information when I can get a proper capture. Just letting folks know that there's definitely something here.

copiousfreetime avatar Feb 10 '20 22:02 copiousfreetime

One additional data point. On our other project, I migrated the client from Serverless to Servered (Serverful? Severicious?) and the problem went away.

spikex avatar Feb 11 '20 17:02 spikex

@marcilio @danilop I've seen you guys write about Aurora PostgreSQL Serverless at AWS. Have you heard of this kind of issue, given that the serverless flavor of Aurora may be what's causing ours? I hope it's not a reach to pull you in on a potentially Ruby-specific issue. Thanks!

joelvh avatar Feb 11 '20 17:02 joelvh

@copiousfreetime Thank you for sharing the details of your investigation! I don't have a good idea so far; maybe you could try the synchronous query API via PG::Connection#sync_exec instead of plain exec the next time, in case you get a chance to do so.
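
For clarity, the difference in call path is just this (a sketch, assuming an already-established PG::Connection named conn):

# exec is an alias for async_exec: libpq's asynchronous API plus Ruby-side socket waiting.
conn.exec("SELECT 1")

# sync_exec goes through libpq's blocking synchronous API (PQexec) instead.
conn.sync_exec("SELECT 1")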

larskanis avatar Feb 11 '20 20:02 larskanis

@larskanis After a long time of not happening, it is happening again right now. I have a query that is supposed to return 440 rows, and I've managed to get a solid pg trace using PG::Connection#trace. Since there are some email addresses and other customer data in them, I've created a private repo to store the traces and invited you, @cbandy, and @spikex to take a look if you want.

The main results are:

the query

This query should return 440 rows of data

    SELECT "form_submissions".*
      FROM "form_submissions"
     WHERE "form_submissions"."type" = 'ObservationSubmission'
       AND (processed_at is not null)"

Using the following script, I attempted to run the query using sync_exec, async_exec, and async_exec_params, each with and without an ORDER BY created_at clause.

#!/usr/bin/env ruby

sql = "SELECT \"form_submissions\".* FROM \"form_submissions\" WHERE \"form_submissions\".\"type\" = 'ObservationSubmission' AND (processed_at is not null)"

ar_connection = ObservationSubmission.connection
pg_connection = ar_connection.instance_variable_get("@connection") # PGConnection

# hangs
# method = "async_exec_params"
# method = "async_exec"

# works fine
method = "sync_exec"

File.open("#{method}.trace", "w+") do |f|
  # f.sync = true
  pg_connection.trace(f)
  result = pg_connection.send(method, sql)
  i = 0
  rows = []
  result.each do |row|
    i += 1
    rows << row.values
  end

  puts "#{i} done"
end

A ruby-pg trace was captured for each run.

results

async_exec_params was not included as it turned out to be byte-for-byte trace equivalent to async_exec

method       order by   result
sync_exec    no         440 rows
sync_exec    yes        440 rows
async_exec   no         hang
async_exec   yes        440 rows

oddity

Since I had to kill the async_exec process, I thought to turn off file buffering on the trace file. When I did this and added f.sync = true (line 16 in trace.rb) to the File that was receiving the pg trace, async_exec worked correctly and returned 440 rows.

This was repeatable: I took the file sync out and async_exec would hang; put it back in and it would work fine.

thoughts

The trace differences between async and sync weren't much; mostly it was just the last row that got cut off. Which, to me, seems like a buffering issue of some type.

copiousfreetime avatar Mar 21 '20 17:03 copiousfreetime

Update on the sync/async/order by: the order by does not appear to be an indicator, nor does File.sync. I have a different query, of exactly the same type as above and with an order by, and it has started to hang now when it wasn't hanging earlier. This new query does not hang consistently in isolation.

copiousfreetime avatar Mar 22 '20 01:03 copiousfreetime

Hmm. All of the traces appear cut off. Perhaps there needs to be an f.flush or f.fsync at the end of the block (before the file is closed).
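
Something like the following, as a sketch on top of the earlier trace script (pg_connection as defined there, file name arbitrary):

File.open("async_exec.trace", "w+") do |f|
  pg_connection.trace(f)
  # ... run the query and read the result ...
  f.flush   # push Ruby's userspace buffer to the OS
  f.fsync   # ask the OS to write it out to disk
end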

Each of the traces ends in the middle of a column value of a DataRow message. For example, this is a 4-byte length of 1546 then 1546 bytes of the value:

From backend (#4)> 1546
From backend (1546)> ...

Unless I'm misunderstanding, the last trace line comes from here, at which point the data has already been read off the wire and the entire value should be visible in the trace file.

cbandy avatar Mar 24 '20 01:03 cbandy

I'm pretty sure I'm experiencing this issue as well. I've tested with pg 0.19 and 1.2.3, talking to an Aurora Serverless cluster (PostgreSQL 10.7 compatibility) and a PostgreSQL 10.7 RDS server instance for comparison.

I encountered the issue using a workflow that uses a cursor to scan through different tables 1000 rows at a time, and for a number of larger tables it will scan through a chunk of the table before stalling on a particular slice of rows. This is generally reproducible, so if it stalls on rows 171000 to 172000, it will stall on the same set of rows if I start the cursor at offset 170000 instead. Oddly, starting directly at offset 171000 sometimes fails and sometimes doesn't.

I've pg_dumped out tables and loaded them into a different database on the same cluster and the behavior is the same on the copy. Adding or removing indexes didn't appear to change anything either. As a side oddity though, renaming certain columns will make the workflow hang on a different set of 1000 rows (so at offset 367000 instead of 171000, for example).

I've compared traces for the postgres server and the serverless cluster and didn't observe any significant differences (the table ID was different, and certain single-letter protocol messages were repeated different numbers of times but I didn't see a pattern there). Note that these were traces after being interrupted using SIGINT, and with File.sync = true. When I kill -9 instead, the trace is cut off after the From in From backend in the middle of a row. Unfortunately I've had no luck replicating the issue with artificial data, so I can't easily share the trace files, but if you're interested I can try to find a way to sanitize them.

I've tested with sync_exec vs async_exec, and at first I thought that the problem was only with async_exec, but after attempting to monkey-patch Sequel to use sync_exec I discovered that the workflow using sync_exec also stalls, just on different offsets in each table.

aniketschneider avatar May 22 '20 16:05 aniketschneider

@aniketschneider This sounds a little bit like it stalls after the same number of bytes transferred. What happens if you try a query like the following? It returns 1GiB of text as 1MiB rows.

SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)

cbandy avatar May 23 '20 04:05 cbandy

@cbandy I thought I saw that query hang the first couple times I tried it, but either I didn't wait long enough for it to complete or it doesn't fail consistently, because now it's succeeding every time I run it.

However, when I use a cursor to scan through the results of that query, I can get it to fail:

require 'pg' # v 1.2.3

# removed connection info

conn = PG.connect(
  host: host,
  dbname: db,
  user: user
)

batch_size = ARGV[0]

sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)"

cursor_name = "bug_replication_cursor"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)

  conn.transaction do |conn|
    conn.exec("declare #{cursor_name} no scroll cursor without hold for #{sql}")
  
    slice_num = 0
    begin
      slice = conn.async_exec("fetch forward #{batch_size} from #{cursor_name}")
      puts "Slice #{slice_num}, #{slice.count} rows"
      slice_num += 1
    end while slice.any?
  end
end

I tried this with various batch sizes:

Batch size   Hangs before completion?
1 to 6       no
7 to 8       yes
9            no
10 to 15     yes
16           no
17 to 19     yes
128          yes
256          no

Hard to find a pattern there; on the plus side, this is entirely artificial data, so perhaps it can be replicated on other clusters.

aniketschneider avatar May 26 '20 16:05 aniketschneider

When a particular batch size hangs, does it consistently/repeatedly do so on a particular slice_num?

cbandy avatar May 26 '20 17:05 cbandy

Sort of. With a batch size of 8, for example, I am seeing it hang anywhere from slice_num 1 to 4, but I've never seen it get higher than that.

This doesn't really line up with how I saw it happen on our ETL data, where it seemed like it was always hanging on the same slice_num.

aniketschneider avatar May 26 '20 17:05 aniketschneider

Just to eliminate some potential false signals here, I've stripped out the transaction and the cursor and am seeing roughly the same behavior. I say "roughly" because this version seems to have a slightly different set of batch sizes for which it does not hang; for example, it does not hang on batch size 4, but does hang on 3.

require 'pg'

# removed connection info

conn = PG.connect(
  host: host,
  dbname: db,
  user: 'postgres'
)

if ARGV.length >= 1
  batch_size = ARGV[0]
end

sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,#{batch_size})"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)

  slice_num = 0
  begin
    slice = conn.async_exec(sql)
    puts "Slice #{slice_num}, #{slice.count} rows"
    slice_num += 1
  end while slice.any?
end

aniketschneider avatar May 26 '20 18:05 aniketschneider

Another test could be streaming / single-row-mode: https://www.rubydoc.info/gems/pg/PG/Connection#set_single_row_mode-instance_method
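
A rough sketch of such a streaming test, assuming an already-established PG::Connection named conn (error handling kept minimal):

sql = "SELECT rpad('', 1024*1024, 'x') FROM generate_series(1, 1024)"

conn.send_query(sql)
conn.set_single_row_mode   # must be called right after send_query, before fetching results

rows = 0
while result = conn.get_result
  result.check              # raise if the server reported an error partway through
  rows += result.ntuples    # one tuple per result in single-row mode, 0 in the final result
  result.clear
end
puts "streamed #{rows} rows"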

cbandy avatar May 26 '20 20:05 cbandy

In our testing...

conn = PG::Connection.open(...)

conn.exec(query) # doesn't work (hangs)

conn = PG::Connection.open(...)

conn.send_query query
conn.get_result # works (doesn't hang)

conn = PG::Connection.open(...)

conn.send_query query
conn.block # doesn't work (hangs)

So the issue is in .block where wait_socket_readable is called.
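
For comparison, here is a sketch (illustration only, not a recommended workaround) of roughly what #block does, driving the socket by hand with the same conn and query as above:

conn.send_query(query)

loop do
  conn.consume_input                           # pull any buffered bytes into libpq
  break unless conn.is_busy                    # get_result would no longer block
  IO.select([conn.socket_io], nil, nil, 1.0)   # otherwise wait for the socket to become readable
end

result = conn.get_result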

I believe this issue is the same issue: https://github.com/ged/ruby-pg/issues/297

krazyjakee avatar Jun 24 '20 11:06 krazyjakee

Not sure if this can give any clue for the developers to debug the issue, but here is my experience with Ruby and Aurora Serverless.

We have a Rails 4 application with pg v0.18.4. The app has been active for 5 years without an issue in its connection with a RDS PostgreSQL.

The moment we migrated to Aurora Serverless we started to notice the behaviour described in this issue. Some queries just hung and never returned any result. However, the Aurora logs indicated no error and showed all the queries being resolved in good time.

We wrote a dirty ActiveRecord extension to deal with this situation: https://gist.github.com/brenes/ca6a45c1399e413642cafa87f1c8b19d
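
The gist itself isn't reproduced here; purely as a hypothetical illustration of the general shape such a guard can take (the method name and timeout are made up, and this is not the code from the linked extension):

# Hypothetical sketch, NOT the linked gist: bound the wait instead of blocking forever.
def exec_with_deadline(conn, sql, seconds = 5)
  conn.send_query(sql)
  unless conn.block(seconds)   # PG::Connection#block returns false if the timeout elapses
    conn.cancel                # ask the server to abandon the query
    raise PG::ConnectionBad, "query did not respond within #{seconds}s"
  end
  conn.get_last_result
end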

After this we started to see a different behaviour. Every time the database tries to scale (up or down), every query just freezes for some seconds (between 5 and 30 seconds and, in a few cases, up to 5 minutes). Searching the logs for the message we set in the gist, we saw an almost perfect match, but we are not sure whether those queries would have hung forever or would have returned a result through async_exec.

Speaking with Amazon Support they have confirmed to us how Aurora Serverless works.

  1. They have a proxy in front of the database where all the connections are established.
  2. Once they decide a scale (up or down) is required, every query sent enters a queue where it waits for the scale.
  3. The database tries to find a point where the scale is safe (i.e. no active query is modifying the database).
  4. The database scales.
  5. All the queries in the queue are resolved.

Points 3 and 4 are what take up most of those 5-second to 5-minute pauses.

Maybe all this proxy dance is what is causing the issues with the async_exec method? It could be that some of the queries used as examples above trigger a scale-up operation (due to CPU use, maybe?) and that creates the problem.

Sorry if all this is not quite helpful and thanks for the hard work!

brenes avatar Jun 29 '20 11:06 brenes

@brenes Thanks for the write-up here. Your extension is very interesting as we also had trouble scaling up or down and it's now clear why.

Your symptom is equally valid as a bug. I believe what you're describing is a different symptom of the same underlying issue as ours.

krazyjakee avatar Jun 30 '20 12:06 krazyjakee

I am having the same issues, running the latest pg gem with the latest Rails. I also tried other versions.

Here is one query that hangs. It only hangs for specific combinations of LIMIT and OFFSET; for example, increasing or decreasing either LIMIT or OFFSET by one makes it work fine. One odd thing is that if I run the Photo Load by itself, it loads fine.

irb(main):015:0> User.includes(:photos).order(:display_name, :created_at).page(11).per(25).each { |e| puts "HANGING" }
  User Load (10.7ms)  SELECT "users".* FROM "users" ORDER BY "users"."display_name" ASC, "users"."created_at" ASC LIMIT $1 OFFSET $2  [["LIMIT", 25], ["OFFSET", 250]]
^C
  Photo Load (463863.1ms)  SELECT "photos".* FROM "photos" WHERE "photos"."user_id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) ORDER BY "photos"."created_at" ASC  [["user_id", "376971a1-1444-4bd7-8b07-ea87ff76ee69"], ["user_id", "d81c0d84-be0f-410f-bf93-0164d56d85ce"], ["user_id", "04347252-77e0-4582-967b-ded4e147ca08"], ["user_id", "dced8a64-629e-45f7-b04a-eb9b87824d42"], ["user_id", "5c83951c-cee7-454f-844c-67eb7a889f4d"], ["user_id", "2584154a-1dcc-4ecc-9370-2eb6bcf2be24"], ["user_id", "1bd3ea4e-e25d-4ebd-8b2f-29c2af4a2db5"], ["user_id", "70279c04-ef33-402b-adfc-8e079d603f64"], ["user_id", "3271e9eb-fb82-4f13-8c56-d8a5f11ae16f"], ["user_id", "9696f1d4-dc8f-4649-a8af-9caca6b0fb5d"], ["user_id", "4719fddb-15a2-4f25-919f-a2936a01b821"], ["user_id", "edc76477-46bc-4172-a2cf-5881e9224229"], ["user_id", "1717b022-62ca-4793-a5d3-2686669f26e4"], ["user_id", "03ea9be9-1331-4f0a-9c05-95d542c03ddf"], ["user_id", "b79b1b80-0734-4674-b152-98e5982d24dc"], ["user_id", "62afa220-6e37-42d2-9bda-b16e09180f16"], ["user_id", "584f9836-ddaf-4231-b613-1aee5c0c6bab"], ["user_id", "6f168913-e6c5-4fa9-9d5b-d4c4a0d6eb9c"], ["user_id", "3425484b-657d-44e1-ba9d-dae659523a49"], ["user_id", "b0e87545-16ba-4462-b41e-017704f31723"], ["user_id", "7b79628b-61ec-401d-9f72-05187cad3e9e"], ["user_id", "9feb7ac2-e91f-4886-a7a6-a02f2ffe0a91"], ["user_id", "f6c5af79-f5e0-448a-ae4e-d5dafdc11f54"], ["user_id", "b95f91a9-0e6c-48fa-b4f4-03087cf77783"], ["user_id", "f16203b4-27c3-4561-8d3b-3129502bc528"]]
Traceback (most recent call last):
        2: from (irb):15
        1: from (irb):15:in `rescue in irb_binding'
IRB::Abort (abort then interrupt!)

PG will just hang forever. It does not throw an error.

If I set PG::Connection.async_api = false the query completes as expected.

sprite2005 avatar Jul 11 '20 15:07 sprite2005

Possibly the same issue here, but not using Ruby. I am using SQLAlchemy (Python) and the aurora-data-api extensions for pg. For the most part I can query fine, but periodically the simplest of statements hang and subsequently time out (maybe 5% of all operations, not restricted to any one of select, insert, update, etc.).

Access patterns and volumes are very predictable: a rule fires every 2 hours and runs a Lambda function that truncates a table and then inserts ~5000 records (in batches). That's it, no queries, just inserts. On the backside, I have QuickSight querying these datasets at around the same cadence (every 2 hours or so).

This was working perfectly until around May 20th. Prior to then, not a single timeout, and everything else related to volume/traffic had not changed either. Logs, CloudWatch, etc. don't seem to be useful. Really odd.

fitzee avatar Jul 16 '20 03:07 fitzee

We've been running in circles on this on and off for a couple months, and got halfway to switching to a non-serverless solution, when we found that turning off SSL on the client connection made the stalling behavior disappear (that is, connecting with the sslmode: 'disable' option). That's probably not a reasonable step to take for every environment where this is happening, but hopefully it will be helpful for someone!
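
For anyone wanting to try the same thing, the connection option is just the following (a sketch; host, db, and user are placeholders as in the earlier scripts):

conn = PG.connect(
  host:    host,
  dbname:  db,
  user:    user,
  sslmode: 'disable'   # skip TLS entirely; weigh the security trade-off first
)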

aniketschneider avatar Jul 27 '20 17:07 aniketschneider

We've been running in circles on this on and off for a couple months, and got halfway to switching to a non-serverless solution, when we found that turning off SSL on the client connection made the stalling behavior disappear (that is, connecting with the sslmode: 'disable' option). That's probably not a reasonable step to take for every environment where this is happening, but hopefully it will be helpful for someone!

If you don't want to disable SSL you can add 'PG::Connection.async_api = false' in an initializer. It fixed the problem for me.
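
In a Rails app that can be a one-line initializer, roughly like this (the file name is arbitrary; the setter exists in the pg 1.1/1.2 series discussed in this thread):

# config/initializers/pg_sync_api.rb -- hypothetical file name
require 'pg'

# Route exec and friends through libpq's synchronous API instead of the async one.
PG::Connection.async_api = false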

sprite2005 avatar Jul 27 '20 21:07 sprite2005

@sprite2005 This is undocumented so can you tell me what are the consequences of doing this?

Thanks

krazyjakee avatar Aug 12 '20 08:08 krazyjakee

@sprite2005 This is undocumented so can you tell me what are the consequences of doing this?

Thanks

It will use sync_exec instead of async_exec. I've been running with async_api = false in production and on Aurora Serverless for a few months now and have not had any issues since doing that. Docs state this:

#exec is an alias for #async_exec which is almost identical to #sync_exec. #sync_exec is implemented on the simpler synchronous command processing API of libpq, whereas #async_exec is implemented on the asynchronous API and on ruby's IO mechanisms. Both methods ensure that other threads can process while waiting for the server to complete the request, but #sync_exec blocks all signals to be processed until the query is finished. This is most notably visible by a delayed reaction to Control+C. It's not recommended to use explicit sync or async variants but #exec instead, unless you have a good reason to do so.

sprite2005 avatar Aug 26 '20 08:08 sprite2005

@joelvh: What did you end up doing? Did you figure out a solution using Sequel? Could you elaborate a bit?

Thanks!

davidfromdk avatar Oct 19 '20 13:10 davidfromdk

@davidfromdk I ended up using an RDS Aurora instance - abandoned the Serverless flavor for the time being. Have been tracking this thread, but haven't yet had an opportunity to test the latest findings.

joelvh avatar Oct 19 '20 15:10 joelvh