
Ruby queries hang when returning too much data

Open harelsafra opened this issue 1 year ago • 18 comments

Ruby queries hang when returning "too much" data. I put this in quotes since we don't understand what "too much" is, but when we limit the result set the queries return. While the Ruby process hangs, the session is idle in the database.

The same queries work on Postgres.

For example, with the products table

CREATE TABLE public.products (
	id INT8 NOT NULL DEFAULT 27000000000:::INT8 + unordered_unique_rowid(),
	order_id INT8 NULL,
	merchant_id INT8 NULL,
	quantity INT8 NULL,
	price FLOAT8 NULL,
	title VARCHAR(255) NULL,
	product_id VARCHAR(255) NULL,
	created_at TIMESTAMP NOT NULL,
	updated_at TIMESTAMP NOT NULL,
	orig_price FLOAT8 NULL,
	category VARCHAR(255) NULL,
	sub_category VARCHAR(255) NULL,
	section VARCHAR(255) NULL,
	requires_shipping BOOL NULL,
	date TIMESTAMP NULL,
	brand VARCHAR(255) NULL,
	seller_id VARCHAR(255) NULL,
	product_type VARCHAR(255) NULL,
	norm_title VARCHAR(255) NULL,
	delivered_to VARCHAR(255) NULL,
	delivered_at TIMESTAMP NULL,
	parsed_title VARCHAR(255) NULL,
	is_gift_wrapped BOOL NULL,
	message VARCHAR(3724) NULL,
	fulfillment_id INT8 NULL,
	sku VARCHAR NULL,
	registry_type VARCHAR NULL,
	CONSTRAINT products_pk PRIMARY KEY (id ASC),
	INDEX products_order_id_idx_new (order_id ASC),
	INDEX products_merchant_id_idx (merchant_id ASC),
	INDEX products_updated_at_idx (updated_at ASC)
)
a = Product.find_by_sql("select * from products where order_id=38313780972592 limit 114").size
# => 114

Product.find_by_sql("select * from products where order_id=38313780972592 limit 115").size
# <<<<<THIS HANGS>>>>>>

harelsafra avatar Jul 22 '24 15:07 harelsafra

ruby 3.1.4
activerecord (6.1.7.3)
pg (1.5.3)

harelsafra avatar Jul 22 '24 15:07 harelsafra

@harelsafra are you using this adapter or upstream activerecord? Does the issue happen with this adapter specifically?

dikshant avatar Jul 22 '24 15:07 dikshant

Hi @dikshant, we're currently using upstream activerecord, but we tested activerecord-cockroachdb-adapter and we're seeing the same behavior.

harelsafra avatar Jul 22 '24 15:07 harelsafra

This issue seems to happen when sql.defaults.results_buffer.size is set to 500KiB. This setting was added in 23.2.7. Setting it to a lower value like 16KiB does not cause the hang.
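
For reference, a minimal sketch of how to check and temporarily lower the setting from a Rails console, assuming the connected SQL user is allowed to change cluster settings:

# Sketch: inspect and temporarily lower the buffer size from a Rails console.
# Assumes the connected SQL user may modify cluster settings.
conn = ActiveRecord::Base.connection
puts conn.select_value("SHOW CLUSTER SETTING sql.defaults.results_buffer.size")
conn.execute("SET CLUSTER SETTING sql.defaults.results_buffer.size = '16KiB'")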

rimadeodhar avatar Jul 22 '24 18:07 rimadeodhar

To clarify, the sql.defaults.results_buffer.size setting has been in CockroachDB since v19.1. (see https://github.com/cockroachdb/cockroach/pull/32366)

The recent change that happened is in v24.1.1, where the default value of this setting was changed from 16KiB to 512KiB. (see https://www.cockroachlabs.com/docs/releases/v24.1#v24-1-1-sql-language-changes)

We are still investigating why this causes the query result to hang.

rafiss avatar Jul 22 '24 21:07 rafiss

To assist with the investigation, it would help tremendously if you could share instructions on how to reproduce the issue. In our own testing, we have not yet seen the behavior you describe. We used a scenario along the lines of:

ActiveRecord::Base.connection.execute("
  CREATE TABLE public.products (
	id INT8 NOT NULL DEFAULT 27000000000:::INT8 + unordered_unique_rowid(),
	order_id INT8 NULL
	-- other columns elided
)")

class Product < ActiveRecord::Base
end

128.times { Product.create!(order_id: 38313780972592) }

puts "Running 114"
p Product.find_by_sql("select * from products where order_id=38313780972592 limit 114").size

puts "Running 115"
p Product.find_by_sql("select * from products where order_id=38313780972592 limit 115").size

Ideally, the instructions you share should be a self-contained example, so we can follow them starting from an empty cluster. Also, just to confirm, are you using CockroachDB v23.2.3?

Thank you!

rafiss avatar Jul 22 '24 22:07 rafiss

Here's one more thing to try, which should be easier than coming up with a self-contained example that reproduces the problem.

I read more about concurrency and deadlocks in these Rails docs.

It has a debugging tip for investigating deadlocks/hangs:

If your application is deadlocking and you think the Load Interlock may be involved, you can temporarily add the ActionDispatch::DebugLocks middleware to config/application.rb:

config.middleware.insert_before Rack::Sendfile,
                                  ActionDispatch::DebugLocks

If you then restart the application and re-trigger the deadlock condition, /rails/locks will show a summary of all threads currently known to the interlock, which lock level they are holding or awaiting, and their current backtrace.

Generally a deadlock will be caused by the interlock conflicting with some other external lock or blocking I/O call.

Can you try using this setting then re-run your example that causes the hanging query and share the results from /rails/locks?
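
For example, roughly like this from a second session while the query is hanging (assuming the app is reachable locally on port 3000; adjust host and port as needed):

# Sketch: while the query is hanging in another session, dump the interlock summary.
# Assumes the Rails app is reachable on localhost:3000 with DebugLocks enabled.
require "net/http"
puts Net::HTTP.get(URI("http://localhost:3000/rails/locks"))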

rafiss avatar Jul 23 '24 05:07 rafiss

Hi @rafiss I worked on it and got a reproducer working with obfuscated data, on a different table. This way I hope you can debug it more efficiently.

CREATE TABLE public.user_shops (
    id INT8 NOT NULL DEFAULT nextval('public.user_shops_id_seq'::REGCLASS),
    user_id INT8 NULL,
    merchant_id INT8 NULL,
    notifications JSONB NULL,
    merchant_role_id INT8 NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    status VARCHAR NOT NULL DEFAULT 'inactive':::STRING::VARCHAR,
    activation_token VARCHAR NULL,
    activated_at TIMESTAMP NULL,
    CONSTRAINT user_shops_pkey PRIMARY KEY (id ASC),
    INDEX index_user_shops_on_activation_token (activation_token ASC),
    INDEX index_user_shops_on_merchant_id (merchant_id ASC),
    INDEX index_user_shops_on_merchant_role_id (merchant_role_id ASC)
)

user_shops__202407230902.sql.zip

The attached zip has insert statements for the needed data.

If you run SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 193, it hangs. Limiting to 192 or 194 works:

UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 192").size
# => 192

UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 194").size

# => 194
UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE user_shops.merchant_id = 24813 order by id limit 193").size

# <<<<<<<<<<<< HANGS >>>>>>>>>>

harelsafra avatar Jul 23 '24 06:07 harelsafra

Prefixing the SQL with CREATE SEQUENCE public.user_shops_id_seq;, I don't see any issue by default in either v23.2.5 or v24.1, with either 16 KiB or 512 KiB buffer sizes. A self-contained example would be very helpful indeed, as there are many settings in both CRDB and Ruby that might be the root cause of your issue. Here's a good template to start with.

BuonOmo avatar Jul 23 '24 14:07 BuonOmo

Thanks for sharing more details @harelsafra. As part of your reproducer example, can you please also share the commands you use to connect to CockroachDB and the definition of the UserShop model?

The goal is to have an example that matches as closely as possible the exact steps you are taking when you encounter the problem, so that when we debug this, we can be sure that we are experiencing the same behavior that you are. Otherwise, we will need to guess at some of the details, and we may not do the same thing that you are. (I also tried with your latest example, but did not encounter the problem.)

The link shared by @BuonOmo is an excellent starting point.

rafiss avatar Jul 23 '24 21:07 rafiss

This is the section from database.yml:

webapp:
  adapter: postgresql
  port: 5432
  encoding: unicode
  host: <%= ENV['DB_WEBAPP_HOST'] %>
  database: <%= ENV['DB_WEBAPP_DATABASE'] %>
  username: <%= ENV["DB_WEBAPP_USERNAME"] %>
  password: <%= ENV["DB_WEBAPP_PASSWORD"] %>
  pool: <%= (ENV['DB_CONNECTION_POOL_SIZE'] || 3).to_i %>
  statement_limit: 100
  connect_timeout: 2
  checkout_timeout: 2
  prepared_statements: true
  variables:
    statement_timeout: <%= ENV['DATABASE_STATEMENT_TIMEOUT'] || case ENV['APP_TYPE']; when 'webapp'; '30s'; when 'webhooks'; '5s'; when 'webhooks_sync'; '1s'; when 'webhooks-queue-worker'; '5s'; when 'adminserver'; '120s'; else '30s' end %>

The table's model calls other stuff and isn't clean enough to share.

I must confess that I'm a Rails noob. I'll try to get the simple example that @BuonOmo sent working.

We're also trying in parallel to get deadlock debugging to work.

harelsafra avatar Jul 23 '24 21:07 harelsafra

Could you show the output of:

puts({
	DB_CONNECTION_POOL_SIZE: ENV['DB_CONNECTION_POOL_SIZE'],
	DATABASE_STATEMENT_TIMEOUT: ENV['DATABASE_STATEMENT_TIMEOUT'],
	APP_TYPE: ENV['APP_TYPE']
})

I'm not sure the deadlock debugging will yield much, unfortunately, as it is a single query running. You could also make sure only one connection is involved by setting DB_CONNECTION_POOL_SIZE=1 in your env.
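
If it helps, you can confirm the pool size ActiveRecord actually ended up with from the console, something like:

# Sketch: print the configured connection pool size.
puts ActiveRecord::Base.connection_pool.size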

BuonOmo avatar Jul 24 '24 09:07 BuonOmo

Attached

puts({
 DB_CONNECTION_POOL_SIZE: ENV['DB_CONNECTION_POOL_SIZE'],
 DATABASE_STATEMENT_TIMEOUT: ENV['DATABASE_STATEMENT_TIMEOUT'],
 APP_TYPE: ENV['APP_TYPE']
})
{:DB_CONNECTION_POOL_SIZE=>nil, :DATABASE_STATEMENT_TIMEOUT=>nil, :APP_TYPE=>"adminserver"}

I also made some progress in creating a reproducer: I used code similar to the one in https://github.com/rails/rails/blob/main/guides/bug_report_templates/active_record.rb (attached zip).
When it connects to our 3-node staging cluster (v23.2.3), it hangs.
When it connects to a local Docker container running on my Mac (v23.2.3), it works.
When it connects to a serverless cluster in CockroachDB Cloud (v23.2.7), it works.

I'm trying to get a local v23.2.3 cluster running to test on it.

active_record.rb.zip
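
For anyone who wants to try the same thing without the zip, a simplified sketch along those lines (DATABASE_URL is assumed to point at the cluster under test, and the table is reduced to the relevant column):

# Simplified sketch of a single-file reproducer.
# Assumes DATABASE_URL points at the CockroachDB cluster under test.
require "active_record"
require "logger"

ActiveRecord::Base.establish_connection(ENV.fetch("DATABASE_URL"))
ActiveRecord::Base.logger = Logger.new($stdout)

ActiveRecord::Schema.define do
  create_table :user_shops, force: true do |t|
    t.bigint :merchant_id
    t.timestamps
  end
end

class UserShop < ActiveRecord::Base; end

300.times { UserShop.create!(merchant_id: 24813) }

[192, 194, 193].each do |limit|
  puts "Running limit #{limit}"
  p UserShop.find_by_sql("SELECT user_shops.* FROM user_shops WHERE merchant_id = 24813 order by id limit #{limit}").size
end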

harelsafra avatar Jul 24 '24 11:07 harelsafra

The reproduction you sent uses an old version of the adapter, and doesn't specify which one in the inline Gemfile. Could you specify which version you are using? And please try the latest possible, of course.

I tried to reproduce your bug with the 6.1.11 version of this adapter, so far without result. I think I am also misunderstanding this:

When it connects to our 3 node staging cluster v.23.2.3 it hangs

Are you sharding from the Rails client? And if so, how are you doing it? Or are the three nodes accessed with only one URI?

BuonOmo avatar Aug 08 '24 13:08 BuonOmo

Hi @BuonOmo, we're using activerecord (6.1.7.3). Unfortunately this can't be changed, because upgrading Rails is a big effort that isn't planned at this stage.

The 3 nodes are accessed through an AWS network load balancer. The NLB connects the client to a single server in the cluster

harelsafra avatar Aug 11 '24 11:08 harelsafra

@harelsafra I'm asking about the adapter version (i.e. the gem activerecord-cockroachdb-adapter). Although I'd strongly advise updating rails to 6.1.11 (those are patches, so no interface changes, only security fixes...).

BuonOmo avatar Aug 12 '24 14:08 BuonOmo

The 3 nodes are accessed through an AWS network load balancer. The NLB connects the client to a single server in the cluster

@harelsafra Can you try to see if you can reproduce the problem if you make your code connect directly to a CockroachDB node, and bypass the NLB? This will help us determine if the problem only occurs while using an NLB.
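
For example, something like this from a console session, where the host is a placeholder for one of your CockroachDB nodes and the credentials are reused from your database.yml:

# Sketch: point ActiveRecord at a single node directly, bypassing the NLB.
# "crdb-node-1.internal" is a placeholder for one node's address.
ActiveRecord::Base.establish_connection(
  adapter: "postgresql",
  host: "crdb-node-1.internal",
  port: 26257,
  database: ENV["DB_WEBAPP_DATABASE"],
  username: ENV["DB_WEBAPP_USERNAME"],
  password: ENV["DB_WEBAPP_PASSWORD"]
)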

rafiss avatar Aug 12 '24 15:08 rafiss

This issue in the ruby-pg repo might also be related. It describes a situation where the driver can hang. https://github.com/ged/ruby-pg/issues/583

To test if that is at play here, could you try setting async_api to false as indicated in that issue?
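
Something along these lines in an initializer, before any queries run (based on the workaround described in that issue; verify against the pg gem version you have installed):

# Sketch: force the pg gem's synchronous API, per the workaround in ged/ruby-pg#583.
# Check that PG::Connection.async_api is available in your pg version before relying on it.
require "pg"
PG::Connection.async_api = false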

rafiss avatar Aug 19 '24 17:08 rafiss

Closing since we're missing a reproduction. Please reopen if you have one

BuonOmo avatar Nov 19 '25 10:11 BuonOmo