
CockroachDB may experience sudden high latency across the entire cluster during write operations.

Open wuhuua opened this issue 1 year ago • 4 comments

Describe the problem

I deployed a 50-node cluster running a self-compiled build of CockroachDB v24.2.0-dev. When running a write batch against the cluster, I found that latency can suddenly spike across the entire cluster.

[WeCom screenshot 61ff9d75-b93b-4e40-8d23-20efae39de88]

There was no network problem during the writes, so I wonder why these sudden latency spikes occur.

Steps to start the cluster and run the write workload:

  1. Set up the CockroachDB cluster:
     /symmetricdbserver start --certs-dir=/local/certs --store=/symmetricdb/symmetricdb-data --listen-addr {{env "attr.unique.network.ip-address"}}:26259 --http-addr 0.0.0.0:8081 --join 0.symmetricdb-cluster-secure.service.consul:26259
  2. Send SQL ... / CLI command ... The write batch SQL executes in a transaction, as follows:
# In a transaction
INSERT INTO common_controls(created_at, updated_at, deleted_at, uin, bind_new_card_errcode, final_support_bank_list,
bank_limit_priority, id)
   VALUES (_, __more__) RETURNING id, id
INSERT INTO pay_methods(created_at, updated_at, deleted_at, uin, pay_method_type, account_type, bind_serial,
bank_type, pay_method_name, logo_url, default_card_setting_state, default_favor_compose_id, account_id,
is_default_pay_method, pay_method_color_type, id)
   VALUES (_, __more__) RETURNING id, id

Environment:

  • CockroachDB version 24.2.0-dev
  • Server OS: Linux/Centos
  • Client app: Gorm

Jira issue: CRDB-41447

wuhuua avatar Aug 19 '24 03:08 wuhuua

Hi @wuhuua, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Aug 19 '24 03:08 blathers-crl[bot]

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-foundations (found keywords: Gorm)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Aug 19 '24 03:08 blathers-crl[bot]

Hi @wuhuua, thanks for the report. To help me investigate, it would be helpful if you could gather some information for me:

  • A tsdump for the period of the workload run. You can follow the instructions here to collect one.
  • A statement bundle for a high-latency query, which you can collect by following the instructions here and setting a latency threshold.

You can send the results to me privately via this link.

DrewKimball avatar Aug 28 '24 04:08 DrewKimball

tsdump sent, with a detailed description

wuhuua avatar Aug 29 '24 04:08 wuhuua

It looks like CPU is intermittently reaching ~100% on individual nodes. KV requests to a high-utilization node are delayed, and since we're bulk-inserting, a given SQL statement is pretty likely to hit the slow node.
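To illustrate why a bulk insert is likely to touch the slow node, here is a rough back-of-the-envelope sketch. It assumes (this is not stated in the thread) that each KV request issued by a statement lands on a node chosen uniformly and independently at random; real leaseholder placement is not uniform, so treat the numbers as an intuition aid only. The function name and parameters are hypothetical.

```python
def p_hits_slow_node(num_nodes: int, requests: int, slow_nodes: int = 1) -> float:
    """Probability that at least one of `requests` KV requests lands on a slow node.

    Assumption (illustrative only): every request independently picks one of
    `num_nodes` nodes uniformly at random, and `slow_nodes` of them are slow.
    """
    if num_nodes <= 0 or requests < 0 or not 0 <= slow_nodes <= num_nodes:
        raise ValueError("invalid cluster or request parameters")
    # Chance that a single request avoids every slow node.
    p_fast = 1 - slow_nodes / num_nodes
    # All requests must avoid the slow nodes for the statement to stay fast.
    return 1 - p_fast ** requests


if __name__ == "__main__":
    # 50 nodes as in the report, one of them saturated.
    for n_requests in (1, 10, 50, 100):
        print(n_requests, round(p_hits_slow_node(50, n_requests), 3))
```

Under these assumptions a single-request statement hits the slow node 2% of the time, while a statement that fans out to many nodes hits it far more often, and a statement is only as fast as its slowest request.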

[Four screenshots, taken 2024-08-29 at 12:09:59 PM, 12:10:16 PM, 12:10:31 PM, and 12:10:47 PM]

DrewKimball avatar Aug 29 '24 18:08 DrewKimball

A good next step would be to try and figure out why CPU utilization is so high on certain nodes. We should be collecting CPU profiles automatically, which you can access via the logging directory cockroach-data/logs/pprof_dump. Let's grab the profiles for several different nodes to make sure we capture an interesting period.
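As a small convenience for gathering those profiles, the following sketch lists the most recently modified files in a directory. The only path taken from the thread is cockroach-data/logs/pprof_dump; the helper name `newest_files` is hypothetical, and the sketch makes no assumption about how the profile files are named.

```python
from pathlib import Path


def newest_files(dump_dir, limit=5):
    """Return up to `limit` regular files in `dump_dir`, newest first by mtime."""
    files = [p for p in Path(dump_dir).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    # Path from the thread; adjust it to match the node's --store location.
    dump_dir = Path("cockroach-data/logs/pprof_dump")
    if dump_dir.is_dir():
        for path in newest_files(dump_dir):
            print(path)
    else:
        print(f"{dump_dir} not found on this machine")
```

Running it on several nodes and comparing file timestamps against the latency spikes would help pick profiles that cover an interesting period.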

I also have a few questions about your hardware/workload:

  • Is there a single CRDB node per instance?
  • How many vCPUs are there per instance?
  • How many rows are inserted per statement/transaction? Does this number vary?

DrewKimball avatar Aug 29 '24 18:08 DrewKimball

@wuhuua I'm going to close this issue for now. Please feel free to open a new issue (or reply to this issue) if you have further questions.

mgartner avatar Oct 31 '24 15:10 mgartner