storage: reject new commands if memory quota exceeded (#16473)

Open ti-chi-bot opened this issue 1 year ago • 4 comments

This is an automated cherry-pick of #16473

This cherry-pick rolls up three PRs:

  1. #16440
  2. #16473
  3. #16482

They are intended to be merged together.

What is changed and how it works?

Issue Number: ref #16234

What's Changed:

Currently, TiKV rejects new writes in the transaction layer if its
pending write bytes exceed a default threshold of 100MB. However, this
approach falls short as the transaction layer transforms a write
request into a Command and executes it as a Future. Both Command and
Future incur memory overhead. Empirical results from tests reveal that
the memory usage of `kv_prewrite` is 20 times larger than its written
bytes.

This commit introduces a memory quota that restricts the transaction
layer's memory usage. This addition acts as a crucial safeguard, serving
as the last resort to prevent TiKV from OOM.
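The idea can be illustrated with a minimal sketch. This is not the code from this PR: `MemoryQuota`, `QuotaExceeded`, and the method names below are hypothetical, and the sketch assumes each command can report an estimated memory footprint before it is scheduled. A new command reserves its estimated bytes up front and is rejected if the reservation would exceed the configured capacity. The bytes are released when the command finishes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical error returned when a reservation would exceed the quota.
#[derive(Debug, PartialEq, Eq)]
pub struct QuotaExceeded;

/// Illustrative byte-counting quota shared by all in-flight commands.
/// The names here are assumptions for this sketch, not TiKV's actual API.
pub struct MemoryQuota {
    capacity: usize,
    in_use: AtomicUsize,
}

impl MemoryQuota {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            in_use: AtomicUsize::new(0),
        }
    }

    /// Try to reserve `bytes`. On failure the counter is left unchanged,
    /// so the caller can reject the command without any cleanup.
    pub fn alloc(&self, bytes: usize) -> Result<(), QuotaExceeded> {
        let mut current = self.in_use.load(Ordering::Relaxed);
        loop {
            let next = match current.checked_add(bytes) {
                Some(n) if n <= self.capacity => n,
                _ => return Err(QuotaExceeded),
            };
            match self.in_use.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Ok(()),
                // Another thread changed the counter; retry with the new value.
                Err(observed) => current = observed,
            }
        }
    }

    /// Release a previous reservation, saturating at zero.
    pub fn free(&self, bytes: usize) {
        let _ = self
            .in_use
            .fetch_update(Ordering::AcqRel, Ordering::Relaxed, |cur| {
                Some(cur.saturating_sub(bytes))
            });
    }

    /// Bytes currently reserved.
    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Relaxed)
    }
}
```

In this sketch a scheduler would call `alloc` with the command's estimated size before enqueueing it, return a "server is busy" style error to the client on `Err(QuotaExceeded)`, and call `free` with the same size once the command completes.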

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • [x] Unit test
  • [x] Manual test

Test Details

The OOM issue in #16234 is hard to reproduce reliably, so I had to change the default configs.

A single-node Cluster with the following configs.

TiKV:

[storage]
# Try not to limit concurrent tasks
scheduler-concurrency = 2097152
# Don’t let blockcache affect memory usage
[storage.block-cache]
capacity = "100MB"

TiDB:

lease = "600s"
token-limit = 100000000
[txn-local-latches]
enabled = false

SET GLOBAL tidb_txn_mode = 'optimistic';
SET GLOBAL tidb_enable_async_commit = off;
SET GLOBAL tidb_enable_1pc = off;

Workload:

# Prepare
mysql> create database tpcc1k;
/root/.tiup/components/bench/v1.12.0/go-tpc \
    tpcc prepare \
    -H 10.2.12.86 -P 31825 \
    -D tpcc1k --warehouses 1000 -T 500

# Run
while true; do { \
    /root/.tiup/components/bench/v1.12.0/go-tpc \
        tpcc run \
        -H 10.2.12.86 -P 31825 \
        -D tpcc1k --warehouses 1000 --time 4s -T 500 & \
    pid=$!; sleep 5; kill -9 $pid; \
} done;

Results (TiKV config and metrics):

OOM if memory-quota is unlimited.

[storage]
# 128GB effectively disables the memory quota.
memory-quota = "128GB"

[Metrics: two images]

Does not OOM if memory-quota is configured properly.

[storage]
memory-quota = "128MB"

[Metrics: one image]

Release note

Fix an issue that the txn scheduler may cause OOM if TiKV writes too slowly.

ti-chi-bot avatar May 07 '24 05:05 ti-chi-bot

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Connor1996

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment. After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review. Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot[bot] avatar May 07 '24 05:05 ti-chi-bot[bot]

/test

overvenus avatar May 22 '24 07:05 overvenus

@overvenus: The /test command needs one or more targets. The following commands are available to trigger optional jobs:

  • /debug pull-unit-test

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ti-chi-bot[bot] avatar May 22 '24 07:05 ti-chi-bot[bot]

PR needs rebase.

ti-chi-bot[bot] avatar Jul 16 '24 21:07 ti-chi-bot[bot]

This cherry pick PR is for a release branch and has not yet been approved by triage owners. Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first be approved by the approvers.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.

ti-chi-bot[bot] avatar Jul 22 '24 02:07 ti-chi-bot[bot]

@ti-chi-bot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test 85e80d72b020a7907ccf6dc8e5d76ec7c96eb78b link true /test pull-unit-test

Full PR test history. Your PR dashboard.

ti-chi-bot[bot] avatar Sep 25 '24 10:09 ti-chi-bot[bot]