foundationdb icon indicating copy to clipboard operation
foundationdb copied to clipboard

Added new commit statistics metrics

Open oleg68 opened this issue 1 year ago • 20 comments

Problem statement

Now it is hard to tune a fdb cluster for a write-intensive workload.

Description

While tuning a fdb cluster with a write-intensive application often the bottleneck is the commit latency: when trying to parallel degree of transactions payload, the commit latency grows and prevents increasing the transaction throughput.

There are lots of conditions and knobs influencing the commit latency: number of commit proxies, number of resolvers, number of tlog processes, commit batching knobs: MAX_COMMIT_BATCH_INTERVAL, COMMIT_TRANSACTION_BATCH_INTERVAL_MAX, COMMIT_TRANSACTION_BATCH_INTERVAL_SMOOTHER_ALPHA and others.

But for now, there is no any information, where is the root cause of the high commit latency, so it is unclear, what is to be changed.

Proposal

  • To collect and to log latency statistics from parts of commit workflow:
    • Waiting for a batch
    • Preresolution (allocating a commit version)
    • Resolution
    • Postresolution
    • Pushing to TLog
    • Replying
  • To collect and log the batch size statistics
    • The number of transactions in one batch
    • The total bytes in the transaction batch

PR content

This PR implements this proposal: the following new metrics are logged and exposed in status json:

  • CommitBatchTransactions (commit_batch_transactions) - the number of transactions in one batch
  • CommitBatchBytes (commit_batch_bytes) - the total size of one commit batch in bytes
  • CommitBatchingWaiting (commit_batching_waiting) - the time while the transaction is waiting for the batch becomes ready
  • CommitPreresolutionLatency (commit_preresolution_latency) - the time of the Preresolution phase
  • CommitResolutionLatency (commit_resolution_latency) - the time of the Resolution phase
  • CommitPostResolutionLatency (commit_resolution_latency) - the time of the Postresolution phase
  • CommitTLogLoggingLatency (commit_tlog_logging_latency) - the time of the TlogLogging phase
  • CommitReplyLatency (commit_reply_latency) - the time of Reply phase

The sum of the added *_latency_mean metrics should be equal to the commit_latency_mean metrics that already exists

oleg68 avatar Oct 18 '23 16:10 oleg68

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:12:41
  • Result: :x: FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:16:01
  • Result: :x: FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:17:40
  • Result: :x: FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:17:48
  • Result: :x: FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:18:14
  • Result: :x: FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
  • Duration 0:22:37
  • Result: :x: FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 18 '23 16:10 foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 0:20:46
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 19 '23 07:10 foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 0:31:37
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 19 '23 07:10 foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 0:46:17
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 19 '23 08:10 foundationdb-ci

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 1:06:50
  • Result: :x: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 19 '23 08:10 foundationdb-ci

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 1:07:00
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 19 '23 08:10 foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
  • Duration 1:24:26
  • Result: :x: FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci avatar Oct 19 '23 08:10 foundationdb-ci

Seems two first tests faled due an environment problems:

Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
errors pretty printing info

oleg68 avatar Oct 19 '23 11:10 oleg68

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 0:21:28
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 31 '23 12:10 foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 0:33:34
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 31 '23 12:10 foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 0:47:04
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 31 '23 12:10 foundationdb-ci

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 1:17:15
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 31 '23 13:10 foundationdb-ci

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 1:22:48
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci avatar Oct 31 '23 13:10 foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
  • Duration 1:29:57
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci avatar Oct 31 '23 13:10 foundationdb-ci

@sbodagala can you run a correctness for this PR before merging?

Ran a correctness test (with 100000 simulation tests). The test run stopped after doing 99994 test runs, at which point 10 simulation tests have failed (not all of these failures may have been caused by this change set though).

ended=99994 fail=10 fail_fast=10 max_runs=100000

Majority of the failures are on tests in "tests/restarting/from_7.3.0/" directory:

SourceVersion="9cbaacab35d8c1e1230497afd6fec9fbd177cf51"

WillRestart="0", RandomSeed="1062210747", BuggifyEnabled="1", TestFile="tests/fast/RandomUnitTests.toml"

WillRestart="1", RandomSeed="1380708830", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"

WillRestart="0",  RandomSeed="1380708831", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"

WillRestart="1", RandomSeed="2128522034", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"

WillRestart="0", RandomSeed="2128522035", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"

 WillRestart="0", RandomSeed="398579556", BuggifyEnabled="0", TestFile="tests/fast/BlobGranuleVerifySmall.toml"

 WillRestart="1", RandomSeed="110402234", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="110402235", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="3174333056", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="3174333057", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="869995033", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="869995034", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="630498846", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-1.txt"

WillRestart="0", RandomSeed="630498847", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-2.txt"

sbodagala avatar Nov 03 '23 16:11 sbodagala