foundationdb
foundationdb copied to clipboard
Added new commit statistics metrics
Problem statement
Now it is hard to tune a fdb cluster for a write-intensive workload.
Description
While tuning a fdb cluster with a write-intensive application often the bottleneck is the commit latency: when trying to parallel degree of transactions payload, the commit latency grows and prevents increasing the transaction throughput.
There are lots of conditions and knobs influencing the commit latency: number of commit proxies, number of resolvers, number of tlog processes, commit batching knobs: MAX_COMMIT_BATCH_INTERVAL, COMMIT_TRANSACTION_BATCH_INTERVAL_MAX, COMMIT_TRANSACTION_BATCH_INTERVAL_SMOOTHER_ALPHA and others.
But for now, there is no any information, where is the root cause of the high commit latency, so it is unclear, what is to be changed.
Proposal
- To collect and to log latency statistics from parts of commit workflow:
- Waiting for a batch
- Preresolution (allocating a commit version)
- Resolution
- Postresolution
- Pushing to TLog
- Replying
- To collect and log the batch size statistics
- The number of transactions in one batch
- The total bytes in the transaction batch
PR content
This PR implements this proposal: the following new metrics are logged and exposed in status json
:
- CommitBatchTransactions (commit_batch_transactions) - the number of transactions in one batch
- CommitBatchBytes (commit_batch_bytes) - the total size of one commit batch in bytes
- CommitBatchingWaiting (commit_batching_waiting) - the time while the transaction is waiting for the batch becomes ready
- CommitPreresolutionLatency (commit_preresolution_latency) - the time of the Preresolution phase
- CommitResolutionLatency (commit_resolution_latency) - the time of the Resolution phase
- CommitPostResolutionLatency (commit_resolution_latency) - the time of the Postresolution phase
- CommitTLogLoggingLatency (commit_tlog_logging_latency) - the time of the TlogLogging phase
- CommitReplyLatency (commit_reply_latency) - the time of Reply phase
The sum of the added *_latency_mean metrics should be equal to the commit_latency_mean metrics that already exists
Result of foundationdb-pr-clang-ide on Linux CentOS 7
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:12:41
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:16:01
- Result: :x: FAILED
- Error:
Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-clang on Linux CentOS 7
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:17:40
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:17:48
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
- Cluster Test Logs zip file of the test logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:18:14
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos on macOS Ventura 13.x
- Commit ID: 03cf4dcfcbe09a455cfa9d00c87b14ec021328ee
- Duration 0:22:37
- Result: :x: FAILED
- Error:
Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-clang-ide on Linux CentOS 7
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 0:20:46
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 0:31:37
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos on macOS Ventura 13.x
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 0:46:17
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-clang on Linux CentOS 7
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 1:06:50
- Result: :x: FAILED
- Error:
Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 1:07:00
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 8ae2ad1f0b52ce15deb4ee321839f14127bd2929
- Duration 1:24:26
- Result: :x: FAILED
- Error:
Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
- Cluster Test Logs zip file of the test logs (available for 30 days)
Seems two first tests faled due an environment problems:
Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
errors pretty printing info
Result of foundationdb-pr-clang-ide on Linux CentOS 7
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 0:21:28
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 0:33:34
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-macos on macOS Ventura 13.x
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 0:47:04
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-clang on Linux CentOS 7
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 1:17:15
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 1:22:48
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: a6b8f5437919e806c3b443329d190bc7a55b2df1
- Duration 1:29:57
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Log terminal output (available for 30 days)
- Build Workspace zip file of the working directory (available for 30 days)
- Cluster Test Logs zip file of the test logs (available for 30 days)
@sbodagala can you run a correctness for this PR before merging?
Ran a correctness test (with 100000 simulation tests). The test run stopped after doing 99994 test runs, at which point 10 simulation tests have failed (not all of these failures may have been caused by this change set though).
ended=99994 fail=10 fail_fast=10 max_runs=100000
Majority of the failures are on tests in "tests/restarting/from_7.3.0/" directory:
SourceVersion="9cbaacab35d8c1e1230497afd6fec9fbd177cf51"
WillRestart="0", RandomSeed="1062210747", BuggifyEnabled="1", TestFile="tests/fast/RandomUnitTests.toml"
WillRestart="1", RandomSeed="1380708830", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"
WillRestart="0", RandomSeed="1380708831", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"
WillRestart="1", RandomSeed="2128522034", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"
WillRestart="0", RandomSeed="2128522035", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"
WillRestart="0", RandomSeed="398579556", BuggifyEnabled="0", TestFile="tests/fast/BlobGranuleVerifySmall.toml"
WillRestart="1", RandomSeed="110402234", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"
WillRestart="0", RandomSeed="110402235", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"
WillRestart="1", RandomSeed="3174333056", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"
WillRestart="0", RandomSeed="3174333057", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"
WillRestart="1", RandomSeed="869995033", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"
WillRestart="0", RandomSeed="869995034", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"
WillRestart="1", RandomSeed="630498846", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-1.txt"
WillRestart="0", RandomSeed="630498847", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-2.txt"