Add physical shard dd core (don't merge)
This PR introduces the PhysicalShard concept to Data Distribution.
The feature is protected by ENABLE_DD_PHYSICAL_SHARD.
ENABLE_DD_PHYSICAL_SHARD relies on SHARD_ENCODE_LOCATION_METADATA.
Please make sure SHARD_ENCODE_LOCATION_METADATA is set when setting ENABLE_DD_PHYSICAL_SHARD.
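The dependency between the two flags can be written down as a small check. This is only a sketch: `DDKnobsSketch`, `knobsAreConsistent`, and `checkKnobs` are hypothetical names for illustration, not the knob classes or validation code in this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the two knobs named above. The real knobs live in
// FoundationDB's knob classes, which are not reproduced here.
struct DDKnobsSketch {
    bool SHARD_ENCODE_LOCATION_METADATA = false;
    bool ENABLE_DD_PHYSICAL_SHARD = false;
};

// True when the combination is allowed: physical shards may only be enabled
// if location metadata encoding is enabled as well.
inline bool knobsAreConsistent(const DDKnobsSketch& k) {
    return !k.ENABLE_DD_PHYSICAL_SHARD || k.SHARD_ENCODE_LOCATION_METADATA;
}

// Debug-build guard that could run once at startup (assumed placement).
inline void checkKnobs(const DDKnobsSketch& k) {
    assert(knobsAreConsistent(k));
}
```

Three of the four combinations pass; only "physical shard on, location metadata off" is rejected.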
The core data structure is PhysicalShardCollection, which is responsible for the creation and maintenance of physical shards in data distribution (a physical shard contains multiple key ranges, aka shards). PhysicalShardCollection has three jobs:
- Create physical shards. Once a physical shard is created, it does not change, except for its metrics.
- Update physical shard metrics. A physical shard metric is updated by shard trackers.
- Transition. If a system with no PhysicalShard concept restarts, all keyRanges are in the anonymousShard. We gradually move the keyRanges out of the anonymousShard until the system enters a state where no anonymousShard is in the system.
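To make the three jobs concrete, here is a minimal sketch of such a collection. Every name in it (`PhysicalShardCollectionSketch`, `kAnonymousShardId`, the method names) is an assumption made for illustration; the actual class in this PR has a different interface and tracks more state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative types only; they are not the types added by this PR.
using KeyRangeSketch = std::pair<std::string, std::string>; // [begin, end)
constexpr uint64_t kAnonymousShardId = 0;                   // hypothetical reserved id

struct PhysicalShardSketch {
    int64_t bytes = 0; // metric, updated by shard trackers
};

class PhysicalShardCollectionSketch {
public:
    // Transition: a key range loaded from a system without physical shards
    // starts out in the anonymous shard.
    void loadLegacyRange(const KeyRangeSketch& r) { owner_[r] = kAnonymousShardId; }

    // Job 1: create a physical shard and return its id.
    uint64_t createPhysicalShard() {
        uint64_t id = nextId_++;
        shards_.emplace(id, PhysicalShardSketch{});
        return id;
    }

    // Job 2: record the latest metric for an existing physical shard.
    void updateMetrics(uint64_t id, int64_t bytes) {
        assert(bytes >= 0);
        shards_.at(id).bytes = bytes;
    }

    // Job 3: move one key range out of the anonymous shard into a real one.
    // Returns false if the range is not anonymous or the destination is unknown.
    bool moveOutOfAnonymous(const KeyRangeSketch& r, uint64_t dest) {
        auto it = owner_.find(r);
        if (it == owner_.end() || it->second != kAnonymousShardId || shards_.count(dest) == 0) {
            return false;
        }
        it->second = dest;
        return true;
    }

    // The transition is finished once this returns false.
    bool hasAnonymousRanges() const {
        for (const auto& entry : owner_) {
            if (entry.second == kAnonymousShardId) return true;
        }
        return false;
    }

    int64_t bytesOf(uint64_t id) const { return shards_.at(id).bytes; }

private:
    std::map<uint64_t, PhysicalShardSketch> shards_;
    std::map<KeyRangeSketch, uint64_t> owner_; // key range -> physical shard id
    uint64_t nextId_ = 1;
};
```

Range ownership is kept in the collection rather than in the shard object, so a shard itself carries nothing mutable besides its metric, matching the first bullet.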
PhysicalShardCollection is initialized when loading iShard in resumeFromShards. PhysicalShardCollection is updated when getting dest teams in dataDistributionRelocator. When a dest team is decided, PhysicalShardCollection picks a physical shard from the team. If the team has no physical shard, PhysicalShardCollection creates a physical shard for the team.
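The "pick a physical shard from the team, or create one if the team has none" step could look roughly like the following. `TeamShardIndexSketch` and `pickOrCreate` are hypothetical names, and returning the first shard of the team is a placeholder policy; the description does not say how the PR chooses among several candidates.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TeamIdSketch = std::string; // illustrative team identifier

struct TeamShardIndexSketch {
    std::map<TeamIdSketch, std::vector<uint64_t>> shardsOfTeam;
    uint64_t nextId = 1;

    // Returns a physical shard id for the chosen destination team,
    // creating a new physical shard when the team has none yet.
    uint64_t pickOrCreate(const TeamIdSketch& team) {
        auto& owned = shardsOfTeam[team];
        if (owned.empty()) {
            owned.push_back(nextId++);
        }
        assert(!owned.empty());
        return owned.front(); // placeholder selection policy (assumption)
    }
};
```

Calling `pickOrCreate` twice for the same team returns the same id, while a team seen for the first time gets a fresh one.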
If the cluster has multiple DCs, PhysicalShardCollection makes a pair of primary and remote teams. Once a pair is created, the pair does not change later. A primary team and its paired remote team share the same physical shard.
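The "paired once, never changed" rule can be illustrated with a map that only ever inserts. `TeamPairingSketch` and `pairOnce` are hypothetical names used for this sketch, and teams are reduced to plain strings.

```cpp
#include <cassert>
#include <map>
#include <string>

class TeamPairingSketch {
public:
    // Returns the remote team paired with `primary`. The first call for a
    // given primary fixes the pair; later calls ignore `candidateRemote`.
    const std::string& pairOnce(const std::string& primary, const std::string& candidateRemote) {
        assert(!candidateRemote.empty());
        // emplace does not overwrite an existing entry, so a pair is immutable.
        return remoteOf_.emplace(primary, candidateRemote).first->second;
    }

    bool hasPair(const std::string& primary) const { return remoteOf_.count(primary) != 0; }

private:
    std::map<std::string, std::string> remoteOf_; // primary team -> remote team
};
```

Because the pair is fixed, any physical shard placed on the primary team implies the remote team that hosts the same physical shard.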
When deciding the dest teams of a shard (key-range), the primary team is determined by getTeam, then PhysicalShardCollection selects a physical shard from the primary team, which decides the remote team. Note that a remote team selected in this way may be an unhealthy team. If this is the case, PhysicalShardCollection selects the remote team by getTeam.
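The fallback for an unhealthy paired remote team might be sketched as below. `chooseRemoteTeam` and `RemoteChoiceSketch` are hypothetical, and the two callbacks stand in for the health check and for getTeam on the remote team collection; the real signatures in the code base differ.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

struct RemoteChoiceSketch {
    std::string team;
    bool fromFallback; // true when the paired remote team could not be used
};

// pairedRemote:  the remote team implied by the selected physical shard, if any.
// isHealthy:     stand-in for the team health check.
// getRemoteTeam: stand-in for calling getTeam on the remote team collection.
inline RemoteChoiceSketch chooseRemoteTeam(const std::optional<std::string>& pairedRemote,
                                           const std::function<bool(const std::string&)>& isHealthy,
                                           const std::function<std::string()>& getRemoteTeam) {
    assert(isHealthy && getRemoteTeam);
    if (pairedRemote && isHealthy(*pairedRemote)) {
        return { *pairedRemote, false };
    }
    // The paired remote team is missing or unhealthy: select a team directly.
    return { getRemoteTeam(), true };
}
```

The `fromFallback` flag is an addition of this sketch; it only makes it visible to the caller that the remote team did not come from the pairing.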
Note that (1) the current design of PhysicalShardCollection assumes that there exist at most two teamCollections (one primary team and one remote team); (2) When ENABLE_DD_PHYSICAL_SHARD is set, the optimization of saving traffic for data move between DCs is disabled.
This PR also fixes a ddstuck issue triggered by restored data moves. For a restored data move, the destination team does not change. As a result, the restored data move may repeatedly move data to a busy destination team. To solve this issue, this PR simply cancels the data move, as in the case when ddstuckCount > 50. Currently, this fix is protected by the feature flag. Further discussion on safety is required.
Correctness test:
ENABLE_DD_PHYSICAL_SHARD off and SHARD_ENCODE_LOCATION_METADATA off:
20220810-201038-zhewang-03e13a48eed92748 compressed=True data_size=36492572 duration=4876426 ended=100002 fail_fast=10 max_runs=100000 pass=100002 priority=100 remaining=0 runtime=0:30:39 sanity=False started=100072 stopped=20220810-204117 submitted=20220810-201038 timeout=5400 username=zhewang
ENABLE_DD_PHYSICAL_SHARD on and SHARD_ENCODE_LOCATION_METADATA on (exclude rocksdb and shardedrocks):
20220810-183204-zhewang-c8bed0053dab6fbb compressed=True data_size=36492597 duration=4718562 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:46:12 sanity=False started=100094 stopped=20220810-191816 submitted=20220810-183204 timeout=5400 username=zhewang
Code-Reviewer Section
The general guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
- [ ] The PR has a description, explaining both the problem and the solution.
- [ ] The description mentions which forms of testing were done and the testing seems reasonable.
- [ ] Every function/class/actor that was touched is reasonably well documented.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
- [ ] This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
- [ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 856e578a9b84c0fec4b5f53c63180ab5332cf8c2
- Duration 0:09:58
- Result: :x: FAILED
- Error:
reference not found for primary source and source version 856e578a9b84c0fec4b5f53c63180ab5332cf8c2 - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 856e578a9b84c0fec4b5f53c63180ab5332cf8c2
- Duration 0:10:06
- Result: :x: FAILED
- Error:
reference not found for primary source and source version 856e578a9b84c0fec4b5f53c63180ab5332cf8c2 - Build Logs (available for 30 days)
Result of foundationdb-pr-macos on macOS BigSur 11.5.2
- Commit ID: 096aaee9fafda624bbbb49cf3a94aefc0c05b310
- Duration 0:31:17
- Result: :x: FAILED
- Error:
Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1 - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 3cf8eb9e55db1e9e0ae109f49d29f31227d473ef
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 3cf8eb9e55db1e9e0ae109f49d29f31227d473ef
- Duration 1:00:45
- Result: :x: FAILED
- Error:
Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1 - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 096aaee9fafda624bbbb49cf3a94aefc0c05b310
- Duration 1:15:50
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 096aaee9fafda624bbbb49cf3a94aefc0c05b310
- Duration 1:29:06
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 3cf8eb9e55db1e9e0ae109f49d29f31227d473ef
- Duration 1:18:37
- Result: :x: FAILED
- Error:
Error while executing command: make -C tests test_ycsb.run. Reason: exit status 2 - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: db8817dbdc3af21793114e59c4952bed956b5ca4
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: db8817dbdc3af21793114e59c4952bed956b5ca4
- Duration 1:00:47
- Result: :x: FAILED
- Error:
Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1 - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: db8817dbdc3af21793114e59c4952bed956b5ca4
- Duration 1:50:22
- Result: :x: FAILED
- Error:
Error while executing command: make -C tests test_ycsb.run. Reason: exit status 2 - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 44e7a436d8b5beb73486247ef0e7986f9762dc62
- Duration 0:05:48
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1 - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 44e7a436d8b5beb73486247ef0e7986f9762dc62
- Duration 0:15:15
- Result: :x: FAILED
- Error:
reference not found for primary source and source version 44e7a436d8b5beb73486247ef0e7986f9762dc62 - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 7fd94803d70212224d5521f559d9587130f66a82
- Duration 0:11:42
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1 - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 7fd94803d70212224d5521f559d9587130f66a82
- Duration 0:20:11
- Result: :x: FAILED
- Error:
Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1 - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 7fd94803d70212224d5521f559d9587130f66a82
- Result: :x: FAILED
- Build Logs (available for 30 days)
Result of foundationdb-pr-macos on macOS BigSur 11.5.2
- Commit ID: 44e7a436d8b5beb73486247ef0e7986f9762dc62
- Duration 0:43:58
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: ab902df8cdf78bdd85cec078f46955e892ff2304
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: ab902df8cdf78bdd85cec078f46955e892ff2304
- Duration 1:06:15
- Result: :x: FAILED
- Error:
Error while executing command: make -C tests compile. Reason: exit status 2 - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: ab902df8cdf78bdd85cec078f46955e892ff2304
- Duration 1:17:09
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 05e1f496188b143649fad0a8d20688777d40c0b7
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 05e1f496188b143649fad0a8d20688777d40c0b7
- Duration 1:11:53
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 05e1f496188b143649fad0a8d20688777d40c0b7
- Duration 1:26:25
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 9247125049e3097362849f47cb7e1bf5ae959483
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr-macos on macOS BigSur 11.5.2
- Commit ID: 9247125049e3097362849f47cb7e1bf5ae959483
- Duration 0:43:45
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 9247125049e3097362849f47cb7e1bf5ae959483
- Duration 0:56:54
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 9247125049e3097362849f47cb7e1bf5ae959483
- Duration 1:44:48
- Result: :x: FAILED
- Error:
Error while executing command: make -C tests test_ycsb.run. Reason: exit status 2 - Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 65a387f2b94709702a83a6d915fec0dd3a94b3d5
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 65a387f2b94709702a83a6d915fec0dd3a94b3d5
- Duration 1:02:30
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A - Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 65a387f2b94709702a83a6d915fec0dd3a94b3d5
- Duration 1:24:39
- Result: :x: FAILED
- Error:
Error while executing command: make -C tests test_ycsb.run. Reason: exit status 2 - Build Logs (available for 30 days)