foundationdb
Add system monitor for flowprocess
This PR is a follow-up to the remote KVS change. It makes several changes:
- Enable the system monitor for flowprocess so that traces include system-level metrics such as CPU usage. Because flowprocess starts with listen port 0, and the port is used to build the trace file name, the trace file has to be closed and reopened so that its name contains the process's actual port number. A few helper functions for tracing were added.
- The existing code returns `Void()` from the remote KVS `getError` when the store is stopped normally. This incorrect behavior has been there from the beginning; `getError` now returns `Never()` instead. With this change, the unnecessary `isError` check in `handleIOErrors` is removed, since the KV store's `getError` only returns errors.
- With remote KVS, when the child KV store process is killed by SIGTERM, there can be a race: because of the delay, file locks may not yet be released when we reboot the parent process. The current code throws `io_error` when it cannot get the lock and logs at Sev40. A new `lock_file_failure` error is added for this scenario and logged at SevWarn instead, since we will reboot the KV store to recover from the failure. The knob `REBOOT_KV_STORE_DELAY` sets how long to wait to avoid the race.
- The current code has no logic to reopen the KV store when rebooting the storage server. This PR adds it by changing the error `please_reboot_remote_kv_store` to `please_reboot_kv_store`. Whenever this error is seen, we assume the KV store failed and reopen it. By contrast, `please_reboot_remote_kv_store` rebooted the whole worker; that logic is now removed, and only the KV store is rebooted.
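The error handling described in the last two bullets can be sketched in plain C++ as a small decision table. This is only an illustration, not the Flow actor code from the PR: `KVStoreError`, `Handling`, and `classify` are hypothetical names, the numeric value 20 for the warning level is an assumption (the description only mentions Sev40), and logging `please_reboot_kv_store` at warning level is also an assumption. The delay argument stands in for the `REBOOT_KV_STORE_DELAY` knob.

```cpp
#include <cassert>

// Hypothetical mirror of the error codes named in the description.
enum class KVStoreError { IoError, LockFileFailure, PleaseRebootKVStore };

// Error = 40 matches "Sev40" in the description; Warn = 20 is an assumption.
enum class Severity { Warn = 20, Error = 40 };

// What the caller should do after seeing an error (hypothetical type).
struct Handling {
    Severity severity;   // level to log at
    bool reopenKVStore;  // reopen only the KV store, not the whole worker
    double delaySeconds; // wait before reopening, to let file locks be released
};

// Maps an error to its handling. rebootKVStoreDelay stands in for the
// REBOOT_KV_STORE_DELAY knob; its value is chosen by the caller.
Handling classify(KVStoreError e, double rebootKVStoreDelay) {
    assert(rebootKVStoreDelay >= 0.0);
    switch (e) {
    case KVStoreError::LockFileFailure:
    case KVStoreError::PleaseRebootKVStore:
        // Recoverable: log a warning, wait, then reopen the KV store.
        return { Severity::Warn, true, rebootKVStoreDelay };
    case KVStoreError::IoError:
        break;
    }
    // Anything else is treated as a real failure and logged at Sev40.
    return { Severity::Error, false, 0.0 };
}
```

For example, `classify(KVStoreError::LockFileFailure, 2.0)` yields a warning-level entry with `reopenKVStore` set and a two-second wait, while `classify(KVStoreError::IoError, 2.0)` yields an error-level entry with no reopen.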
`lock_file_failure` is thrown in the `Writer`'s work, which is asynchronous and is first caught by `handleIOErrors`, so a simple retry in `openKVStore` cannot handle this lock failure.
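A minimal sketch of why a retry around the open call cannot see the failure, using a plain C++ job queue rather than Flow actors. `DeferredWriter`, `LockFileFailure`, and `openSawFailure` are hypothetical names invented for this illustration; the real `Writer` and `handleIOErrors` are more involved.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the lock_file_failure error.
struct LockFileFailure : std::runtime_error {
    LockFileFailure() : std::runtime_error("lock_file_failure") {}
};

// Hypothetical writer: jobs are queued and run later, so any exception
// surfaces in runPending(), not at the place where the job was posted.
class DeferredWriter {
public:
    void post(std::function<void()> job) { jobs_.push(std::move(job)); }

    // Runs all queued jobs. Plays the role of the central error handler:
    // returns the message of the first error seen, or "" if none.
    std::string runPending() {
        std::string firstError;
        while (!jobs_.empty()) {
            std::function<void()> job = std::move(jobs_.front());
            jobs_.pop();
            try {
                job();
            } catch (const std::exception& e) {
                if (firstError.empty()) firstError = e.what();
            }
        }
        return firstError;
    }

private:
    std::queue<std::function<void()>> jobs_;
};

// Returns true if a try/catch wrapped around the open call saw the failure.
bool openSawFailure(DeferredWriter& writer, bool lockHeld) {
    try {
        writer.post([lockHeld] { if (lockHeld) throw LockFileFailure(); });
        return false; // open returns before the writer job has run
    } catch (const LockFileFailure&) {
        return true; // never reached: posting a job does not run it
    }
}
```

In this sketch the `catch` inside `openSawFailure` is never reached, and the error shows up only when `runPending()` executes the job. That corresponds to the error arriving at the central handler first, which is why the recovery is a reopen triggered from there.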
Passed 20K tests
Code-Reviewer Section
The general guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
- [ ] The PR has a description, explaining both the problem and the solution.
- [ ] The description mentions which forms of testing were done and the testing seems reasonable.
- [ ] Every function/class/actor that was touched is reasonably well documented.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
- [ ] This change/bugfix is a cherry-pick from the next younger branch (younger `release-branch` or `main` if this is the youngest branch)
- [ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)
Doxense CI Report for Windows 10
- Commit ID: 2f60bab699e9c38ca71d2bd47a4fb11bc1d77ab4
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for Linux CentOS 7
- CodeBuild project: foundationdb-pr
- Commit ID: 2f60bab699e9c38ca71d2bd47a4fb11bc1d77ab4
- Result: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for macOS BigSur 11.5.2
- CodeBuild project: foundationdb-pr-macos
- Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
- Result: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for Linux CentOS 7
- CodeBuild project: foundationdb-pr
- Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
- Result: FAILED
- Error:
Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for Linux CentOS 7
- CodeBuild project: foundationdb-pr
- Commit ID: 6270e22b2cb17b4ece11dfca87eca13aacc49562
- Result: FAILED
- Error:
Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1
- Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 6270e22b2cb17b4ece11dfca87eca13aacc49562
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for macOS BigSur 11.5.2
- CodeBuild project: foundationdb-pr-macos
- Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
- Result: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
AWS CodeBuild CI Report for Linux CentOS 7
- CodeBuild project: foundationdb-pr
- Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
- Result: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr-macos on macOS BigSur 11.5.2
- Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
- Duration 0:43:44
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
- Duration 1:50:24
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
- Duration 2:11:50
- Result: :white_check_mark: SUCCEEDED
- Error:
N/A
- Build Logs (available for 30 days)