
Add system monitor for flowprocess

Open sfc-gh-clin opened this issue 2 years ago • 14 comments

This PR is a follow-up to the remote kvs change. It makes several changes:

  • Enable the system monitor for flowprocess so that traces include system-level metrics such as CPU usage. Because flowprocess starts with listen port 0, and the port is used to build the trace file name, the trace file must be closed and reopened so that its name contains the process's actual port number. A few tracing helper functions are added for this.
  • The existing code returns Void() from the remote kvs getError when the store is stopped normally. This incorrect behavior has been present from the beginning; it now returns Never() instead. With this change, the isError check in handleIOErrors is unnecessary and is removed, because the kv store getError only returns errors.
  • With remote kvs, when the child kv store process is killed by SIGTERM, there can be a race in which its file locks have not yet been released when we reboot the parent process. The current code throws io_error when it cannot get the lock and logs Sev40. This PR adds a new lock_file_failure error for this scenario and logs SevWarn instead, because we will reboot the kv store to recover from the failure. A knob, REBOOT_KV_STORE_DELAY, adds a wait before the reboot to avoid the race.
  • The current code has no logic to reopen the kv store when rebooting the storage server. This PR adds it by renaming the error please_reboot_remote_kv_store to please_reboot_kv_store. Whenever this error is seen, we assume the kv store has failed and reopen it. In contrast, please_reboot_remote_kv_store used to reboot the whole worker. That logic is removed, and now only the kv store is rebooted.
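
To illustrate the lock race in the third item, here is a minimal sketch of telling "the lock is still held by someone else" apart from a genuine I/O failure. It assumes POSIX `flock` advisory locks; the locking mechanism FoundationDB actually uses may differ. `LockResult` and `tryLockFile` are hypothetical names made up for this example, not identifiers from this PR.

```cpp
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Hypothetical result codes for this sketch; they are not FoundationDB error codes.
enum class LockResult { Acquired, LockFileFailure, IoError };

// Opens `path` (creating it if needed) and tries to take an exclusive,
// non-blocking advisory lock. On success the open descriptor is stored in
// `fdOut`; the caller closes it later to release the lock.
inline LockResult tryLockFile(const std::string& path, int& fdOut) {
    fdOut = -1;
    int fd = ::open(path.c_str(), O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return LockResult::IoError;
    if (::flock(fd, LOCK_EX | LOCK_NB) == 0) {
        fdOut = fd;
        return LockResult::Acquired;
    }
    int err = errno; // save errno before close() can overwrite it
    ::close(fd);
    // EWOULDBLOCK means another open file description still holds the lock,
    // for example a child process that has not finished exiting yet.
    return err == EWOULDBLOCK ? LockResult::LockFileFailure : LockResult::IoError;
}
```

A caller could treat `LockFileFailure` as a warning-level, retryable condition and reserve the severe log level for `IoError`, which mirrors the distinction the description draws between lock_file_failure and io_error.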

The lock_file_failure error is thrown in the Writer's work, which is asynchronous, and is first caught by handleIOErrors, so a simple retry in openKVStore cannot handle this lock failure.
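
The control flow this implies can be sketched in plain C++ rather than Flow. This is only an illustration under stated assumptions: `StoreHandle`, `openStore`, `checkStore`, and `runStoreWithReboot` are hypothetical names, the two exception types merely stand in for the errors named above, and the writer runs synchronously here, with its failure recorded on the side to mimic an error that surfaces after the open call has already returned.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-ins for the errors named in the description.
struct LockFileFailure : std::runtime_error {
    LockFileFailure() : std::runtime_error("lock_file_failure") {}
};
struct PleaseRebootKVStore : std::runtime_error {
    PleaseRebootKVStore() : std::runtime_error("please_reboot_kv_store") {}
};

// Hypothetical store handle: a writer failure is only recorded here,
// so the code that opened the store never sees it as a thrown exception.
struct StoreHandle {
    std::exception_ptr error; // null while the store is healthy
};

inline StoreHandle openStore(const std::function<void()>& writerWork) {
    StoreHandle h;
    try {
        writerWork(); // stands in for work that would really run asynchronously
    } catch (...) {
        h.error = std::current_exception();
    }
    return h; // always returns normally, so there is nothing to retry here
}

// Plays a role loosely similar to an error-handling wrapper: it is the first
// place the recorded failure is observed, and it maps a lock failure to a
// reboot request. Any other recorded exception propagates unchanged.
inline void checkStore(const StoreHandle& h) {
    if (!h.error)
        return;
    try {
        std::rethrow_exception(h.error);
    } catch (const LockFileFailure&) {
        throw PleaseRebootKVStore();
    }
}

// Outer loop: reopen the store after a delay whenever a reboot is requested.
// Returns the number of opens it took to get a healthy store.
inline int runStoreWithReboot(const std::function<void()>& writerWork,
                              std::chrono::milliseconds rebootDelay,
                              int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            checkStore(openStore(writerWork));
            return attempt;
        } catch (const PleaseRebootKVStore&) {
            if (attempt == maxAttempts)
                throw;
            std::this_thread::sleep_for(rebootDelay); // give the old holder time to release the lock
        }
    }
    throw std::logic_error("maxAttempts must be positive");
}
```

In this sketch the retry lives in the outer loop, not in `openStore`, because `openStore` itself never observes the failure. The delay before reopening plays the role the description assigns to REBOOT_KV_STORE_DELAY.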


Passed 20K tests

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • [ ] The PR has a description, explaining both the problem and the solution.
  • [ ] The description mentions which forms of testing were done and the testing seems reasonable.
  • [ ] Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • [ ] This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • [ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

sfc-gh-clin avatar Apr 21 '22 22:04 sfc-gh-clin

Doxense CI Report for Windows 10

  • Commit ID: 2f60bab699e9c38ca71d2bd47a4fb11bc1d77ab4
  • Result: :heavy_check_mark: SUCCEEDED
  • Build Logs (available for 30 days)

fdb-windows-ci avatar Apr 21 '22 23:04 fdb-windows-ci

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 2f60bab699e9c38ca71d2bd47a4fb11bc1d77ab4
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 21 '22 23:04 foundationdb-ci

AWS CodeBuild CI Report for macOS BigSur 11.5.2

  • CodeBuild project: foundationdb-pr-macos
  • Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 26 '22 18:04 foundationdb-ci

Doxense CI Report for Windows 10

  • Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
  • Result: :heavy_check_mark: SUCCEEDED
  • Build Logs (available for 30 days)

fdb-windows-ci avatar Apr 26 '22 19:04 fdb-windows-ci

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 90da4dcae15994b51541fd9534fc6ffb38b749ea
  • Result: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 26 '22 19:04 foundationdb-ci

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 6270e22b2cb17b4ece11dfca87eca13aacc49562
  • Result: FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 26 '22 21:04 foundationdb-ci

Doxense CI Report for Windows 10

  • Commit ID: 6270e22b2cb17b4ece11dfca87eca13aacc49562
  • Result: :heavy_check_mark: SUCCEEDED
  • Build Logs (available for 30 days)

fdb-windows-ci avatar Apr 26 '22 22:04 fdb-windows-ci

AWS CodeBuild CI Report for macOS BigSur 11.5.2

  • CodeBuild project: foundationdb-pr-macos
  • Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 26 '22 23:04 foundationdb-ci

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Apr 27 '22 00:04 foundationdb-ci

Doxense CI Report for Windows 10

  • Commit ID: 65ff16ce2bfaa398bfa162b7b14b5b5b7d101733
  • Result: :heavy_check_mark: SUCCEEDED
  • Build Logs (available for 30 days)

fdb-windows-ci avatar Apr 27 '22 00:04 fdb-windows-ci

Result of foundationdb-pr-macos on macOS BigSur 11.5.2

  • Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
  • Duration 0:43:44
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Aug 10 '22 22:08 foundationdb-ci

Doxense CI Report for Windows 10

  • Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
  • Result: :heavy_check_mark: SUCCEEDED
  • Build Logs (available for 30 days)

fdb-windows-ci avatar Aug 10 '22 23:08 fdb-windows-ci

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
  • Duration 1:50:24
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Aug 10 '22 23:08 foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 128ea6e47cc4a6415396de6a4e5b5edfb59abb9f
  • Duration 2:11:50
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Aug 11 '22 00:08 foundationdb-ci