
Block computation result upload Retry implementation

Open Tonix517 opened this issue 2 years ago • 2 comments

#2273 - This PR addresses the problem described in that issue: a ComputationResult upload sometimes failed and was never retried, which caused data loss in GCP.

Design

According to issue #2273, this problem usually occurred when the EN process crashed or rebooted abruptly before all computation result uploads had succeeded. To address this, we need to store the upload status and let the EN pick up unfinished uploads for retry at boot. The basic idea of this PR is simple:

  1. After a ComputationResult is ready, we store an entry in the local BadgerDB, keyed by the given ExecutionDataID, that marks its upload status as false
  2. When the upload completes successfully, the corresponding entry in BadgerDB is replaced with true
  3. If the upload fails for certain ComputationResult items, the EN process picks them up and retries the upload at reboot.
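
The three steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `StatusStore` is an in-memory stand-in for the BadgerDB status entries, and `ResultID`, `UploadFunc`, `UploadWithStatus` and `RetryPending` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ResultID stands in for the ExecutionDataID used as the status key (hypothetical type).
type ResultID string

// errUploadFailed simulates a failed upload in this sketch.
var errUploadFailed = errors.New("upload failed")

// StatusStore is a hypothetical in-memory stand-in for the BadgerDB status entries.
type StatusStore struct {
	mu       sync.Mutex
	uploaded map[ResultID]bool
}

func NewStatusStore() *StatusStore {
	return &StatusStore{uploaded: make(map[ResultID]bool)}
}

// Set records whether the result with the given ID has been uploaded.
func (s *StatusStore) Set(id ResultID, done bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.uploaded[id] = done
}

// Pending returns the IDs whose upload status is still false.
func (s *StatusStore) Pending() []ResultID {
	s.mu.Lock()
	defer s.mu.Unlock()
	var ids []ResultID
	for id, done := range s.uploaded {
		if !done {
			ids = append(ids, id)
		}
	}
	return ids
}

// UploadFunc is a hypothetical uploader; it returns an error when the upload fails.
type UploadFunc func(id ResultID) error

// UploadWithStatus covers steps 1 and 2: mark false, upload, mark true on success.
func UploadWithStatus(store *StatusStore, upload UploadFunc, id ResultID) error {
	store.Set(id, false)
	if err := upload(id); err != nil {
		return err // status stays false, so the entry is retried at next boot
	}
	store.Set(id, true)
	return nil
}

// RetryPending covers step 3: at boot, retry every entry still marked false.
// It returns the number of retried uploads that succeeded.
func RetryPending(store *StatusStore, upload UploadFunc) int {
	succeeded := 0
	for _, id := range store.Pending() {
		if err := upload(id); err == nil {
			store.Set(id, true)
			succeeded++
		}
	}
	return succeeded
}

func main() {
	store := NewStatusStore()
	failing := func(ResultID) error { return errUploadFailed }
	working := func(ResultID) error { return nil }

	_ = UploadWithStatus(store, failing, "result-1") // upload fails before a simulated crash
	fmt.Println("pending before retry:", len(store.Pending()))
	fmt.Println("retried successfully:", RetryPending(store, working))
	fmt.Println("pending after retry:", len(store.Pending()))
}
```

Because the status is written as false before the upload starts, a crash at any point between steps 1 and 2 leaves an entry that the boot-time retry can find.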

Metrics

Two new metrics are added to monitor upload behavior:

  • Count of successful uploads
  • Count of upload retries

Examples of these metrics in different scenarios can be found in the Test Plan section below.
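
For readers who want a feel for what the two counters track, here is a minimal sketch using only the standard library. The type `UploadMetrics` and its method names are hypothetical and are not the metric names or the metrics library used by this PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// UploadMetrics is a hypothetical holder for the two counters described above.
type UploadMetrics struct {
	successTotal int64 // count of successful uploads
	retryTotal   int64 // count of upload retries
}

// OnUploadSuccess is called once per upload that completes successfully.
func (m *UploadMetrics) OnUploadSuccess() { atomic.AddInt64(&m.successTotal, 1) }

// OnUploadRetry is called once per retry attempt.
func (m *UploadMetrics) OnUploadRetry() { atomic.AddInt64(&m.retryTotal, 1) }

// Snapshot returns the current values of both counters.
func (m *UploadMetrics) Snapshot() (successes, retries int64) {
	return atomic.LoadInt64(&m.successTotal), atomic.LoadInt64(&m.retryTotal)
}

func main() {
	m := &UploadMetrics{}
	m.OnUploadRetry()   // a stored entry is retried at boot
	m.OnUploadSuccess() // the retried upload goes through
	s, r := m.Snapshot()
	fmt.Printf("successful uploads: %d, upload retries: %d\n", s, r)
}
```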

Notes

  • ComputationResult reconstruction from EDS and BadgerDB: We don't store the whole ComputationResult instance directly in BadgerDB, because that would take too much space. All required fields of ComputationResult (the ones in BlockData) are already stored in EDS and BadgerDB, so we grab Events and TrieUpdates from EDS and all other fields from BadgerDB to reconstruct the ComputationResult for the upload retry.

  • Storage overhead: We don't expect this to take much extra storage space in BadgerDB, since it is only one extra boolean value per ComputationResult. We keep these markers even after a ComputationResult upload is done, for potential future devops uses.

  • AWS uploader: Confirmed that we now only care about the GCP uploader. To support new uploaders we will need to introduce a new key code for BadgerDB (more details in the comment of the PR)

  • Duplicated upload: Confirmed that uploading the same ComputationResult record more than once is fine and will not trigger any problems.
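
The reconstruction described in the first note can be pictured with the sketch below. All types here are simplified, hypothetical stand-ins rather than flow-go's real definitions: two maps play the roles of EDS and BadgerDB, and `BlockHeight` is only a placeholder for "all other fields".

```go
package main

import (
	"errors"
	"fmt"
)

// EDSEntry holds the parts of a result read back from EDS (hypothetical type).
type EDSEntry struct {
	Events      []string
	TrieUpdates []string
}

// DBEntry holds the remaining fields read back from BadgerDB (hypothetical type).
type DBEntry struct {
	BlockHeight uint64 // placeholder for "all other fields"
}

// ComputationResult is a simplified stand-in for the value handed to the uploader.
type ComputationResult struct {
	ExecutionDataID string
	Events          []string
	TrieUpdates     []string
	BlockHeight     uint64
}

var errNotFound = errors.New("entry not found")

// Reconstruct rebuilds a result from the two stores instead of reading a stored copy.
func Reconstruct(id string, eds map[string]EDSEntry, db map[string]DBEntry) (ComputationResult, error) {
	e, ok := eds[id]
	if !ok {
		return ComputationResult{}, fmt.Errorf("EDS lookup for %s: %w", id, errNotFound)
	}
	d, ok := db[id]
	if !ok {
		return ComputationResult{}, fmt.Errorf("BadgerDB lookup for %s: %w", id, errNotFound)
	}
	return ComputationResult{
		ExecutionDataID: id,
		Events:          e.Events,
		TrieUpdates:     e.TrieUpdates,
		BlockHeight:     d.BlockHeight,
	}, nil
}

func main() {
	eds := map[string]EDSEntry{"id-1": {Events: []string{"ev"}, TrieUpdates: []string{"tu"}}}
	db := map[string]DBEntry{"id-1": {BlockHeight: 42}}

	res, err := Reconstruct("id-1", eds, db)
	if err != nil {
		fmt.Println("reconstruction failed:", err)
		return
	}
	fmt.Printf("%+v\n", res)
}
```

The trade-off is the one stated in the notes: BadgerDB only needs one boolean per result, at the cost of two lookups when a retry actually happens.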

Test Plan

  • [X] UT: go test --tags=relic on all new unit test cases

  • [X] localnet: no stored ComputationResult status in DB to retry -> all uploads succeed (image: retry_ no_prev _ upload_ok)

  • [X] localnet: no stored ComputationResult status in DB to retry -> all uploads fail (image: retry_ no_prev _ upload_fail)

  • [X] localnet: with stored ComputationResult status in DB to retry -> retry fails (image: retry_ prev_some _ retry_upload_fail)

  • [X] localnet: with stored ComputationResult status in DB to retry -> retry succeeds + all uploads succeed (image: retry_ prev_some _ retry_upload_ok)

  • [ ] devnet

Tonix517 avatar Jul 01 '22 22:07 Tonix517

Codecov Report

Merging #2743 (4ede54c) into master (9784c27) will increase coverage by 0.01%. The diff coverage is 49.20%.

@@            Coverage Diff             @@
##           master    #2743      +/-   ##
==========================================
+ Coverage   54.37%   54.39%   +0.01%     
==========================================
  Files         728      731       +3     
  Lines       67431    67692     +261     
==========================================
+ Hits        36667    36820     +153     
- Misses      27706    27799      +93     
- Partials     3058     3073      +15     
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 54.39% <49.20%> (+0.01%) | :arrow_up: |


| Impacted Files | Coverage Δ | |
|---|---|---|
| cmd/execution_builder.go | 0.00% <0.00%> (ø) | |
| engine/execution/computation/manager.go | 79.89% <0.00%> (-0.84%) | :arrow_down: |
| storage/badger/operation/prefix.go | 79.31% <ø> (ø) | |
| ...on/computer/uploader/retryable_uploader_wrapper.go | 58.22% <58.22%> (ø) | |
| engine/execution/ingestion/engine.go | 52.32% <63.63%> (+0.14%) | :arrow_up: |
| storage/badger/operation/computation_result.go | 85.18% <85.18%> (ø) | |
| ...xecution/computation/computer/uploader/uploader.go | 82.92% <100.00%> (+2.92%) | :arrow_up: |
| storage/badger/computation_result.go | 100.00% <100.00%> (ø) | |
| consensus/hotstuff/eventloop/event_loop.go | 74.82% <0.00%> (ø) | |


codecov-commenter avatar Jul 01 '22 22:07 codecov-commenter

Update: did a quick offline review with @m4ksio. Things to refine: instead of storing the ComputationResult in BadgerDB, which may introduce extra storage overhead (possibly hundreds of GB), we can simply use the data already stored in EDS to reconstruct the ComputationResult at upload retry, and only store the upload status flag in BadgerDB.

Tonix517 avatar Jul 18 '22 21:07 Tonix517

FVM Benchstat comparison

This branch was compared with the base branch onflow:master commit 40c77f2b31b58d03091c4227daa06a4ca51b2845

The command (for i in {1..10}; do go test ./fvm ./engine/execution/computation --bench . --tags relic -shuffle=on --benchmem --run ^$; done) was used.

Collapsed results for better readability

| name | old.txt time/op | new.txt time/op | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeNFTBatchTransfer-2 | 113ms ±13% | 109ms ± 3% | ~ (p=0.133 n=10+9) |
| RuntimeTransaction/reference_tx-2 | 26.3ms ± 6% | 25.7ms ±10% | ~ (p=0.218 n=10+10) |
| RuntimeTransaction/convert_int_to_string-2 | 28.2ms ±15% | 27.7ms ±11% | ~ (p=0.631 n=10+10) |
| RuntimeTransaction/get_signer_address-2 | 27.8ms ±10% | 27.9ms ±18% | ~ (p=0.853 n=10+10) |
| RuntimeTransaction/get_account_and_get_available_balance-2 | 260ms ± 6% | 257ms ± 4% | ~ (p=0.218 n=10+10) |
| RuntimeTransaction/get_account_and_get_storage_used-2 | 32.0ms ±10% | 31.1ms ± 3% | ~ (p=0.156 n=10+9) |
| RuntimeTransaction/get_account_and_get_storage_capacity-2 | 226ms ± 2% | 224ms ± 3% | ~ (p=0.222 n=9+9) |
| RuntimeTransaction/get_signer_vault-2 | 34.9ms ± 8% | 34.6ms ± 8% | ~ (p=0.971 n=10+10) |
| RuntimeTransaction/get_signer_receiver-2 | 44.9ms ± 7% | 45.5ms ± 5% | ~ (p=0.436 n=10+10) |
| RuntimeTransaction/transfer_tokens-2 | 208ms ± 3% | 206ms ± 1% | ~ (p=0.113 n=10+9) |
| RuntimeTransaction/load_and_save_empty_string_on_signers_address-2 | 34.1ms ± 8% | 33.4ms ± 8% | ~ (p=0.393 n=10+10) |
| RuntimeTransaction/load_and_save_long_string_on_signers_address-2 | 75.3ms ± 4% | 75.3ms ± 4% | ~ (p=0.971 n=10+10) |
| RuntimeTransaction/create_new_account-2 | 818ms ± 2% | 809ms ± 2% | ~ (p=0.075 n=10+10) |
| RuntimeTransaction/call_empty_contract_function-2 | 29.8ms ± 8% | 29.8ms ± 8% | ~ (p=0.971 n=10+10) |
| RuntimeTransaction/emit_event-2 | 43.2ms ± 6% | 43.1ms ± 6% | ~ (p=0.912 n=10+10) |
| RuntimeTransaction/borrow_array_from_storage-2 | 131ms ± 2% | 130ms ± 2% | ~ (p=0.315 n=9+10) |
| RuntimeTransaction/copy_array_from_storage-2 | 134ms ± 7% | 132ms ± 4% | ~ (p=0.436 n=10+10) |
| pkg:github.com/onflow/flow-go/engine/execution/computation goos:linux goarch:amd64 | | | |
| ComputeBlock/16/cols/128/txes-2 | 4.77s ± 4% | 4.70s ± 3% | ~ (p=0.247 n=10+10) |
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/get_public_account-2 | 29.8ms ± 8% | 28.7ms ± 5% | −3.54% (p=0.036 n=9+8) |
| RuntimeTransaction/get_account_and_get_balance-2 | 288ms ± 7% | 277ms ± 2% | −3.94% (p=0.004 n=9+9) |
| RuntimeTransaction/convert_int_to_string_and_concatenate_it-2 | 30.1ms ±10% | 28.3ms ± 5% | −6.20% (p=0.008 n=10+9) |
 
| name | old.txt alloc/op | new.txt alloc/op | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/borrow_array_from_storage-2 | 68.8MB ± 2% | 69.9MB ± 1% | +1.68% (p=0.033 n=10+7) |
| RuntimeNFTBatchTransfer-2 | 56.7MB ± 4% | 56.7MB ± 4% | ~ (p=0.853 n=10+10) |
| RuntimeTransaction/convert_int_to_string-2 | 37.1MB ± 8% | 36.8MB ± 5% | ~ (p=0.905 n=10+9) |
| RuntimeTransaction/convert_int_to_string_and_concatenate_it-2 | 37.3MB ± 7% | 36.3MB ± 5% | ~ (p=0.218 n=10+10) |
| RuntimeTransaction/get_signer_address-2 | 36.6MB ± 6% | 36.7MB ± 9% | ~ (p=0.796 n=10+10) |
| RuntimeTransaction/get_public_account-2 | 37.9MB ± 7% | 37.5MB ± 7% | ~ (p=0.579 n=10+10) |
| RuntimeTransaction/get_account_and_get_balance-2 | 130MB ± 2% | 131MB ± 1% | ~ (p=0.515 n=10+8) |
| RuntimeTransaction/get_account_and_get_available_balance-2 | 112MB ± 3% | 111MB ± 4% | ~ (p=0.393 n=10+10) |
| RuntimeTransaction/get_account_and_get_storage_used-2 | 37.6MB ± 8% | 37.4MB ± 2% | ~ (p=0.497 n=10+9) |
| RuntimeTransaction/get_account_and_get_storage_capacity-2 | 107MB ± 2% | 105MB ± 5% | ~ (p=0.190 n=10+10) |
| RuntimeTransaction/get_signer_vault-2 | 38.2MB ± 7% | 38.0MB ± 7% | ~ (p=0.912 n=10+10) |
| RuntimeTransaction/get_signer_receiver-2 | 41.2MB ± 4% | 42.2MB ± 5% | ~ (p=0.182 n=9+10) |
| RuntimeTransaction/transfer_tokens-2 | 90.5MB ± 3% | 90.4MB ± 2% | ~ (p=0.684 n=10+10) |
| RuntimeTransaction/load_and_save_empty_string_on_signers_address-2 | 37.9MB ± 8% | 37.3MB ±10% | ~ (p=0.436 n=10+10) |
| RuntimeTransaction/load_and_save_long_string_on_signers_address-2 | 57.4MB ± 4% | 57.4MB ± 3% | ~ (p=0.912 n=10+10) |
| RuntimeTransaction/create_new_account-2 | 204MB ± 2% | 204MB ± 0% | ~ (p=0.792 n=10+6) |
| RuntimeTransaction/call_empty_contract_function-2 | 37.7MB ± 8% | 37.9MB ± 5% | ~ (p=0.912 n=10+10) |
| RuntimeTransaction/emit_event-2 | 41.5MB ± 6% | 41.3MB ± 6% | ~ (p=0.912 n=10+10) |
| RuntimeTransaction/copy_array_from_storage-2 | 81.8MB ± 7% | 82.9MB ± 2% | ~ (p=0.447 n=10+9) |
| pkg:github.com/onflow/flow-go/engine/execution/computation goos:linux goarch:amd64 | | | |
| ComputeBlock/16/cols/128/txes-2 | 1.32GB ± 1% | 1.32GB ± 1% | ~ (p=0.481 n=10+10) |
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/reference_tx-2 | 37.4MB ± 8% | 36.1MB ± 6% | −3.61% (p=0.029 n=10+10) |
 
| name | old.txt allocs/op | new.txt allocs/op | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/emit_event-2 | 142k ± 0% | 142k ± 0% | +0.01% (p=0.043 n=10+10) |
| RuntimeNFTBatchTransfer-2 | 295k ± 0% | 295k ± 0% | ~ (p=0.114 n=10+7) |
| RuntimeTransaction/convert_int_to_string-2 | 94.5k ± 0% | 94.5k ± 0% | ~ (p=0.724 n=10+10) |
| RuntimeTransaction/convert_int_to_string_and_concatenate_it-2 | 109k ± 0% | 109k ± 0% | ~ (p=0.148 n=10+10) |
| RuntimeTransaction/get_signer_address-2 | 85.5k ± 0% | 85.5k ± 0% | ~ (p=0.218 n=10+9) |
| RuntimeTransaction/get_public_account-2 | 109k ± 0% | 109k ± 0% | ~ (p=0.170 n=10+10) |
| RuntimeTransaction/get_account_and_get_balance-2 | 1.55M ± 0% | 1.55M ± 0% | ~ (p=0.565 n=10+10) |
| RuntimeTransaction/get_account_and_get_available_balance-2 | 1.43M ± 0% | 1.43M ± 0% | ~ (p=0.123 n=10+10) |
| RuntimeTransaction/get_account_and_get_storage_used-2 | 130k ± 0% | 130k ± 0% | ~ (p=0.203 n=10+9) |
| RuntimeTransaction/get_account_and_get_storage_capacity-2 | 1.27M ± 0% | 1.27M ± 0% | ~ (p=0.093 n=10+10) |
| RuntimeTransaction/get_signer_vault-2 | 132k ± 0% | 132k ± 0% | ~ (p=0.809 n=10+10) |
| RuntimeTransaction/get_signer_receiver-2 | 213k ± 0% | 213k ± 0% | ~ (p=0.156 n=10+10) |
| RuntimeTransaction/transfer_tokens-2 | 962k ± 0% | 962k ± 0% | ~ (p=0.098 n=9+10) |
| RuntimeTransaction/load_and_save_empty_string_on_signers_address-2 | 131k ± 0% | 131k ± 0% | ~ (p=0.424 n=10+10) |
| RuntimeTransaction/load_and_save_long_string_on_signers_address-2 | 233k ± 0% | 233k ± 0% | ~ (p=0.617 n=10+10) |
| RuntimeTransaction/create_new_account-2 | 2.72M ± 0% | 2.72M ± 0% | ~ (p=0.971 n=10+10) |
| RuntimeTransaction/call_empty_contract_function-2 | 97.2k ± 0% | 97.2k ± 0% | ~ (p=0.493 n=10+10) |
| RuntimeTransaction/borrow_array_from_storage-2 | 370k ± 0% | 370k ± 0% | ~ (p=0.644 n=10+10) |
| RuntimeTransaction/copy_array_from_storage-2 | 326k ± 0% | 326k ± 0% | ~ (p=1.000 n=10+10) |
| pkg:github.com/onflow/flow-go/engine/execution/computation goos:linux goarch:amd64 | | | |
| ComputeBlock/16/cols/128/txes-2 | 21.0M ± 0% | 21.0M ± 0% | ~ (p=0.739 n=10+10) |
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/reference_tx-2 | 80.3k ± 0% | 80.3k ± 0% | −0.01% (p=0.050 n=10+10) |
 
| name | old.txt computation | new.txt computation | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/reference_tx-2 | 202 ± 0% | 202 ± 0% | ~ (all equal) |
| RuntimeTransaction/convert_int_to_string-2 | 402 ± 0% | 402 ± 0% | ~ (all equal) |
| RuntimeTransaction/convert_int_to_string_and_concatenate_it-2 | 502 ± 0% | 502 ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_address-2 | 302 ± 0% | 302 ± 0% | ~ (all equal) |
| RuntimeTransaction/get_public_account-2 | 402 ± 0% | 402 ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_balance-2 | 1.00k ± 0% | 1.00k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_available_balance-2 | 2.60k ± 0% | 2.60k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_storage_used-2 | 402 ± 0% | 402 ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_storage_capacity-2 | 1.30k ± 0% | 1.30k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_vault-2 | 402 ± 0% | 402 ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_receiver-2 | 602 ± 0% | 602 ± 0% | ~ (all equal) |
| RuntimeTransaction/transfer_tokens-2 | 3.50k ± 0% | 3.50k ± 0% | ~ (all equal) |
| RuntimeTransaction/load_and_save_empty_string_on_signers_address-2 | 602 ± 0% | 602 ± 0% | ~ (all equal) |
| RuntimeTransaction/load_and_save_long_string_on_signers_address-2 | 602 ± 0% | 602 ± 0% | ~ (all equal) |
| RuntimeTransaction/create_new_account-2 | 202 ± 0% | 202 ± 0% | ~ (all equal) |
| RuntimeTransaction/call_empty_contract_function-2 | 402 ± 0% | 402 ± 0% | ~ (all equal) |
| RuntimeTransaction/emit_event-2 | 602 ± 0% | 602 ± 0% | ~ (all equal) |
| RuntimeTransaction/borrow_array_from_storage-2 | 2.60k ± 0% | 2.60k ± 0% | ~ (all equal) |
| RuntimeTransaction/copy_array_from_storage-2 | 2.60k ± 0% | 2.60k ± 0% | ~ (all equal) |
 
| name | old.txt interactions | new.txt interactions | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/fvm goos:linux goarch:amd64 | | | |
| RuntimeTransaction/reference_tx-2 | 44.4k ± 0% | 44.4k ± 0% | ~ (all equal) |
| RuntimeTransaction/convert_int_to_string-2 | 44.4k ± 0% | 44.4k ± 0% | ~ (all equal) |
| RuntimeTransaction/convert_int_to_string_and_concatenate_it-2 | 44.4k ± 0% | 44.4k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_address-2 | 44.4k ± 0% | 44.4k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_public_account-2 | 44.4k ± 0% | 44.4k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_balance-2 | 16.8M ± 0% | 16.8M ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_available_balance-2 | 5.28M ± 0% | 5.28M ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_storage_used-2 | 48.0k ± 0% | 48.0k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_account_and_get_storage_capacity-2 | 5.27M ± 0% | 5.27M ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_vault-2 | 44.7k ± 0% | 44.7k ± 0% | ~ (all equal) |
| RuntimeTransaction/get_signer_receiver-2 | 45.0k ± 0% | 45.0k ± 0% | ~ (all equal) |
| RuntimeTransaction/transfer_tokens-2 | 45.8k ± 0% | 45.8k ± 0% | ~ (all equal) |
| RuntimeTransaction/load_and_save_empty_string_on_signers_address-2 | 44.8k ± 0% | 44.8k ± 0% | ~ (all equal) |
| RuntimeTransaction/load_and_save_long_string_on_signers_address-2 | 49.7k ± 0% | 49.7k ± 0% | ~ (all equal) |
| RuntimeTransaction/create_new_account-2 | 8.53M ± 0% | 8.53M ± 0% | ~ (all equal) |
| RuntimeTransaction/call_empty_contract_function-2 | 44.6k ± 0% | 44.6k ± 0% | ~ (all equal) |
| RuntimeTransaction/emit_event-2 | 44.6k ± 0% | 44.6k ± 0% | ~ (all equal) |
| RuntimeTransaction/borrow_array_from_storage-2 | 49.8k ± 0% | 49.8k ± 0% | ~ (all equal) |
| RuntimeTransaction/copy_array_from_storage-2 | 49.8k ± 0% | 49.8k ± 0% | ~ (all equal) |
 
| name | old.txt us/tx | new.txt us/tx | delta |
|---|---|---|---|
| pkg:github.com/onflow/flow-go/engine/execution/computation goos:linux goarch:amd64 | | | |
| ComputeBlock/16/cols/128/txes-2 | 2.33k ± 4% | 2.30k ± 3% | ~ (p=0.225 n=10+10) |
 

github-actions[bot] avatar Aug 25 '22 20:08 github-actions[bot]