
Telemetry Service Fixes and Enhancements

Open ibalajiarun opened this issue 2 years ago • 8 comments

Description

This PR bundles several upgrades and fixes to the telemetry service:

  • Fixes the GCP trace ID key in log entries. GCP requires the trace key to be exactly "logging.googleapis.com/trace" in order to group logs.
  • Adds Prometheus metric counters for various backend operations.
  • Adds support for replicating metrics to multiple backends.
  • Adds retry support for backend requests.
  • Adds project name, service name, and container ID labels to service metrics.
  • Adds service-level error codes.
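
To illustrate the trace key fix in the first bullet, here is a minimal sketch of a log entry whose trace field uses the exact key GCP expects. It is not the telemetry service's code: `log_entry_fields`, the project ID, and the trace ID are hypothetical names chosen for illustration, and the `projects/<project>/traces/<trace>` value format is an assumption based on Cloud Logging's documented trace field format.

```rust
use std::collections::BTreeMap;

/// Key that Cloud Logging looks for to associate a structured log entry
/// with a trace. It must match exactly; any other spelling is treated as
/// an ordinary payload field.
const GCP_TRACE_KEY: &str = "logging.googleapis.com/trace";

/// Builds the string fields of a structured log entry.
/// `project_id` and `trace_id` are hypothetical inputs for illustration.
fn log_entry_fields(
    project_id: &str,
    trace_id: &str,
    message: &str,
) -> BTreeMap<String, String> {
    let mut fields = BTreeMap::new();
    fields.insert("message".to_string(), message.to_string());
    // Assumed value format: projects/<project>/traces/<trace>.
    fields.insert(
        GCP_TRACE_KEY.to_string(),
        format!("projects/{}/traces/{}", project_id, trace_id),
    );
    fields
}

fn main() {
    let fields = log_entry_fields("my-project", "abc123", "request handled");
    for (key, value) in &fields {
        println!("{key} = {value}");
    }
}
```

Keeping the key in a single constant means every log site uses the same spelling, which matters here because a near-miss key would not cause log entries to be grouped by trace.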

Test Plan

All changes have been deployed to production and are working as expected.



ibalajiarun avatar Oct 07 '22 19:10 ibalajiarun

Forge is running suite land_blocking on f7a539eb6489c2b3afbe617e9c317c8a81fbc717

github-actions[bot] avatar Oct 17 '22 21:10 github-actions[bot]

Forge is running suite compat on 843b204dce971d98449b82624f4f684c7a18b991 ==> f7a539eb6489c2b3afbe617e9c317c8a81fbc717

github-actions[bot] avatar Oct 17 '22 21:10 github-actions[bot]

:x: Forge suite compat failure on 843b204dce971d98449b82624f4f684c7a18b991 ==> f7a539eb6489c2b3afbe617e9c317c8a81fbc717

Forge test runner terminated:
Trailing Log Lines:
Error from server (BadRequest): container "genesis" in pod "genesis-aptos-genesis-eforge125-9qcqs" is waiting to start: trying and failing to pull image
{"level":"INFO","source":{"package":"forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:81"},"thread_name":"main","hostname":"forge-compat-pr-4875-1666042813-843b204dce971d98449b82624f4f684","timestamp":"2022-10-17T21:50:54.717635Z","message":"Genesis status: JobStatus { active: Some(1), completion_time: None, conditions: None, failed: None, start_time: Some(Time(2022-10-17T21:41:08Z)), succeeded: None }"}
{"level":"INFO","source":{"package":"forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:63"},"thread_name":"main","hostname":"forge-compat-pr-4875-1666042813-843b204dce971d98449b82624f4f684","timestamp":"2022-10-17T21:51:04.740894Z","message":"Genesis status: JobStatus { active: Some(1), completion_time: None, conditions: None, failed: None, start_time: Some(Time(2022-10-17T21:41:08Z)), succeeded: None }"}
Error from server (BadRequest): container "genesis" in pod "genesis-aptos-genesis-eforge125-9qcqs" is waiting to start: trying and failing to pull image
{"level":"INFO","source":{"package":"forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:81"},"thread_name":"main","hostname":"forge-compat-pr-4875-1666042813-843b204dce971d98449b82624f4f684","timestamp":"2022-10-17T21:51:04.810584Z","message":"Genesis status: JobStatus { active: Some(1), completion_time: None, conditions: None, failed: None, start_time: Some(Time(2022-10-17T21:41:08Z)), succeeded: None }"}
{"level":"INFO","source":{"package":"forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:280"},"thread_name":"main","hostname":"forge-compat-pr-4875-1666042813-843b204dce971d98449b82624f4f684","timestamp":"2022-10-17T21:51:04.829667Z","message":"Deleting namespace forge-compat-pr-4875: Some(NamespaceStatus { phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:388"},"thread_name":"main","hostname":"forge-compat-pr-4875-1666042813-843b204dce971d98449b82624f4f684","timestamp":"2022-10-17T21:51:04.829690Z","message":"aptos-node resources for Forge removed in namespace: forge-compat-pr-4875"}
Failed to run tests:
Genesis did not succeed
Error: Genesis did not succeed
Debugging output:
NAME                                    READY   STATUS             RESTARTS   AGE
genesis-aptos-genesis-eforge125-9qcqs   0/1     ImagePullBackOff   0          10m

github-actions[bot] avatar Oct 17 '22 21:10 github-actions[bot]

:white_check_mark: Forge suite land_blocking success on f7a539eb6489c2b3afbe617e9c317c8a81fbc717

performance benchmark with full nodes : 6608 TPS, 6007 ms latency, 8700 ms p99 latency,(!) expired 860 out of 2822520 txns
Test Ok

github-actions[bot] avatar Oct 17 '22 21:10 github-actions[bot]

Forge is running suite land_blocking on 5a094646e33ce7c01db223cf24824d79dc4b7e3b

github-actions[bot] avatar Nov 05 '22 16:11 github-actions[bot]

Forge is running suite compat on 2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 5a094646e33ce7c01db223cf24824d79dc4b7e3b

github-actions[bot] avatar Nov 05 '22 16:11 github-actions[bot]

:white_check_mark: Forge suite land_blocking success on 5a094646e33ce7c01db223cf24824d79dc4b7e3b

performance benchmark with full nodes : 6657 TPS, 5851 ms latency, 22900 ms p99 latency,(!) expired 5714 out of 2848640 txns
Test Ok

github-actions[bot] avatar Nov 05 '22 16:11 github-actions[bot]

:white_check_mark: Forge suite compat success on 2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 5a094646e33ce7c01db223cf24824d79dc4b7e3b

Compatibility test results for 2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 5a094646e33ce7c01db223cf24824d79dc4b7e3b (PR)
1. Check liveness of validators at old version: 2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7048 TPS, 5567 ms latency, 9900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 5a094646e33ce7c01db223cf24824d79dc4b7e3b
compatibility::simple-validator-upgrade::single-validator-upgrade : 4559 TPS, 9063 ms latency, 12000 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 5a094646e33ce7c01db223cf24824d79dc4b7e3b
compatibility::simple-validator-upgrade::half-validator-upgrade : 4750 TPS, 8805 ms latency, 11900 ms p99 latency,no expired txns
4. upgrading second batch to new version: 5a094646e33ce7c01db223cf24824d79dc4b7e3b
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6064 TPS, 6731 ms latency, 13700 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for 2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 5a094646e33ce7c01db223cf24824d79dc4b7e3b passed
Test Ok

github-actions[bot] avatar Nov 05 '22 16:11 github-actions[bot]

Forge is running suite land_blocking on 5a094646e33ce7c01db223cf24824d79dc4b7e3b

github-actions[bot] avatar Nov 08 '22 19:11 github-actions[bot]

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 5a094646e33ce7c01db223cf24824d79dc4b7e3b

github-actions[bot] avatar Nov 08 '22 19:11 github-actions[bot]

Forge is running suite land_blocking on e2d3da472886eb794cd855fbf91af816d45c0b1f

github-actions[bot] avatar Nov 08 '22 20:11 github-actions[bot]

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e2d3da472886eb794cd855fbf91af816d45c0b1f

github-actions[bot] avatar Nov 08 '22 20:11 github-actions[bot]

:white_check_mark: Forge suite land_blocking success on e2d3da472886eb794cd855fbf91af816d45c0b1f

performance benchmark with full nodes : 6863 TPS, 5683 ms latency, 19000 ms p99 latency,(!) expired 7416 out of 2938060 txns
Test Ok

github-actions[bot] avatar Nov 08 '22 20:11 github-actions[bot]

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e2d3da472886eb794cd855fbf91af816d45c0b1f

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e2d3da472886eb794cd855fbf91af816d45c0b1f (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7382 TPS, 5254 ms latency, 8200 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: e2d3da472886eb794cd855fbf91af816d45c0b1f
compatibility::simple-validator-upgrade::single-validator-upgrade : 4272 TPS, 9649 ms latency, 12400 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: e2d3da472886eb794cd855fbf91af816d45c0b1f
compatibility::simple-validator-upgrade::half-validator-upgrade : 5058 TPS, 7935 ms latency, 10500 ms p99 latency,no expired txns
4. upgrading second batch to new version: e2d3da472886eb794cd855fbf91af816d45c0b1f
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6727 TPS, 5905 ms latency, 9500 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e2d3da472886eb794cd855fbf91af816d45c0b1f passed
Test Ok

github-actions[bot] avatar Nov 08 '22 20:11 github-actions[bot]