new status page of all public services

Open LeoHChen opened this issue 3 years ago • 10 comments

Summary

New public status page of harmony blockchain and services

Current Design

We currently have a status page https://status.harmony.one/ to display the status of mostly the internal bootnode, validator nodes, and explorer nodes.

Problems

The current status page doesn't capture all public services, such as the uptime of the RPC endpoints, whose outages may impact users. We need to improve it and also provide a single source of truth for incidents and responses.

Proposal

On the new status page, we shall display the uptime/availability of the following services on the mainnet. We may consider adding a similar page to display the status of the testnet later.

  • bootstrap nodes uptime, checking the connectivity of the specific port: dig txt _dnsaddr.bootstrap.t.hmny.io
  • uptime of all API RPC endpoints, checking the connectivity of the RPC port: api.harmony.one, api.s0.t.hmny.io
  • uptime of the WSS endpoints, using a WebSocket connectivity check: ws.s0.t.hmny.io
  • explorer, the frontend and backend services that serve https://explorer.harmony.one
  • staking dashboard, the frontend and backend services that serve https://staking.harmony.one
  • graph nodes backend
  • bridge service, the frontend and backend services that serve https://bridge.harmony.one, both ETH and BSC bridges
  • multi-sig service, the frontend and backend services that serve https://multisig.harmony.one
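The bootstrap check in the list above could be sketched roughly as follows. This is only an illustration: it assumes the TXT records follow the libp2p dnsaddr form dnsaddr=/ip4/<address>/tcp/<port>/p2p/<peer-id>, and the helper names parse_dnsaddr and tcp_port_open are hypothetical.

```python
import socket
from typing import Tuple


def parse_dnsaddr(txt_value: str) -> Tuple[str, int]:
    """Extract (host, port) from one dnsaddr TXT value.

    Assumed form: dnsaddr=/ip4/<address>/tcp/<port>/p2p/<peer-id>
    """
    value = txt_value.strip().strip('"')
    prefix = "dnsaddr="
    if not value.startswith(prefix):
        raise ValueError(f"not a dnsaddr record: {txt_value!r}")
    parts = value[len(prefix):].strip("/").split("/")
    # A multiaddr alternates protocol names and values: ip4, 1.2.3.4, tcp, 9876, ...
    fields = dict(zip(parts[0::2], parts[1::2]))
    host = fields.get("ip4") or fields.get("dns4")
    if host is None or "tcp" not in fields:
        raise ValueError(f"no host/tcp port in record: {txt_value!r}")
    return host, int(fields["tcp"])


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Each quoted value printed by dig +short txt _dnsaddr.bootstrap.t.hmny.io could be passed to parse_dnsaddr, and the resulting pair to tcp_port_open. The same TCP check would also cover the RPC port of the API endpoints.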

Please add a link to the network metrics page as well. https://monitor.hmny.io/status

Reference

https://status.slack.com/

LeoHChen avatar Jun 15 '21 01:06 LeoHChen

@LeoHChen in addition to checking endpoint uptime, what do you say we also use synthetic monitoring to inspect the response payload of key API methods to ensure the data structure is valid?
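As a rough sketch of what such a payload inspection might look like, the function below checks only the generic JSON-RPC 2.0 envelope (version, id, and exactly one of result or error). The function name is hypothetical and this is not how the status page itself is implemented.

```python
import json
from typing import Any, List


def check_jsonrpc_response(raw: str, expected_id: Any) -> List[str]:
    """Return a list of problems; an empty list means the envelope looks valid."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(body, dict):
        return ["response is not a JSON object"]

    problems = []
    if body.get("jsonrpc") != "2.0":
        problems.append("missing or wrong 'jsonrpc' version")
    if body.get("id") != expected_id:
        problems.append("response id does not match request id")

    has_result, has_error = "result" in body, "error" in body
    if has_result == has_error:
        # JSON-RPC 2.0 requires exactly one of the two members.
        problems.append("exactly one of 'result' or 'error' must be present")
    elif has_error:
        problems.append(f"RPC error: {body['error']}")
    return problems
```

A monitor could treat any non-empty list as a failed check and attach the messages to the incident.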

givp avatar Jun 15 '21 15:06 givp

Agree. It would be better to monitor the uptime/response time of a few key APIs. A more systematic way of monitoring RPC calls would require adding instrumentation to the node to keep track of the number and response time of all RPC calls. However, for now, we can just add a list of key APIs that we need to monitor.

@gupadhyaya, we need your input on which APIs we should monitor in our status page/dashboard.
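To make the idea of tracking the number and response time of RPC calls concrete, here is a minimal sketch of a per-method recorder. The class name RpcStats and its fields are hypothetical; real node instrumentation would more likely export to an existing metrics system.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class RpcStats:
    """Keeps response-time samples per RPC method name."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, method: str, seconds: float) -> None:
        """Store one observed response time for a method."""
        self._samples[method].append(seconds)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Return call count, average and worst response time per method."""
        return {
            method: {
                "count": len(samples),
                "avg_seconds": mean(samples),
                "max_seconds": max(samples),
            }
            for method, samples in self._samples.items()
        }
```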

LeoHChen avatar Jun 15 '21 18:06 LeoHChen

There is already a request from @hypnagonia to track the response time of trace_block: https://github.com/harmony-one/harmony/issues/3780

curl --request POST 'http://54.189.61.183:9500' --header 'Content-Type: application/json' --data-raw '{
    "jsonrpc": "2.0",
    "method": "trace_block",
    "params": ["0xd6739e"],
    "id": 1
}'
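For a monitoring script, the same request could be issued and timed from Python using only the standard library. This is a sketch under assumptions: the helper names and the 30-second timeout are hypothetical, while the URL, method, and parameter are taken from the curl command above.

```python
import json
import time
import urllib.request
from typing import Any, List, Tuple


def build_rpc_request(url: str, method: str, params: List[Any],
                      request_id: int = 1) -> urllib.request.Request:
    """Build a JSON-RPC 2.0 POST request without sending it."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_rpc_call(request: urllib.request.Request,
                  timeout: float = 30.0) -> Tuple[float, Any]:
    """Send the request and return (elapsed seconds, decoded JSON body)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    return time.perf_counter() - start, body


if __name__ == "__main__":
    req = build_rpc_request("http://54.189.61.183:9500", "trace_block", ["0xd6739e"])
    seconds, _ = time_rpc_call(req)
    print(f"trace_block took {seconds:.2f}s")
```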

LeoHChen avatar Jun 15 '21 18:06 LeoHChen

We need hmy_getTransactionReceipt, hmyv2_getTransactionReceipt, and the WebSocket subscription to logs, which I think calls hmy_getLogs. These are the two key APIs for the bridge.

gupadhyaya avatar Jun 15 '21 18:06 gupadhyaya

Maybe it is also worthwhile to add the following APIs:

  • hmy_getTransactionsHistory & hmyv2_getTransactionsHistory - related to account page loading
  • hmy_call & hmyv2_call - for smart contract calls.

gupadhyaya avatar Jun 15 '21 18:06 gupadhyaya

What will be the frequency for these critical APIs in the monitoring system? If we can extend a bit more, we could also include:

  • hmy_getTransactionByHash & hmyv2_getTransactionByHash - indicates tx exists in the blockchain
  • hmy_sendRawTransaction & hmyv2_sendRawTransaction - for normal transfers and any simple smart contract execution

gupadhyaya avatar Jun 15 '21 19:06 gupadhyaya

It's totally up to us. Currently, critical APIs are checked once a minute. I will add the API method checks shortly.
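For a sense of scale, one check per minute gives 1440 samples per endpoint per day. A minimal sketch of turning those results into an uptime figure (the function name is hypothetical):

```python
from typing import Sequence


def uptime_percent(results: Sequence[bool]) -> float:
    """Percentage of checks that passed; each entry is one check result."""
    if not results:
        raise ValueError("no check results")
    return 100.0 * sum(results) / len(results)


# One day of once-a-minute checks with three failed checks.
day = [True] * 1437 + [False] * 3
print(round(uptime_percent(day), 2))  # 99.79
```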

givp avatar Jun 15 '21 19:06 givp

@LeoHChen I've added all the metrics as per your list. Please confirm at https://status.harmony.one/. Note that all of these monitors are automated and will reflect outages and recoveries.

Outstanding items:

  • Adding custom HTML link to Metrics page
  • Adding synthetic monitoring to analyze RPC method responses

givp avatar Jun 15 '21 23:06 givp

@givp we also need to implement @gupadhyaya's specific RPC tests to make sure specific features of the RPC are working fine.

@gupadhyaya would you have the actual tests (what params to use) and the expected behavior? Since our recent issue was due to missing recent transactions, we might need to implement some logic to detect whether the RPCs are healthy or not.
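One possible form of that health logic is a freshness check: compare the timestamp of the latest block the RPC returns with the current time and flag the endpoint if it lags too far. The sketch below assumes the timestamp arrives as a hex string (as in Ethereum-style responses) and uses a hypothetical 60-second threshold; both would need confirming against the real API.

```python
import time
from typing import Optional


def is_fresh(latest_block_timestamp_hex: str,
             now_s: Optional[float] = None,
             max_lag_s: float = 60.0) -> bool:
    """Return True if the latest block is at most max_lag_s seconds old.

    latest_block_timestamp_hex: block timestamp in seconds, hex-encoded
    (assumed format, e.g. "0x60c8f1a0").
    """
    if now_s is None:
        now_s = time.time()
    block_time_s = int(latest_block_timestamp_hex, 16)
    return (now_s - block_time_s) <= max_lag_s
```

An endpoint that answers quickly but keeps returning an old block would pass an uptime check and still fail this one.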

sophoah avatar Jun 16 '21 08:06 sophoah

@sophoah yes, that's what I'm working on right now. I'm going to use default parameters for all the methods from the docs to create the initial tests. We can then iterate and improve over time but I want to make sure we are getting back consistent data schemas for every test.
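A minimal sketch of such a schema consistency test is shown below. The helper name and the expected field names are illustrative assumptions, not taken from the Harmony docs; the real expected shapes would come from the documented responses.

```python
from typing import Any, List


def schema_mismatches(value: Any, expected: Any, path: str = "result") -> List[str]:
    """Compare a decoded JSON value against an expected shape.

    expected is either a Python type (str, int, list, ...) or a dict
    mapping field names to nested expected shapes.
    """
    if isinstance(expected, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object, got {type(value).__name__}"]
        problems: List[str] = []
        for key, sub_expected in expected.items():
            if key not in value:
                problems.append(f"{path}.{key}: missing")
            else:
                problems.extend(
                    schema_mismatches(value[key], sub_expected, f"{path}.{key}")
                )
        return problems
    if not isinstance(value, expected):
        return [f"{path}: expected {expected.__name__}, got {type(value).__name__}"]
    return []


# Illustrative shape for a transaction receipt (field names are assumptions).
RECEIPT_SHAPE = {"blockNumber": str, "transactionHash": str, "logs": list}
```

Extra fields in the response are ignored, so a test only fails when a field the consumers rely on disappears or changes type.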

givp avatar Jun 16 '21 15:06 givp