new status page of all public services
Summary
A new public status page for the Harmony blockchain and its services.
Current Design
We currently have a status page, https://status.harmony.one/, which mostly displays the status of the internal bootnodes, validator nodes, and explorer nodes.
Problems
The current status page does not cover all public services, such as the uptime of the RPC endpoints, whose outages may impact users. We need to improve it and also provide a single source of truth for incidents and our response to them.
Proposal
On the new status page, we shall display the uptime/availability of the following services on mainnet. We may consider adding a similar page for testnet later.
- bootstrap nodes uptime, checking the connectivity of the specific port
dig txt _dnsaddr.bootstrap.t.hmny.io
- uptime of all API RPC endpoints, checking the connectivity of the RPC port
api.harmony.one, api.s0.t.hmny.io
- uptime of the WSS endpoints, using WebSocket connectivity check
ws.s0.t.hmny.io
- explorer, the frontend and backend services behind https://explorer.harmony.one
- staking dashboard, the frontend and backend services behind https://staking.harmony.one
- graph nodes backend
- bridge service, the frontend and backend services behind https://bridge.harmony.one, for both the ETH and BSC bridges
- multi-sig service, the frontend and backend services behind https://multisig.harmony.one
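The port-connectivity checks listed above could be sketched roughly as follows. This is a minimal illustration, not the tooling actually used for status.harmony.one: `check_tcp` is a hypothetical helper, and the port is left to the caller because the proposal does not state which ports are probed.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Try to open a TCP connection and report success plus elapsed time."""
    started = time.monotonic()
    try:
        # create_connection resolves the host and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            up = True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        up = False
    return {
        "host": host,
        "port": port,
        "up": up,
        "elapsed_s": time.monotonic() - started,
    }
```

For example, `check_tcp("api.harmony.one", 443)` would probe the main RPC endpoint, assuming it is served over HTTPS on port 443. A plain TCP connect only shows that the port accepts connections; it says nothing about whether the service behind it answers correctly.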
Please add a link to the network metrics page as well. https://monitor.hmny.io/status
Reference
https://status.slack.com/
@LeoHChen as well as checking endpoint uptime, what do you say we also use synthetic monitoring to inspect the response payloads of key API methods and ensure the data structure is valid?
Agreed. It would be better to monitor the uptime/response time of a few key APIs. A more systematic way of monitoring RPC calls would require instrumenting the node to track the number and response time of all RPC calls. For now, however, we can just add a list of the key APIs that we need to monitor.
@gupadhyaya, we need your input on which APIs we should monitor in our status page/dashboard.
There is already a request to track the response time of trace_block, coming from @hypnagonia: https://github.com/harmony-one/harmony/issues/3780
curl --request POST 'http://54.189.61.183:9500' --header 'Content-Type: application/json' --data-raw '{
"jsonrpc": "2.0",
"method": "trace_block",
"params": ["0xd6739e"],
"id": 1
}'
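To track response time rather than just availability, the same call can be issued from a script that records latency. The sketch below uses only the Python standard library; `build_request` and `timed_rpc_call` are hypothetical names, the node URL is whatever the caller passes in, and the 30-second timeout is an arbitrary assumption.

```python
import json
import time
import urllib.request


def build_request(method: str, params: list, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body like the one in the curl example."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}


def timed_rpc_call(url: str, method: str, params: list, timeout: float = 30.0) -> dict:
    """POST one JSON-RPC call and return the decoded reply with its latency."""
    body = json.dumps(build_request(method, params)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    started = time.monotonic()
    # urlopen raises on connection errors, timeouts, and HTTP error statuses;
    # a monitor would catch those and count them as failed checks.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.loads(resp.read().decode("utf-8"))
    return {"reply": reply, "elapsed_s": time.monotonic() - started}
```

Calling `timed_rpc_call(node_url, "trace_block", ["0xd6739e"])` mirrors the curl request above and returns the elapsed time in seconds alongside the parsed reply.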
We need hmy_getTransactionReceipt, hmyv2_getTransactionReceipt, and the WebSocket subscription to logs, which I think calls hmy_getLogs. These are the two key APIs for the bridge.
Maybe it is also worthwhile to add the following APIs:
- hmy_getTransactionsHistory & hmyv2_getTransactionsHistory - related to account page loading
- hmy_call & hmyv2_call - for smart contract calls
What will be the check frequency for these critical APIs in the monitoring system? If we can extend the list a bit more, we could also include:
- hmy_getTransactionByHash & hmyv2_getTransactionByHash - indicates that a tx exists in the blockchain
- hmy_sendRawTransaction & hmyv2_sendRawTransaction - for normal transfers and any simple smart contract execution
It's totally up to us. Currently, critical APIs are checked once a minute. I will add the API method checks shortly.
@LeoHChen I've added all the metrics as per your list. Please confirm https://status.harmony.one/ - Note all of these monitors have been automated and will reflect outages and recoveries.
Outstanding items:
- Adding custom HTML link to Metrics page
- Adding synthetic monitoring to analyze RPC method responses
@givp we also need to implement @gupadhyaya's specific RPC tests to make sure specific features of the RPC are working fine.
@gupadhyaya do you have the actual tests (which params to use) and the expected behavior? Since our recent issue was due to missing recent transactions, we might need to implement some logic to detect whether the RPCs are healthy or not.
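One simple form such health logic could take is a freshness check: treat an endpoint as unhealthy when the newest block it reports is too old. This is only an illustration of the idea, not the logic that was implemented; `is_stale` is a hypothetical helper, the 60-second threshold is an arbitrary assumption, and fetching the latest block timestamp from the node is left to the caller.

```python
import time
from typing import Optional


def is_stale(latest_block_timestamp: float, max_lag_s: float = 60.0,
             now: Optional[float] = None) -> bool:
    """Return True when the newest reported block is older than max_lag_s seconds."""
    # `now` can be injected for testing; otherwise the current wall clock is used.
    current = time.time() if now is None else now
    return (current - latest_block_timestamp) > max_lag_s
```

A monitor could run this once per check interval against each RPC endpoint and raise an incident when it returns True for several consecutive checks.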
@sophoah yes, that's what I'm working on right now. I'm going to use default parameters for all the methods from the docs to create the initial tests. We can then iterate and improve over time but I want to make sure we are getting back consistent data schemas for every test.
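A schema-consistency test of this kind might look like the sketch below. It checks the JSON-RPC 2.0 envelope and, optionally, that an object result contains a set of expected keys. `validate_rpc_response` is a hypothetical helper, and the expected keys for any given method are an assumption the test author would take from the docs.

```python
def validate_rpc_response(reply, request_id, required_result_keys=()):
    """Return a list of problems; an empty list means the reply looks well-formed."""
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]

    problems = []
    if reply.get("jsonrpc") != "2.0":
        problems.append("missing or wrong 'jsonrpc' version")
    if reply.get("id") != request_id:
        problems.append("response id does not match request id")

    if "error" in reply:
        problems.append(f"RPC error: {reply['error']}")
    elif "result" not in reply:
        problems.append("neither 'result' nor 'error' present")
    elif required_result_keys:
        result = reply["result"]
        if not isinstance(result, dict):
            problems.append("result is not an object")
        else:
            # Report every expected key that the result lacks.
            missing = [k for k in required_result_keys if k not in result]
            if missing:
                problems.append(f"result missing keys: {missing}")
    return problems
```

For a receipt check, one might pass keys such as `("blockNumber", "transactionHash", "status")`, assuming the receipt follows the Ethereum-style layout. Returning a list of problems rather than a boolean lets the status page show why a check failed.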