
HA setup

Open ctosae opened this issue 4 years ago • 9 comments

Following previous conversations, it seems that this is the best way to try to deploy an "HA" configuration:

  • one TMKMS (Active) connected to multiple validators for the same chain-id (see the config sketch below)
  • keep a second TMKMS (Passive)

https://github.com/tendermint/tmkms/pull/272
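
For illustration, a minimal tmkms.toml sketch of that layout could look like the following (hostnames and the key path are placeholders, and the [[chain]] / [[providers]] sections are omitted); the second, passive TMKMS would keep an equivalent configuration and only be started if the active one fails:

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://validator-1.example.com:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://validator-2.example.com:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true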

Is it safe? Have any further developments or improvements been made?

ctosae avatar Apr 19 '21 17:04 ctosae

Is it safe?

It's received some degree of testing and some validators run in this configuration.

Have any further developments or improvements been made?

Not yet. We'll be switching to gRPC for validator <-> TMKMS connections soon (#73), after which TMKMS will track an "active" validator node and the others will be passive until the active validator fails.

After this migration is completed, we'll look into HA for TMKMS itself.

tony-iqlusion avatar Apr 19 '21 18:04 tony-iqlusion

I was also thinking about this solution; it seems safer to me, even in the case of bugs.

TMKMS01+HSM --+> VALIDATOR +--> SENTRY1
TMKMS02+HSM --|            |--> SENTRY2
                           |--> SENTRY3

(TMKMS01 and TMKMS02 "state_file" are NOT in sync)

Since a VALIDATOR node accepts only one Tendermint connection (from an external PrivValidator process), this could be a way to create redundancy for TMKMS.
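
For reference, that single connection is the external signer socket the validator exposes via priv_validator_laddr in its config.toml, for example (listen address is a placeholder):

# config.toml on the VALIDATOR node
# The node listens here for exactly one external PrivValidator (signer)
# connection, so only one of TMKMS01/TMKMS02 can be attached at a time.
priv_validator_laddr = "tcp://0.0.0.0:26658"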

Disaster recovery in case of a VALIDATOR fault could consist of connecting one of the two TMKMS instances to a SENTRY.

What do you think about it?

ctosae avatar May 18 '21 09:05 ctosae

The migration to gRPC reverses the direction of the connection, so validators connect to TMKMS rather than the other way around.

We'll likely want to deprecate/phase out the current "secret connection"-based approach.

tony-iqlusion avatar May 18 '21 14:05 tony-iqlusion

It's received some degree of testing and some validators run in this configuration.

I've set up 2 validators and 1 tmkms, and from the logs I can see tmkms responding to both validators with the same key. So it looks like this is risky for now?!


testnet2  | 12:47PM INF committed state app_hash=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4 height=3413176 module=state num_txs=1
tmkms     | 2022-12-07T12:47:44.943746Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] received request: ShowPublicKey(PubKeyRequest)
tmkms     | 2022-12-07T12:47:44.943764Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] sending response: PublicKey(PubKeyResponse { pub_key_ed25519: [74, 21, 224, 140, 241, 58, 2, 66, 174, 235, 92, 12, 46, 136, 122, 138, 1, 185, 116, 106, 248, 39, 144, 141, 43, 121, 23, 2, 181, 84, 236, 248] })
testnet3  | 12:47PM INF commit synced commit=436F6D6D697449447B5B323234203437203320322031303120352032303820313320313836203139342038302034332032333120302032333620373320383520313931203235332039312032313920323035203337203634203238203135322031353420313831203136352031303220313830203138305D3A3334313442387D
testnet3  | 12:47PM INF committed state app_hash=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4 height=3413176 module=state num_txs=1
testnet2  | 12:47PM INF indexed block height=3413176 module=txindex
tmkms     | 2022-12-07T12:47:44.951079Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] received request: ShowPublicKey(PubKeyRequest)
tmkms     | 2022-12-07T12:47:44.951103Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] sending response: PublicKey(PubKeyResponse { pub_key_ed25519: [74, 21, 224, 140, 241, 58, 2, 66, 174, 235, 92, 12, 46, 136, 122, 138, 1, 185, 116, 106, 248, 39, 144, 141, 43, 121, 23, 2, 181, 84, 236, 248] })
testnet3  | 12:47PM INF indexed block height=3413176 module=txindex

tmkms     | 2022-12-07T12:47:48.277496Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] received request: ReplyPing(PingRequest)
tmkms     | 2022-12-07T12:47:48.277545Z DEBUG tmkms::session: [mantle-1@tcp://testnet2:1234] sending response: Ping(PingResponse)
tmkms     | 2022-12-07T12:47:48.284968Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] received request: ReplyPing(PingRequest)
tmkms     | 2022-12-07T12:47:48.285011Z DEBUG tmkms::session: [mantle-1@tcp://testnet3:1234] sending response: Ping(PingResponse)
testnet2  | 12:47PM INF Timed out dur=4912.38743 height=3413177 module=consensus round=0 step=1
testnet3  | 12:47PM INF Timed out dur=4917.999917 height=3413177 module=consensus round=0 step=1
testnet3  | 12:47PM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E","parts":{"hash":"8F16936C7F74ECFD600C14CDC7A2277812FF5CBEA4580A11F84E7C017757A60C","total":1}},"height":3413177,"pol_round":-1,"round":0,"signature":"NgrA5gmuOw812cIAGc2Ef4HGGmr8I4iZeZyRrft8HOl6DWgYa/SkSFnq+v6pp6j3196KdgLkHrScj7hd17M4BQ==","timestamp":"2022-12-07T12:47:49.860448507Z"}
testnet3  | 12:47PM INF received complete proposal block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus
testnet2  | 12:47PM INF received proposal module=consensus proposal={"Type":32,"block_id":{"hash":"D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E","parts":{"hash":"8F16936C7F74ECFD600C14CDC7A2277812FF5CBEA4580A11F84E7C017757A60C","total":1}},"height":3413177,"pol_round":-1,"round":0,"signature":"NgrA5gmuOw812cIAGc2Ef4HGGmr8I4iZeZyRrft8HOl6DWgYa/SkSFnq+v6pp6j3196KdgLkHrScj7hd17M4BQ==","timestamp":"2022-12-07T12:47:49.860448507Z"}
testnet2  | 12:47PM INF received complete proposal block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus
testnet2  | 12:47PM INF finalizing commit of block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus num_txs=0 root=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4
testnet3  | 12:47PM INF finalizing commit of block hash=D968C5B0F77373DF954F631565CF620471050CD5A8CCFFA80EE51984CD2D063E height=3413177 module=consensus num_txs=0 root=E02F03026505D00DBAC2502BE700EC4955BFFD5BDBCD25401C989AB5A566B4B4
testnet2  | 12:47PM INF minted coins from module account amount=72399106umntl from=mint module=x/bank
testnet2  | 12:47PM INF executed block height=3413177 module=state num_invalid_txs=0 num_valid_txs=0
testnet2  | 12:47PM INF commit synced commit=436F6D6D697449447B5B3237203231342031333120323232203239203137392031333020313036203133362039372038392031383820313833203638203131372031352031343420323033203133362031383220313139203133352034203138332032333320313934203539203133302031393920313434203735203234345D3A3334313442397D
testnet2  | 12:47PM INF committed state app_hash=1BD683DE1DB3826A886159BCB744750F90CB88B6778704B7E9C23B82C7904BF4 height=3413177 module=state num_txs=0
testnet3  | 12:47PM INF minted coins from module account amount=72399106umntl from=mint module=x/bank
testnet3  | 12:47PM INF executed block height=3413177 module=state num_invalid_txs=0 num_valid_txs=0

pratikbin avatar Dec 07 '22 12:12 pratikbin

@pratikbin it's intended and semi-supported to allow multiple concurrent validators. We don't recommend that but it's been tested and no one has reported problems yet.

In that case they're signing the same commit hashes. It's deliberately supported to be able to re-sign the exact same hash at the exact same h/r/s (height/round/step) for fault-tolerance purposes. The signing process is deterministic, so this results in the same signature on the same proposal, which doesn't count as double signing.

In the event multiple validators send conflicting proposals, the first validator will "win" and the other validators will receive a double-signing error.
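
To make that rule concrete, here is a simplified sketch in Rust (not the actual tmkms implementation) of a height/round/step check along those lines: an advancing h/r/s is signed, the exact same h/r/s with the same block ID is re-signed, and a conflicting block ID at an already-signed h/r/s is rejected:

// Simplified sketch of the h/r/s double-sign check described above.
// Names and types are illustrative, not tmkms's internal API.

#[derive(Clone, Debug, PartialEq, Eq)]
struct ConsensusState {
    height: u64,
    round: u32,
    step: u8,
    block_id: Option<String>, // hex block hash, None if nothing signed yet
}

#[derive(Debug, PartialEq)]
enum SignOutcome {
    Signed,          // new h/r/s: sign and persist the advanced state file
    ReSigned,        // same h/r/s, same block ID: deterministic re-sign, not double signing
    DoubleSignError, // same or earlier h/r/s with a conflicting block ID: refuse to sign
}

fn check_and_update(state: &mut ConsensusState, req: &ConsensusState) -> SignOutcome {
    let cur = (state.height, state.round, state.step);
    let new = (req.height, req.round, req.step);

    if new > cur {
        // Monotonically advancing: safe to sign and record the new state.
        *state = req.clone();
        SignOutcome::Signed
    } else if new == cur && req.block_id == state.block_id {
        // The same request replayed (e.g. from a second validator node):
        // signing is deterministic, so the same signature is returned.
        SignOutcome::ReSigned
    } else {
        // Height/round/step regression or a conflicting block at the same h/r/s.
        SignOutcome::DoubleSignError
    }
}

fn main() {
    let mut state = ConsensusState {
        height: 3413177,
        round: 0,
        step: 1,
        block_id: Some("D968C5B0".to_string()),
    };
    let replay = state.clone();
    assert_eq!(check_and_update(&mut state, &replay), SignOutcome::ReSigned);

    let conflict = ConsensusState { block_id: Some("DEADBEEF".to_string()), ..replay };
    assert_eq!(check_and_update(&mut state, &conflict), SignOutcome::DoubleSignError);
}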

tony-iqlusion avatar Dec 07 '22 19:12 tony-iqlusion

Hello,

I'm interested in running this kind of setup.

  • 1 tmkms run by an orchestrator for redundancy
  • multiple validator nodes connected to tmkms for HA.

But I have a question about the node_key.json for this architecture.

Should I set up the 2 nodes with the same node_key

and have a config like:

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

or set a different node_key for each node

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

[[validator]]
chain_id = "cosmoshub-3"
addr = "tcp://[email protected]:26658"
secret_key = "/root/config/secrets/kms-identity.key"
protocol_version = "legacy"
reconnect = true

Any ETA on an HA status update?

albttx avatar Feb 26 '23 11:02 albttx

@albttx AFAIK, it won't join the p2p network with the same node_key, since that's the Tendermint p2p key.

pratikbin avatar Feb 27 '23 05:02 pratikbin

@tony-iqlusion is there any news about HA? Or could you review and support configurations like the previous ones (from @albttx)? I'm testing Horcrux for the first time and it does that. TMKMS keeps closing the connection (to prevent double-signing). Thanks

activenodes avatar Aug 18 '23 07:08 activenodes

We've largely been waiting for a migration to gRPC, which will reverse the client/server relationship between the KMS and validator nodes. Instead of having to explicitly configure several validators for the KMS to connect to, multiple validators can connect to the KMS.

That's tracked here: https://github.com/cometbft/cometbft/issues/476

tony-iqlusion avatar Aug 18 '23 13:08 tony-iqlusion