fabric
Getting quorum loss issue as we rotate the orderer TLS certs in fabric 2.2
Hi Team,
I have 3 orderer nodes running in the cloud, on k8s. From the cli I create new TLS and MSP certificates and store them in a separate folder, with all orderers still pointing to the old certs. Then, from the cli, I fetch the config, decode it, add the new ord0 TLS cert, encode it, and push the update for syschannel as well as for the application channel. Both channel update operations return success for ord0. However, when I perform the same steps for ord1, I get a quorum loss error.
2022-08-03 14:08:18.584 UTC [channelCmd] InitCmdFactory -> INFO 001 Endorser and orderer connections initialized
Error: got unexpected status: BAD_REQUEST -- error applying config update to existing channel 'syschannel': consensus metadata update for channel config update is invalid: 2 out of 3 nodes are alive, configuration will result in quorum loss
The real issue here is that you have one orderer down out of three, and when you try to rotate the certificate of that orderer, the orderer validation logic doesn't take into account that you're rotating the certificate of the orderer that is down, and wrongly decides it will result in quorum loss.
@shivdeep-singh-ibm do you want to take a look at this?
@yacovm Sure.
https://discord.com/channels/905194001349627914/945037581500940298/1004380841973125150
I think this diagnosis is incorrect, since the orderer validation code already considers only the alive orderers when deciding whether an update would cause quorum loss.
However, the logs show that the first orderer was not able to join the quorum because it didn't find its own cert among the consenters. The probable causes are: a) orderer 1 didn't receive the new config block containing the rotated certs, but it is pointing to the rotated certs; b) orderer 1 received the new config block containing the rotated certs, but it is still pointing to the original certs.
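To make the "2 out of 3 nodes are alive" message easier to reason about, here is a rough Python model of the quorum arithmetic. It is only a sketch of how I understand the check, not the actual Fabric code: the function name, the string node labels, and the assumption that a node whose cert is being rotated is not counted as alive while it switches certs are all mine.

```python
from typing import Iterable, Optional


def unacceptable_quorum_loss(
    new_consenter_count: int,
    active: Iterable[str],
    rotated: Optional[str] = None,
) -> bool:
    """Simplified, hypothetical model of a Raft quorum-loss check."""
    alive = set(active)
    # Assumption: the consenter being rotated is treated as unavailable
    # until it comes back with its new certificate.
    if rotated is not None:
        alive.discard(rotated)
    # Raft needs a strict majority of the consenter set.
    quorum = new_consenter_count // 2 + 1
    # Assumption: clusters of one or two nodes cannot tolerate a fault
    # anyway, so this model never flags them.
    fault_tolerant = new_consenter_count > 2
    return fault_tolerant and len(alive) < quorum


if __name__ == "__main__":
    # ord0 unreachable, rotating ord1: only ord2 stays alive, below quorum of 2.
    print(unacceptable_quorum_loss(3, {"ord1", "ord2"}, rotated="ord1"))
    # All three reachable, rotating ord1: ord0 and ord2 still form a quorum.
    print(unacceptable_quorum_loss(3, {"ord0", "ord1", "ord2"}, rotated="ord1"))
```

Under this model, rotating an alive orderer is only safe while every other orderer in a 3-node cluster is reachable, so an orderer that the others cannot connect to is enough to trigger the rejection.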
Please check the steps I am performing to arrive at this issue.
I have a network of 3 orderers (0, 1, 2) hosted on Kubernetes on GCP. All the certificates are generated and kept in Kubernetes secrets and configmaps, which are mapped to the respective orderers. I create new MSP and TLS certs for all peers and orderers from the cli. All orderers still point to the old certs. I create a cli with core_peer_address set to orderer0 and using its MSP. Then I fetch the config block, decode it, add the new orderer0 TLS cert to the consenters, encode it, and publish it to the channel. This executes successfully. I perform the same operation for the application channel, which also returns success.
After this I replace orderer0's MSP and TLS cert (in the Kubernetes secret and configmap). Now I start a cli for orderer1 to perform the same operation. But once I publish the channel update, it returns the quorum loss error.
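For reference, the "add the new TLS cert to the consenters" step can be sketched in Python against the decoded config JSON. This is a minimal sketch under some assumptions: the config has already been decoded to JSON (for example with configtxlator), the consenters sit under the etcdraft ConsensusType metadata path used below, the same certificate is used for both client and server TLS, and the function names are made up for illustration.

```python
import base64
import copy

# Assumed location of the Raft consenter list inside the decoded channel config.
CONSENTERS_PATH = (
    "channel_group", "groups", "Orderer", "values",
    "ConsensusType", "value", "metadata", "consenters",
)


def get_consenters(config: dict) -> list:
    """Walk the assumed path and return the consenter list."""
    node = config
    for key in CONSENTERS_PATH:
        node = node[key]
    return node


def rotate_consenter_tls(config: dict, host: str, port: int, new_cert_pem: str) -> dict:
    """Return a copy of config with one consenter's TLS certs replaced."""
    updated = copy.deepcopy(config)  # keep the original for computing the update
    encoded = base64.b64encode(new_cert_pem.encode()).decode()
    matches = [
        c for c in get_consenters(updated)
        if c["host"] == host and c["port"] == port
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one consenter for {host}:{port}, found {len(matches)}")
    # Assumption: one certificate serves as both client and server TLS cert.
    matches[0]["client_tls_cert"] = encoded
    matches[0]["server_tls_cert"] = encoded
    return updated
```

The returned dictionary would be written out as the modified config, and the update is then computed from the original and modified versions. Only one consenter is changed per update, which matches the one-orderer-at-a-time procedure described above.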
Please let me know if you have any concerns, or share the correct steps to rotate the certs. This is affecting my production network, and the network certificates will expire soon.
Also, after doing the config update, wait for some time so that orderer0 receives the config blocks. After this step we assume that the config update has been published, but we need to be sure. For debugging, can you fetch the channel config, decode it, and check whether your new TLS cert is present for orderer0? We want to make sure that the config update has been committed to orderer0.
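The check described here could be scripted roughly as follows. This is a sketch that assumes the fetched config block has already been decoded to JSON and reduced to the config section, that the consenters live under the etcdraft metadata path used below, and that certs are stored as base64-encoded PEM; the function name is hypothetical.

```python
import base64

# Assumed location of the Raft consenter list inside the decoded channel config.
CONSENTERS_PATH = (
    "channel_group", "groups", "Orderer", "values",
    "ConsensusType", "value", "metadata", "consenters",
)


def consenters_with_cert(config: dict, cert_pem: str) -> list:
    """Return (host, port) of every consenter whose server TLS cert equals cert_pem."""
    node = config
    for key in CONSENTERS_PATH:
        node = node[key]
    wanted = cert_pem.strip()
    found = []
    for consenter in node:
        # Certs are assumed to be base64-encoded PEM text in the decoded JSON.
        current = base64.b64decode(consenter["server_tls_cert"]).decode().strip()
        if current == wanted:
            found.append((consenter["host"], consenter["port"]))
    return found
```

If the new orderer0 cert is passed in and the result does not contain orderer0's host and port, the config fetched from that node does not yet include the rotation.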
Here is what I observed. Updating syschannel and the application channel for ord0 works fine. I then replaced ord0's MSP and TLS cert. After that, when I fetch the syschannel config block from ord1, it contains the new TLS cert of ord0. But when I update syschannel for ord1, I get the same error:
2022-08-11 08:40:40.849 UTC 0001 INFO [channelCmd] InitCmdFactory -> Endorser and orderer connections initialized
Error: got unexpected status: BAD_REQUEST -- error applying config update to existing channel 'syschannel': consensus metadata update for channel config update is invalid: 2 out of 3 nodes are alive, configuration will result in quorum loss
I am also getting the error below in ord1:
2022-08-11 08:52:48.193 UTC 01d7 ERRO [comm.tls] ClientHandshake -> Client TLS handshake failed after 1.089742ms with error: public key of server certificate presented by orderer0-hlf-ord.svc.cluster.local:7050 doesn't match the expected public key remoteaddress=10.80.43.165:7050
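One way to investigate a mismatch like this is to compare the certificate an orderer actually presents on its port with the one recorded in the channel config. The sketch below uses only the Python standard library. It compares whole certificates rather than just public keys, which is a stricter check than the one in the log message, and it assumes the endpoint completes a TLS handshake without a client certificate, which may not hold for your deployment. The function names are hypothetical.

```python
import ssl


def same_certificate(pem_a: str, pem_b: str) -> bool:
    """Compare two PEM certificates by their DER bytes, ignoring line wrapping."""
    der_a = ssl.PEM_cert_to_DER_cert(pem_a.strip())
    der_b = ssl.PEM_cert_to_DER_cert(pem_b.strip())
    return der_a == der_b


def presented_matches_expected(host: str, port: int, expected_pem: str) -> bool:
    """Fetch the certificate served at host:port and compare it to expected_pem."""
    # get_server_certificate does not validate the chain; it only retrieves the cert.
    presented_pem = ssl.get_server_certificate((host, port))
    return same_certificate(presented_pem, expected_pem)
```

A False result for an orderer whose new cert is already in the config would suggest that the process is still serving its old certificate.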
Can you pull a config block from orderer0 and check the tls keys in it?
Okay. From ord0, once I push the channel update to syschannel and the application channel, I can no longer fetch the config block; it returns the error Service Unavailable.
From ord1, I can fetch the config block, and in it I do see the new TLS cert of ord0.
So your orderer0 has some problem, or some step may be missing. However, you can go through this resource for a demo of rotating certs in Fabric.
It got solved. ord0 was not presenting the correct public key; I needed to manually restart ord0 so that it picked up the new certs.
Thanks for the help.