Getting quorum loss issue as we rotate the orderer TLS certs in fabric 2.2

Open shobhitJava opened this issue 3 years ago • 8 comments

Hi Team,

I have 3 orderer nodes running in k8s in the cloud. From the CLI I generate new TLS and MSP certificates and store them in a separate folder, while all orderers still point to the old certs. Then, from the CLI, I fetch the config, decode it, add the new ord0 TLS cert, encode it, and push the update for syschannel as well as for the application channel. Both channel updates return success for ord0. However, when I do the same for ord1, I get a quorum loss error.

2022-08-03 14:08:18.584 UTC [channelCmd] InitCmdFactory -> INFO 001 Endorser and orderer connections initialized
Error: got unexpected status: BAD_REQUEST -- error applying config update to existing channel 'syschannel': consensus metadata update for channel config update is invalid: 2 out of 3 nodes are alive, configuration will result in quorum loss

shobhitJava avatar Aug 04 '22 02:08 shobhitJava

The real issue here is that you have one orderer down out of three, and when you try to rotate the certificate of that orderer, the orderer validation logic doesn't take into account that you're rotating the certificate of the orderer that is down, and wrongly decides it will result in quorum loss.

@shivdeep-singh-ibm do you want to take a look at this?

yacovm avatar Aug 04 '22 08:08 yacovm

@yacovm Sure.

shivdeep-singh-ibm avatar Aug 04 '22 09:08 shivdeep-singh-ibm

https://discord.com/channels/905194001349627914/945037581500940298/1004380841973125150

yacovm avatar Aug 04 '22 12:08 yacovm

> The real issue here is that you have one orderer down out of three, and when you try to rotate the certificate of that orderer, the orderer validation logic doesn't take into account that you're rotating the certificate of the orderer that is down, and wrongly decides it will result in quorum loss.
>
> @shivdeep-singh-ibm do you want to take a look at this?

I think this diagnosis is incorrect, since the orderer validation code already has logic that considers only the alive orderers when deciding whether a change would cause quorum loss.

However, the logs show that the first orderer was not able to join the quorum because it didn't find its own cert among the consenters. The probable causes are: a) orderer 1 didn't receive the new config block containing the rotated certs, while it is already pointing to the rotated certs; b) orderer 1 received the new config block containing the rotated certs, but it is still pointing to the original certs.
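To make the arithmetic behind the error message concrete, here is a minimal sketch of a quorum check for a certificate rotation. It is a simplified model written for the three-orderer case in this thread, not Fabric's actual validation code. It assumes that the node being rotated is not counted as alive for the purpose of the check, and the function name and parameters are hypothetical.

```python
def rotation_causes_quorum_loss(consenter_ids, alive_ids, rotated_id):
    """Simplified model of a Raft quorum check for a cert rotation.

    Assumption (not taken from the Fabric source): the node whose
    certificate is being rotated is excluded from the alive set, and
    the remaining alive nodes must still form a majority.
    """
    quorum = len(consenter_ids) // 2 + 1
    # Only count alive nodes that are actually consenters.
    alive_after = set(alive_ids) & set(consenter_ids)
    # The rotated node is treated as unavailable during the change.
    alive_after.discard(rotated_id)
    return len(alive_after) < quorum
```

Under this model, rotating one of three orderers while another is unreachable leaves a single alive node against a quorum of two, which matches the "2 out of 3 nodes are alive" rejection. With all three reachable, the same rotation would pass.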

shivdeep-singh-ibm avatar Aug 10 '22 11:08 shivdeep-singh-ibm

Please check the steps I am performing to arrive at this issue.

I have a network of 3 orderers (0, 1, 2) hosted on Kubernetes on GCP. All the certificates are generated and kept in Kubernetes secrets and configmaps, which are mounted into the respective orderers. I create new MSP and TLS certs for all peers and orderers from the CLI. All orderers still point to the old certs. I create a CLI with core_peer_address set to orderer0, using its MSP. Then I fetch the config block, decode it, add orderer0's new TLS cert to the consenters, encode it, and publish it to the channel. This executes successfully. I perform the same operation for the application channel, which also returns success.

After this I replace orderer0's MSP and TLS cert (in the Kubernetes secret and configmap). Now I boot a CLI for orderer1 to perform the same operation. But once I publish the channel update, it returns the quorum loss error.

Please let me know if you have any questions. Alternatively, could you share the steps to rotate the certs? This is affecting my production network, and the network certificates will expire soon.
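The "add the new TLS cert to the consenters" step can be sketched as an edit on the decoded config JSON. This is a minimal sketch under stated assumptions: it assumes the usual layout of a configtxlator-decoded channel config, where the Raft consenters sit under the Orderer group's ConsensusType value with base64-encoded certs, and it assumes the same certificate is used for both the client and server TLS fields. The function name is hypothetical.

```python
import base64
import copy


def replace_consenter_tls_cert(config, host, port, new_cert_pem):
    """Return a copy of a decoded channel config with one consenter's
    TLS certs replaced. The input config is left unmodified so it can
    serve as the "original" when computing the config update."""
    updated = copy.deepcopy(config)
    # Assumed path of the Raft consenter list in the decoded JSON.
    consenters = (updated["channel_group"]["groups"]["Orderer"]["values"]
                  ["ConsensusType"]["value"]["metadata"]["consenters"])
    matches = [c for c in consenters
               if c["host"] == host and c["port"] == port]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one consenter for {host}:{port}")
    # Certs are stored as base64 of the PEM text in the decoded JSON.
    encoded = base64.b64encode(new_cert_pem.encode()).decode()
    matches[0]["client_tls_cert"] = encoded
    matches[0]["server_tls_cert"] = encoded
    return updated
```

Changing one consenter per config update, and keeping the original config untouched, mirrors the one-orderer-at-a-time procedure described in this comment.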

shobhitJava avatar Aug 11 '22 02:08 shobhitJava

> I have a network of 3 orderers (0, 1, 2) hosted on Kubernetes on GCP. All the certificates are generated and kept in Kubernetes secrets and configmaps, which are mounted into the respective orderers. I create new MSP and TLS certs for all peers and orderers from the CLI. All orderers still point to the old certs. I create a CLI with core_peer_address set to orderer0, using its MSP. Then I fetch the config block, decode it, add orderer0's new TLS cert to the consenters, encode it, and publish it to the channel. This executes successfully. I perform the same operation for the application channel, which also returns success.

Also, after doing the config update, wait for some time so that orderer0 receives the config blocks. At this point we assume that the config update has been published, but we need to be sure. For debugging, can you fetch the channel config, decode it, and check whether your new TLS cert for orderer0 is present? We want to make sure that the config update has been committed to orderer0.
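The check described above could be sketched as follows, once the fetched config has been decoded to JSON. This is an illustrative sketch, assuming the usual configtxlator-decoded layout in which each consenter carries a base64-encoded `server_tls_cert`; the function name is hypothetical.

```python
import base64


def consenters_with_cert(config, expected_pem):
    """List "host:port" for every consenter whose server TLS cert in
    the decoded channel config matches the expected PEM text."""
    # Assumed path of the Raft consenter list in the decoded JSON.
    consenters = (config["channel_group"]["groups"]["Orderer"]["values"]
                  ["ConsensusType"]["value"]["metadata"]["consenters"])
    wanted = expected_pem.strip()
    found = []
    for c in consenters:
        # Decode the stored cert and ignore surrounding whitespace.
        pem = base64.b64decode(c["server_tls_cert"]).decode().strip()
        if pem == wanted:
            found.append(f'{c["host"]}:{c["port"]}')
    return found
```

If the new orderer0 certificate was committed, the result should contain orderer0's address when run against the config fetched from each orderer in turn.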

shivdeep-singh-ibm avatar Aug 11 '22 07:08 shivdeep-singh-ibm

Here is what I observed. Updating syschannel and the application channel for ord0 works fine. I then replaced ord0's MSP and TLS cert. After that, when I fetch the syschannel config block from ord1, I see the new TLS cert of ord0. But when I update syschannel for ord1, I get the same error:

2022-08-11 08:40:40.849 UTC 0001 INFO [channelCmd] InitCmdFactory -> Endorser and orderer connections initialized
Error: got unexpected status: BAD_REQUEST -- error applying config update to existing channel 'syschannel': consensus metadata update for channel config update is invalid: 2 out of 3 nodes are alive, configuration will result in quorum loss

shobhitJava avatar Aug 11 '22 08:08 shobhitJava

I am getting the error below in ord1:

2022-08-11 08:52:48.193 UTC 01d7 ERRO [comm.tls] ClientHandshake -> Client TLS handshake failed after 1.089742ms with error: public key of server certificate presented by orderer0-hlf-ord.svc.cluster.local:7050 doesn't match the expected public key remoteaddress=10.80.43.165:7050

shobhitJava avatar Aug 11 '22 08:08 shobhitJava

> I am getting the error below in ord1:
>
> 2022-08-11 08:52:48.193 UTC 01d7 ERRO [comm.tls] ClientHandshake -> Client TLS handshake failed after 1.089742ms with error: public key of server certificate presented by orderer0-hlf-ord.svc.cluster.local:7050 doesn't match the expected public key remoteaddress=10.80.43.165:7050

Can you pull a config block from orderer0 and check the TLS certs in it?

shivdeep-singh-ibm avatar Aug 11 '22 10:08 shivdeep-singh-ibm

Okay. Once I push the channel update to syschannel and the application channel, I cannot fetch the config block from ord0. It returns the error Service Unavailable.

From ord1, I can fetch the config block, and in it I do see the new TLS cert of ord0.

shobhitJava avatar Aug 11 '22 12:08 shobhitJava

> Okay. Once I push the channel update to syschannel and the application channel, I cannot fetch the config block from ord0. It returns the error Service Unavailable.

So your orderer0 has some problem, or some step may be missing. However, you can go through this resource for a demo of rotating certs in Fabric.

shivdeep-singh-ibm avatar Aug 11 '22 12:08 shivdeep-singh-ibm

It got solved. The problem was that ord0 was not presenting the correct public key. I needed to manually restart ord0 so that it picked up the new certs.
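One way to confirm that a restarted orderer is actually serving the new certificate is to compare the certificate it presents with the one placed in the channel config. The sketch below uses only the Python standard library and compares SHA-256 fingerprints of the whole certificate rather than just the public key. The function names are hypothetical, and the live check may fail against a server that requires client certificates.

```python
import hashlib
import ssl


def cert_fingerprint(pem):
    """SHA-256 hex digest of the DER form of a PEM certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem.strip())
    return hashlib.sha256(der).hexdigest()


def serves_expected_cert(host, port, expected_pem):
    """Fetch the certificate presented at host:port (without verifying
    it against any CA) and compare it with the expected PEM."""
    presented = ssl.get_server_certificate((host, port))
    return cert_fingerprint(presented) == cert_fingerprint(expected_pem)
```

For example, `serves_expected_cert("orderer0-hlf-ord.svc.cluster.local", 7050, new_pem)` returning `False` after the secret was replaced would suggest the process is still running with the old certificate loaded, which is the situation a restart resolves.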

shobhitJava avatar Aug 11 '22 12:08 shobhitJava

Thanks for the help.

shobhitJava avatar Aug 12 '22 09:08 shobhitJava