fabric-operator icon indicating copy to clipboard operation
fabric-operator copied to clipboard

Doesn't re-queue after DNS lookup failure

Open jmcshane opened this issue 2 years ago • 1 comments

I deployed the orderers before the hosted zone started resolving. The operator didn't attempt to resolve the DNS domain and the orderers were stuck in a failing state:

  message: 'failed to reconcile node: Code: 21 - failed to initialize orderer node:
    ca is not reachable: pinging ''https://test-network-org0-ca-ca.org2.mydomain.com:443/cainfo''
    failed: health check request failed: Get "https://test-network-org0-ca-ca.org2.mydomain.com:443/cainfo":
    dial tcp: lookup test-network-org0-ca-ca.org2.mydomain.com on 10.43.0.10:53:

I understand the need to configure the DNS resolution, but the operator should be able to recover from this failure.

jmcshane avatar Jun 17 '22 15:06 jmcshane

Thanks - this is definitely something that should trigger a re-reconciliation - the orderers and peers need to be able to delay and/or retry the bootstrap CA initialization, eventually recovering from the error.

One item in the "feature backlog" for the peer and orderer CRDs is a caReference stanza to resolve a certificate authority by name rather than by explicit URL and TLS certificates, as currently hard-wired into the resource specs. With the current technique, peers and orderers can not be constructed UNTIL the CAs are active and the certificates have been extracted. A preferred approach will be to just have the peers and orderers reference a CA by name, delaying the reconciliation loop until the CAs are online. As a convenience, this means that peers, orderers, and CAs can be applied en masse in a single kustomization / kubectl application, and the operator will sort out the dependencies for the certificate bootstrapping. This should also help address the issue with DNS, as it will force the peers and orderers into a retry loop while the DNS entries are propagating.

jkneubuh avatar Jun 27 '22 11:06 jkneubuh