# Collector & Analyzer for certificates
### Describe the rationale for the suggested feature
Several support cases have come through where certificates have expired. We would like to be able to surface this at the analyzer stage so folks know immediately what to do about it.
### Describe the feature
Ensure that the certificate expiry information is available in support bundles for:
- Kubernetes
- Weave
- kURL registry
- Envoy
Create an analyzer that will highlight expired certificates from any of the above, along with a URL pointing to remediation instructions.
### Describe alternatives you've considered
.
### Additional context
Related Shortcut: https://app.shortcut.com/replicated/story/42738/add-collectors-and-analyzers-for-expired-certificates
Envoy: https://community.replicated.com/t/kurl-how-to-manually-rotate-an-expired-certificate-for-envoy/889
kURL Registry: https://community.replicated.com/t/what-to-do-if-the-kurl-registry-certificates-have-expired/955
Contour: https://community.replicated.com/t/kots-is-it-possible-to-configure-tls-certs-for-contour-in-the-embedded-installation/390/2
Checking the Kubernetes certs themselves is more difficult, as it requires running `kubeadm certs check-expiration` with sudo.
### Definition of done
- [x] Write a spec for the collectors and analysers - Done here
- [x] Develop an in-cluster collector
  - [x] Certs from secrets
  - [x] Certs from config maps
  - [ ] Certs from endpoints
  - [x] In-cluster Kubernetes cluster certs - most are in `kube-*` config maps
- [x] Develop a host collector
  - [x] Kubernetes cluster certs from host paths, i.e. `/etc/kubernetes/pki`, `/var/lib/kubelet/pki` on a kubeadm-based cluster
  - [ ] Certs from endpoints
  - [x] Certs from a list of paths on the host, e.g. `/etc/myservice/*.crt`
- [x] Develop an analyser - reviews expiry, amongst other things
- [ ] The default specs we provide in troubleshoot-specs include analyzers for the above certificates, with links to docs to fix them. Collect the following:
  - [ ] Kubernetes (host & in-cluster collectors)
  - [ ] kURL registry
  - [ ] Contour/Envoy
  - [ ] KOTS Admin Console (kotsadm-tls)
- [x] Document both the collector and analyser in the troubleshoot.sh docs
Suggestion:
- write a new 'certificates' collector which hunts for all the certificates we mention above, and dumps some information about them including expiry. If a cert isn't found, move on rather than error.
- write a new analyzer 'certificates' which checks validity and expiry by default (just need to include it to get all the certs with info collected above checked)
- There's already a host collector for certificates; maybe add k8s certificate info to that
Alternate suggestion:
- Write a new generic collector which, when provided an endpoint, connects (https?) and reads the cert info. Store the expiry date and maybe some basic metadata in the output.
- Write an analyzer that checks that output, and writes out a warning for certs that expire within X days
Note that with CoreDNS, we can reference endpoints by name within the cluster - so we can launch a pod and connect to the endpoints from there. This won't work for the k8s certificates themselves (since we can't launch pods if a cluster isn't working), but that should be a host collector anyway.
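To make the alternate suggestion concrete, here is a minimal Go sketch of the endpoint probe; `registry.kurl:443` is one of the endpoints discussed in this issue, and only metadata is kept, never the raw cert:

```go
// Sketch of the proposed generic endpoint collector: dial a TLS
// endpoint and record expiry metadata only.
package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

func main() {
	endpoint := "registry.kurl:443" // example in-cluster endpoint

	conn, err := tls.Dial("tcp", endpoint, &tls.Config{
		// Many of these certs are self-signed; we only want metadata,
		// so skip chain verification at collection time.
		InsecureSkipVerify: true,
	})
	if err != nil {
		fmt.Printf("endpoint %s unreachable: %v\n", endpoint, err)
		return
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	fmt.Printf("subject=%q notAfter=%s expiresIn=%s\n",
		leaf.Subject.String(),
		leaf.NotAfter.Format(time.RFC3339),
		time.Until(leaf.NotAfter).Round(time.Hour))
}
```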
The deciding point about how we move forward is likely to be how certificates for services are stored - if they're secrets, then we could reasonably access them with a search and program the logic into the collector. If not, we might need to supply a useful list of endpoints to collect from.
Next steps:
- Analysis of where the certs for various services are stored, and how we could access them from a collector
- Write up a more detailed requirements doc for the collector and analyzer based on that info
### Certificate storage locations
Certificates in a k8s environment are usually stored in 3 locations. Below are these locations and options for how troubleshoot can collect metadata from them:
- Directories on the nodes
  - In cluster: run a pod that mounts a `hostPath` volume to access certificates. May require more privileges
  - Out of cluster: direct access to files
- In k8s secrets
  - Requires adequate RBAC
  - Mapping of fields in the secret's data. Some secrets can be JSON, YAML, plain PEM files etc.
- Embedded in an image
  - Exec into a running container
In all of the above scenarios, we can connect to endpoints and pull certs out.
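For the secrets case, a minimal client-go sketch of pulling PEM bytes out of a `kubernetes.io/tls` secret; the `envoycert`/`projectcontour` names are examples from this issue, and adequate RBAC (get on secrets) is assumed:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	secret, err := client.CoreV1().Secrets("projectcontour").Get(
		context.Background(), "envoycert", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// kubernetes.io/tls secrets keep the PEM chain under "tls.crt";
	// other secret shapes would need the field mapping mentioned above.
	pemBytes, ok := secret.Data["tls.crt"]
	if !ok {
		fmt.Println("no tls.crt field; skip rather than error")
		return
	}
	fmt.Printf("collected %d bytes of PEM for parsing\n", len(pemBytes))
}
```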
### Troubleshoot design spec
For a kURL cluster, the certificates we would like to collect metadata from are as follows:
- Kubernetes services (etcd, api-server...)
  - All certificates are stored in one location (kubeadm default: `/etc/kubernetes/pki`). The cert directory path can be read from `kubectl -n kube-system get configmaps kubeadm-config -o yaml`
  - As a host collector, extract certificate data from on-disk paths such as kubelet configurations. This should help diagnose whether a kubelet's certificate is valid. In some cases we might want to check the private keys as well; that's where the TLS certificate collector comes into play
- kURL registry
  - Query the certificate from `registry.kurl:443`
  - Read the `registry-pki` secret: `kubectl describe secrets registry-pki -n kurl`
- Contour & Envoy
  - Query the certificate from `envoy.projectcontour:443`
  - Read `kubectl describe secrets envoycert -n projectcontour`
  - Read `kubectl describe secrets contourcert -n projectcontour`
- kURL proxy
  - Read `kubectl describe secrets kotsadm-tls`
- Prometheus
  - Query `prometheus-adapter.monitoring:443` - this is a self-issued cert where the CA is in the container.
- Additional endpoints which we can connect to and pull out certs
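As a sketch of the kubeadm-config lookup suggested above (a real implementation would parse the ClusterConfiguration YAML properly; this simply scans for the `certificatesDir` key and assumes in-cluster credentials):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
		context.Background(), "kubeadm-config", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	certDir := "/etc/kubernetes/pki" // kubeadm default, used as a fallback
	for _, line := range strings.Split(cm.Data["ClusterConfiguration"], "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "certificatesDir:") {
			certDir = strings.TrimSpace(strings.TrimPrefix(trimmed, "certificatesDir:"))
		}
	}
	fmt.Println("certificate directory:", certDir)
}
```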
### Collector
To collect certificate metadata from the above services, and any other additional ones, we propose creating a new certificates collector instead of modifying existing ones.
- What (at minimum) to collect
- Subject
- Issuer
- NotAfter/NotBefore (used to tell if the cert is valid)
- Spec

```yaml
apiVersion: troubleshoot.sh/v1beta2
...
spec:
  collectors:
    - certificates:
        excludeClusterCerts: false # true to ignore k8s cluster certs (api-server, etcd ...); default: false
        endpoints: # List of endpoints to pull TLS certs from for metadata extraction
          - registry.kurl:443
          - my-app.namespace:443
```
- Implementation considerations/notes
  - K8s cluster certs will be a "best effort" type of operation where the collector tries to extract certificate metadata from all possible locations, i.e. `/etc/kubernetes/pki`, Kubernetes secrets, pulling from endpoints etc. Not all clusters are kURL clusters, let alone kubeadm (is this true??).
  - Preference is given to pulling certs from endpoints. This should be the most accurate source of truth.
  - Kubernetes clusters stood up using kubeadm can utilise the kubeadm APIs to fetch certificates. In cluster, this would need to run in a pod that mounts the relevant `hostPath`s. We need to check the stability of the API.
- Output results of the collector are stored in `/certificates.json`:
```json
// List of certificate metadata
[
  {
    "source": {
      "endpoint": "registry.kurl:443",          // omit endpoint fields if cert not pulled from an endpoint
      "endpointReachable": false,
      "certificatePath": "/path/to/my/cert.crt" // omit path if cert not pulled from a path
    },
    "errors": [], // List of errors that occurred when running the collector, e.g. host path not found, connection errors...
    "certificateChain": [ // Chain of certificates in order of their depth. The first one (0) should be the leaf cert.
      {
        "version": "",
        "serialNumber": "",
        "subject": "CN = registry.kurl.svc.cluster.local",
        "issuer": "CN = kubernetes",
        "notBefore": "2021-03-04T18:00:00Z",
        "notAfter": "2021-06-02T18:00:00Z",
        "isCA": false,
        "subjectAlternativeNames": ["name1", "name2", "name3"]
      },
      ....
    ]
  },
  {
    ...
  }
]
```
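As a sketch of how the collector might populate the `certificateChain` metadata above from a PEM bundle, using Go's crypto/x509; field names follow the proposed JSON and the input path is illustrative:

```go
package main

import (
	"crypto/x509"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// certMeta mirrors the per-certificate fields in the proposed output.
type certMeta struct {
	Version                 int      `json:"version"`
	SerialNumber            string   `json:"serialNumber"`
	Subject                 string   `json:"subject"`
	Issuer                  string   `json:"issuer"`
	NotBefore               string   `json:"notBefore"`
	NotAfter                string   `json:"notAfter"`
	IsCA                    bool     `json:"isCA"`
	SubjectAlternativeNames []string `json:"subjectAlternativeNames"`
}

// chainFromPEM walks every CERTIFICATE block in a PEM bundle, leaf first.
func chainFromPEM(data []byte) ([]certMeta, error) {
	var chain []certMeta
	for {
		block, rest := pem.Decode(data)
		if block == nil {
			break
		}
		data = rest
		if block.Type != "CERTIFICATE" {
			continue // skip keys and other PEM blocks
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, err
		}
		chain = append(chain, certMeta{
			Version:                 cert.Version,
			SerialNumber:            cert.SerialNumber.String(),
			Subject:                 cert.Subject.String(),
			Issuer:                  cert.Issuer.String(),
			NotBefore:               cert.NotBefore.UTC().Format(time.RFC3339),
			NotAfter:                cert.NotAfter.UTC().Format(time.RFC3339),
			IsCA:                    cert.IsCA,
			SubjectAlternativeNames: cert.DNSNames,
		})
	}
	return chain, nil
}

func main() {
	pemBytes, err := os.ReadFile("/path/to/my/cert.crt") // illustrative path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	chain, err := chainFromPEM(pemBytes)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	out, _ := json.MarshalIndent(chain, "", "  ")
	fmt.Println(string(out))
}
```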
If multiple `certificates` collectors are defined in a spec, certificates are appended to the previously created list.
### Analysers
- The certificate analyser will check the following, at the least:
- Expired certificates
- Unreachable endpoints
- Spec

```yaml
apiVersion: troubleshoot.sh/v1beta2
...
spec:
  analyzers:
    - certificates: # Iterate through list of certificates
        outcomes:
          - pass:
              when: ""
              message: ""
          - warn:
              when: "notAfter < TODAY + 15 days"
              message: "<Subject name> certificate is about to expire"
          - fail:
              when: "notAfter < TODAY"
              message: "<Subject name> has expired"
```
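To make the outcome conditions concrete, here is a minimal Go sketch of how an analyzer might bucket a certificate, assuming the 15-day warn window from the example `when` clauses:

```go
package main

import (
	"fmt"
	"time"
)

// expiryOutcome maps a certificate's notAfter to the outcome buckets
// in the spec above; the 15-day warn window mirrors the example.
func expiryOutcome(notAfter, now time.Time) string {
	switch {
	case notAfter.Before(now):
		return "fail" // notAfter < TODAY
	case notAfter.Before(now.AddDate(0, 0, 15)):
		return "warn" // notAfter < TODAY + 15 days
	default:
		return "pass"
	}
}

func main() {
	soon := time.Now().AddDate(0, 0, 10)
	fmt.Println(expiryOutcome(soon, time.Now())) // warn
}
```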
### Redactor
- Add a redactor to remove certain fields from the certificate metadata, e.g. `subjectAlternativeNames`, which will contain IP addresses and hostnames.
- Provide a list mapping which certificate sources should be redacted. This list is a mapping of what we have in the `endpoint` and `certificatePath` fields.
Other bits to note
- Raw cert files are intentionally not collected. They are treated as private data, like IP addresses. We could consider making that opt-in in future.
- Some of the functionality of the TLS certificate host collector could be incorporated here, but that is probably out of scope. That collector performs more checks, such as comparing that the private key matches the public key embedded in the X509 certificate.
UPDATES:
- Added `source`, `version` and `serialNumber` fields to the collected certificate metadata.
- Added a spec for a redactor. Perhaps this can be done in a different GH issue.
UPDATES-2:
- Added a certificate chain representation to the collector output to capture intermediate certificate metadata as well. This is important because verification works up a certificate chain all the way to the root cert, so seeing intermediate certificates is necessary.
### Definition of done
To complete the work as per the proposed spec we need to
- [ ] Develop an in-cluster collector (spec & implementation)
- [ ] Develop an analyser (spec & implementation)
- [ ] Document both the collector and analyser in troubleshoot.sh docs
One other customer issue seen recently is that the secret with a TLS pair for ingress, named ingress-auto-tls, could not be verified against the corporate CA. If that's something we can confirm with this, it would be amazing.
Another issue came our way where Contour certificates were invalid.
After reviewing issue notes again, working on this solution, and receiving feedback from previous PRs, I wanted to give an update on my perspective on what we should do to solve this issue.
I view addressing this issue as an opportunity to collect and validate both k8s host certificates that are critical to the operation & use of a k8s cluster, and SSL certificates that pods utilize to satisfy encrypted communications.
From my perspective, there are four certificate validation use cases (see below) that I think we should satisfy. If we cannot validate, then at least surface the information so it can be easily reviewed if required during the troubleshooting process.
Certificate Validation Use Cases:
1. Certificate-key is a valid pair (i.e. compare the output of the two commands below):
   - `openssl rsa -noout -modulus -in apiserver.key | openssl md5` # key
   - `openssl x509 -noout -modulus -in apiserver.crt | openssl md5` # certificate
2. Verify that a certificate is issued by a CA (i.e. `openssl verify -verbose -CAfile cacert.pem certName.crt`)
3. Check expiration date (i.e. `cat /etc/kubernetes/pki/certName.crt | openssl x509 -noout -enddate`)
4. CN/hostname verification against the SSL certificate (i.e. `openssl s_client -connect url:443 -servername url`)
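For illustration, a hedged Go sketch of use cases 1 and 2 using the standard library rather than shelling out to openssl. File names are the ones from the examples above; note that `Verify` also checks the server-auth extended key usage by default:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func mustRead(path string) []byte {
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// Use case 1: the certificate and key are a valid pair.
	if _, err := tls.X509KeyPair(mustRead("apiserver.crt"), mustRead("apiserver.key")); err != nil {
		fmt.Println("key pair invalid:", err)
	}

	// Use case 2: the certificate is issued by the given CA.
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(mustRead("cacert.pem")) {
		panic("no CA certs parsed")
	}
	block, _ := pem.Decode(mustRead("certName.crt"))
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	if _, err := cert.Verify(x509.VerifyOptions{Roots: roots}); err != nil {
		fmt.Println("CA verification failed:", err)
	}
}
```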
Additionally, I believe that we should NOT capture and save actual certificates in support-bundles as I consider them private data that should not exit customer boundaries.
I envision two collectors:
- In-Cluster SSL Collector - collect/analyze SSL certificates located in secrets or configMaps.
  - Collects SSL certificates utilized by pods.
- K8s host certificates:
  - Collect/analyze controller node k8s certificates that are located in the /etc/kubernetes/pki directory and subdirectories.
  - Collect/analyze load balancer SSL certificates that front multi-controller k8s clusters (end-user configurable via the support-bundle spec).
### In-Cluster SSL Certificate Collector:
Purpose: Collect SSL certificates that pods utilize for encrypted communications.
I think the support-bundle spec should be as follows:

```yaml
spec:
  collectors:
    - inclustercertcollector: configOption
```

ConfigOptions:
- `{}`: entire cluster
- `namespace`: a named namespace
- `secret`: a named secret
- `configmap`: a named configMap
Allowing collection configuration options will give end users the flexibility to scope collection to their needs.
Solution:
- Certificate-key is a valid pair:
  - A collector (HostCertificate) already exists; I propose adding the code into this collector and deprecating the HostCertificate collector. This will allow us to provide an aggregated data set for each SSL certificate in a JSON that can easily be viewed and/or validated with an analyzer.
- Verify that a certificate is issued by a CA:
  - Leverage the OpenSSL Go module or `exec.Command()`?
  - Compare certificate and key and collect isValid.
  - Collect errors, if any.
- Check expiration date:
  - Parse secret.certificateName.crt.certificate.NotAfter.
  - Collect errors, if any.
- CN/hostname verification against the SSL certificate:
  - It would be ideal to leverage the Go "crypto/tls" package to validate the hostname
  - Examples:
    - `conn.VerifyHostname("in-cluster-FQDN")`
    - `conn.ConnectionState().PeerCertificates`
The challenge I see with this approach is that we require access to in-cluster DNS. I see three ways to accomplish this:
a. Expose in-cluster DNS externally - this is not an option, as it is a customer-owned decision.
b. Launch a runPod and run a bash script to collect information and write it out to a file, then have the collector (Go code) parse the file and process its content - awkward; not ideal.
c. Build a "Go" pod that has the ability to run the support-bundle binary - a great option, but it will take more planning and build time, and has GAP tactical considerations to manage through. Question: are there any other options that I did not mention?
Of the three options articulated above, I think (c) is the most viable, but it will take time and planning; we also have GAP install considerations to account for. So in the spirit of providing a solution sooner rather than later, I propose that we just collect and store CN/hostname information so it can be visually inspected; see the sketch after this paragraph.
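As a rough illustration of that collection step, a Go sketch of the CN/hostname check, assuming in-cluster DNS resolution is available (per the challenge above). Note that `Conn.VerifyHostname` returns an error when `InsecureSkipVerify` is set (no verified chains), so this checks the leaf certificate directly; the endpoint is a placeholder:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	host := "my-service.my-namespace.svc" // placeholder in-cluster FQDN
	conn, err := tls.Dial("tcp", host+":443", &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	if err := leaf.VerifyHostname(host); err != nil {
		// A candidate entry for the struct's ErrorMap below.
		fmt.Println("hostname mismatch:", err)
	}
}
```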
Certificate Struct
```go
type sslCert struct {
	CertName          string            `json:"Certificate Name"`
	DNSNames          []string          `json:"DNS Names"`
	IssuerCommonName  string            `json:"Issuer"`
	Organizations     []string          `json:"Issuer Organizations"`
	CertDate          time.Time         `json:"Certificate Expiration Date"`
	CertDateIsValid   bool              `json:"CertDateIsValid"`
	CAIsValid         bool              `json:"CAIsValid"`
	CrtKeyPairIsValid bool              `json:"CrtKeyIsValid"`
	Location          location          `json:"Location,omitempty"`
	ErrorMap          map[string]string `json:"ErrorMap,omitempty"`
}
```
Results for collected SSL certificates will be returned in a single JSON file.
Summary: This solution will have the ability to surface information for in-cluster SSL certificates for the entire cluster (`{}`), a named namespace, or a named secret/configMap. Furthermore, it will provide information that satisfies 3 of the 4 certificate validation use cases mentioned above. For the unsolved use case (CN/hostname verification), the collector will collect the CN/hostname information so it can be viewed and inspected if required when troubleshooting.
### K8s Host Certificate Collector:
Purpose: Collect all controller node certificates that are utilized for operating the k8s cluster. This gives the end user a holistic view of the k8s certificate status for a node. This solution also includes the ability to validate load balancer SSL certificates that front a multi-controller k8s cluster, by specifying a URL and port in the support-bundle spec.
I think the support-bundle spec should be as follows:

```yaml
spec:
  hostCollectors:
    - k8snodecertcollector:
        url: example.com
        port: port#
```

Note:
- url: host FQDN
- port: host port number
Solution:
- Certificate-key is a valid pair:
  - A collector (HostCertificate) already exists; I propose adding the code into this collector and deprecating the HostCertificate collector. This will allow us to provide an aggregated data set for each SSL certificate in a JSON that can easily be viewed and/or validated with an analyzer.
  - Collect errors, if any.
- Verify that a certificate is issued by a CA:
  - Leverage the OpenSSL Go module or `exec.Command()`?
  - Compare certificate and key and collect isValid.
  - Collect errors, if any.
- Check expiration date:
  - Parse secret.certificateName.crt.certificate.NotAfter.
  - Collect errors, if any.
- CN/hostname verification against the SSL certificate:
  - Leverage the Go "crypto/tls" package to validate SSL certificate information for load balancers that front a k8s cluster.
  - Collect errors, if any.
  - Examples:
    - `conn.VerifyHostname("in-cluster-FQDN")`
    - `conn.ConnectionState().PeerCertificates`
Certificate Struct
```go
type sslCert struct {
	CertName          string            `json:"Certificate Name"`
	HostURL           string            `json:"HostURL,omitempty"`
	HostPort          int               `json:"HostPort,omitempty"`
	DNSNames          []string          `json:"DNS Names"`
	IssuerCommonName  string            `json:"Issuer"`
	Organizations     []string          `json:"Issuer Organizations"`
	CertDate          time.Time         `json:"Certificate Expiration Date"`
	CertDateIsValid   bool              `json:"CertDateIsValid"`
	CAIsValid         bool              `json:"CAIsValid,omitempty"`
	CrtKeyPairIsValid bool              `json:"CrtKeyIsValid,omitempty"`
	HostIsValid       bool              `json:"HostIsValid,omitempty"`
	Location          location          `json:"Location,omitempty"`
	ErrorMap          map[string]string `json:"ErrorMap,omitempty"`
}
```
Results for collected host certificates will be returned in a single JSON file.
Summary: This solution provides the ability to surface information for all certificates that are used to operate the k8s cluster on a specific node; this is accomplished by scraping the /etc/kubernetes/pki directory and subdirectories. For these certificates, we will satisfy 3 of the 4 validation use cases. For the 4th use case (CN validation), the CN/hostname will be collected to provide the ability to visually inspect it in a support bundle.
This solution also includes the end-user option to add an FQDN/port in the support-bundle spec to validate the load balancer SSL certificates. It will satisfy 3 of the 4 use cases for load balancers that front a multi-controller k8s cluster. Validation of a certificate key is not applicable from the front end (unless there is something I don't know?).
@xavpaice / @banjoh, please let me know your thoughts.
I like the JSON for cert info; that's the useful info and all we should need. Having the '*valid' keys means the collector can do the validation we want without needing an analyzer to read the cert, since they're private.
Regarding collecting the entire cluster - that's possibly an issue in that we need to understand what to search for. How does that work? I can see what to collect if we have a secret name; I don't understand how to effectively read all the cluster secrets and hunt for certs (if that's what you mean).
I don't understand what you're planning to do with the HostCertificate collector?
Let's review the original need for this, and understand if the proposed MVP solves it:
- certs keep expiring for k8s, Weave, kURL registry and Envoy
The certs for those are stored either in /etc/ or in secrets. Since those are two quite different places, let's pick one.
I therefore suggest a minimal initial collector:
- reads the secrets, if present, for the kURL registry, Contour & Envoy, kURL proxy, and Prometheus
- generates the json from those certs
And an analyzer:
- read the json
- highlight if the 'valid' bools are not healthy
- highlight if the cert is close to expiry
The host certs can be a separate collector/analyzer (let's write that up as a different issue with the summarized learnings from this one).
Thoughts?
@xavpaice and I just met to collaborate on this issue. Xav has requested that I follow this direction:
1 - Complete the in-cluster certificate collector; collect via secrets and parse certificate data.
2 - Do not collect SSL certificates; just pull back parsed data that we require.
3 - Collector should only collect named certificates that are contained in secrets.
4 - Forgo collecting certificates cluster-wide and named namespace for now (we can revisit later).
5 - Create a new issue for the k8s certificate host collector and deploy that separately (focus on the in-cluster first).
6 - Deploy in the following order:
- in-cluster certificate collector (named only)
- k8s certificate collector (host collector)
- deploy the analyzer (note: we will only need one analyzer)
@xavpaice please let me know if there is anything I missed or needs to be corrected. Thank you!
- PR to support the following support-bundle YAML: https://github.com/replicatedhq/troubleshoot/pull/1119 (Ready to Review)
```yaml
spec:
  collectors:
    - certificates:
        secrets:
          - name: envoycert
            namespaces:
              - kube-system
              - projectcontour
        configMaps:
          - name: envoycert
            namespaces:
              - kube-system
              - projectcontour
```
- PR to support the following analyzer YAML: https://github.com/replicatedhq/troubleshoot/pull/1128 (Ready to Review)
```yaml
analyzers:
  - certificates: # Iterate through list of certificates
      outcomes:
        - pass:
            message: "certificate is valid"
        - warn:
            when: "notAfter < Today + 15 days"
            message: "certificate is about to expire"
        - fail:
            when: "notAfter < Today"
            message: "certificate has expired"
```
For the host certificate collector, we already have one that checks the certificate key pair. To avoid breaking the current code, I suggest the YAML should look like:
```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: certificate
spec:
  hostCollectors:
    - certificate:
        certificatePath: /etc/ssl/corp.crt
        keyPath: /etc/ssl/corp.key
    - certificate:
        certificatePath: /etc/kubernetes/pki/apiserver-etcd-client.crt
  hostAnalyzers:
    - certificate:
        outcomes:
          - fail:
              when: "key-pair-missing"
              message: Certificate key pair not found in /etc/ssl
          - fail:
              when: "key-pair-switched"
              message: Cert and key pair are switched
          - fail:
              when: "key-pair-encrypted"
              message: Private key is encrypted
          - fail:
              when: "key-pair-mismatch"
              message: Cert and key do not match
          - fail:
              when: "key-pair-invalid"
              message: Certificate key pair is invalid
          - pass:
              when: "key-pair-valid"
              message: Certificate key pair is valid
          - pass:
              when: "certificated-not-expired"
              message: Certificate is not expired
          - warn:
              when: "certificated-about-expire-in-15-days"
              message: Certificate is about to expire in 15 days
          - fail:
              when: "certificated-expired"
              message: Certificate is expired
```
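For reference, a hedged Go sketch of how the key-pair `when` states above might be derived; the state strings mirror the proposed outcomes, and the paths come from the spec:

```go
package main

import (
	"crypto/tls"
	"encoding/pem"
	"fmt"
	"os"
	"strings"
)

// keyPairState derives one of the proposed `when` states for a
// cert/key pair read from disk.
func keyPairState(certPEM, keyPEM []byte) string {
	keyBlock, _ := pem.Decode(keyPEM)
	if keyBlock == nil {
		return "key-pair-missing"
	}
	if keyBlock.Type == "ENCRYPTED PRIVATE KEY" ||
		strings.Contains(keyBlock.Headers["Proc-Type"], "ENCRYPTED") {
		return "key-pair-encrypted"
	}
	if _, err := tls.X509KeyPair(certPEM, keyPEM); err != nil {
		// If swapping the inputs yields a valid pair, the files were switched.
		if _, swapped := tls.X509KeyPair(keyPEM, certPEM); swapped == nil {
			return "key-pair-switched"
		}
		return "key-pair-mismatch"
	}
	return "key-pair-valid"
}

func main() {
	cert, _ := os.ReadFile("/etc/ssl/corp.crt") // paths from the spec above
	key, _ := os.ReadFile("/etc/ssl/corp.key")
	fmt.Println(keyPairState(cert, key))
}
```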
Since the private keys are not all that useful as cert metadata, I am proposing to use this format for the host certificate collector.
Support bundle YAML:
```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: certificate
spec:
  hostCollectors:
    - certificate:
        certificatePath:
          - ~/apiserver-kubelet-client.crt
          - /etc/ssl/corp.crt
          - ~/ca.crt
  hostAnalyzers:
    - certificate:
        outcomes:
          - pass:
              message: Certificate is valid
          - warn:
              when: "notAfter < Today + 365 days"
              message: Certificate is about to expire
          - fail:
              when: "notAfter < Today"
              message: Certificate is expired
```
Result JSON:

```json
[
  {
    "name": "host.cerfiticates.verification",
    "labels": {
      "desiredPosition": "1",
      "iconKey": "",
      "iconUri": ""
    },
    "insight": {
      "name": "host.cerfiticates.verification",
      "labels": {
        "iconKey": "",
        "iconUri": ""
      },
      "primary": "Host Cerfiticates Verification",
      "detail": "Certificate is expired, obtained from /Users/dexteryan/dev/replicated/troubleshoot/apiserver-kubelet-client.crt",
      "severity": "error"
    },
    "severity": "error",
    "analyzerSpec": "",
    "error": "Certificate is expired, obtained from /Users/dexteryan/dev/replicated/troubleshoot/apiserver-kubelet-client.crt"
  },
  {
    "name": "host.cerfiticates.verification",
    "labels": {
      "desiredPosition": "1",
      "iconKey": "",
      "iconUri": ""
    },
    "insight": {
      "name": "host.cerfiticates.verification",
      "labels": {
        "iconKey": "",
        "iconUri": ""
      },
      "primary": "Host Cerfiticates Verification",
      "detail": "Certificate is expired, obtained from /Users/dexteryan/dev/replicated/troubleshoot/ca.crt",
      "severity": "error"
    },
    "severity": "error",
    "analyzerSpec": "",
    "error": "Certificate is expired, obtained from /Users/dexteryan/dev/replicated/troubleshoot/ca.crt"
  }
]
```
We need to document the work that is currently available so folks can use it.
> Since the private keys are not all that useful as cert metadata, I am proposing to use this format for the host certificate collector
I think we should not modify the existing TLS collector. If we find that we do not need it, we will follow a deprecation process.
@DexterYan
Thank you! I have updated #1132 PR to add back the existing TLS collector and split the new one separately.
@banjoh can we get a fresh summary of the remaining work to close this issue?
I have updated the tasks in the Definition of done. In summary, the missing pieces of work are:
- Collector (in-cluster & host) for endpoints, i.e. pulling certs from a URL
- Adding specs to our default specs
It's unlikely we'll invest more in this right now; closing till there's more demand.