Tenant pods fail after certificate renewal: certificate signed by unknown authority
Our MinIO tenant pods fail regularly whenever the internal certificate is renewed by the operator:
Error: Post "https://miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local:9000/minio/peer/v37/backgroundhealstatus": tls: failed to verify certificate: x509: certificate signed by unknown authority (*rest.NetworkError)
It appears that MinIO doesn't pick up changes made to the CA files when the certificates are renewed.
In our specific deployment we've combined the requestAutoCert: true setting with an externalCertSecret, which is an external cert-manager-issued Let's Encrypt certificate that we use to achieve E2E encryption via a passthrough Ingress object. I'm not sure whether this contributes to the issue.
When curl-ing the endpoint manually, you can see that it is already serving the renewed certificate (valid from Mar 28 16:51:34 2024 GMT):
$ curl -vik https://miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local:9000/
* Trying 10.129.4.20...
* TCP_NODELAY set
* Connected to miniotenant-ss-0-1.miniotenant-hl.cld-1225.svc.cluster.local (10.129.4.20) port 9000 (#0)
<...>
* Server certificate:
* subject: O=system:nodes; CN=system:node:*.miniotenant-hl.cld-1225.svc.cluster.local
* start date: Mar 28 16:51:34 2024 GMT
* expire date: Apr 15 23:29:11 2024 GMT
* issuer: CN=kube-csr-signer_@1710631750
* <...>
which matches the renewed CA within the container:
$ cat /tmp/certs/CAs/hostname-1.crt
-----BEGIN CERTIFICATE-----
MIIDfDCCAmSgAwIBAgIRAP7oVlZ1NuDmhsaLw4NVwAMwDQYJKoZIhvcNAQELBQAw
JjEkMCIGA1UEAwwba3ViZS1jc3Itc2lnbmVyX0AxNzEwNjMxNzUwMB4XDTI0MDMy
ODE2NTEzNFoXDTI0MDQxNTIzMjkxMVowWTEVMBMGA1UEChMMc3lzdGVtOm5vZGVz
MUAwPgYDVQQDDDdzeXN0ZW06bm9kZToqLm1pbmlvdGVuYW50LWhsLmNsZC0xMjI1
LnN2Yy5jbHVzdGVyLmxvY2FsMFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEy70p
Zo4d0j7rBJGM0gwKt9oSqaG3M/38a+9RfwUMxb/N1dzGAEpyCyHfvQeQRP4C8wbZ
kXASmau3qH2GW25tLqOCATswggE3MA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAK
BggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMB8GA1UdIwQYMBaAFH807HJskwpoeu60
6qVPC/yTJva3MIHgBgNVHREEgdgwgdWCQm1pbmlvdGVuYW50LXNzLTAtezAuLi4x
fS5taW5pb3RlbmFudC1obC5jbGQtMTIyNS5zdmMuY2x1c3Rlci5sb2NhbIIgbWlu
aW8uY2xkLTEyMjUuc3ZjLmNsdXN0ZXIubG9jYWyCDm1pbmlvLmNsZC0xMjI1ghJt
aW5pby5jbGQtMTIyNS5zdmOCKyoubWluaW90ZW5hbnQtaGwuY2xkLTEyMjUuc3Zj
LmNsdXN0ZXIubG9jYWyCHCouY2xkLTEyMjUuc3ZjLmNsdXN0ZXIubG9jYWwwDQYJ
KoZIhvcNAQELBQADggEBAKb8FZZ8qewUtzmGGVlOMnZnJN064Nq2RWoNqNHz2mHz
JyabvVGD/ogLKbN7rKNkWfnZSzTsZv9OFjSmpEQkTq1duuKPDWxxdu/g3AVD6uiJ
Dy3WtTAKUTKugGCzt0Vv9WfEawvtoYGvJVFRg8MPbEvct9CugGdiXrDjUOJDh3DK
ABI5NawwsgPfqy8XsdMaDnLevh9mvDyQmWOSzw6Z0MftRxucXnc0YDsHPWXEG3TK
70a2yQJWttpKIQPpS5oEj4lxirum1BqRjIuS4pO9XXlo21RhQDGES26siSVBYt/f
WPAZJeRO5uuJu2h6qGL5BA1UH/op5Z8V/ELXA9rj6l4=
-----END CERTIFICATE-----
---
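Decoded for readability, e.g. with:
$ openssl x509 -in /tmp/certs/CAs/hostname-1.crt -noout -text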
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            fe:e8:56:56:75:36:e0:e6:86:c6:8b:c3:83:55:c0:03
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kube-csr-signer_@1710631750
        Validity
            Not Before: Mar 28 16:51:34 2024 GMT
            Not After : Apr 15 23:29:11 2024 GMT
        Subject: O=system:nodes, CN=system:node:*.miniotenant-hl.cld-1225.svc.cluster.local
---
I suspect this is the CA that should be used by MinIO; however, it seems like MinIO is still using an outdated one from memory to verify requests to its cluster members, as the error message tls: failed to verify certificate: x509: certificate signed by unknown authority suggests. It looks like the renewed certificate file itself was read from disk, but the CA file wasn't.
Of course, the /tmp/certs/CAs/ directory also contains the Let's Encrypt CA certificate R3 (externalCertSecret), but that will only become an issue when that specific certificate is renewed in a couple of weeks, so we'll ignore it for now.
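For completeness, the validity windows of all files in that directory can be listed like this (a sketch; assumes openssl is available in the container or in a debug container with the same mount):
$ for f in /tmp/certs/CAs/*.crt; do echo "$f"; openssl x509 -in "$f" -noout -subject -dates; done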
Expected Behavior
MinIO should automatically pick up certificate renewals, and all tenant pods should reload the changes made to the CA files as well.
Current Behavior
This issue can be resolved temporarily (until the next certificate renewal) by restarting the MinIO service with the mc CLI:
mc admin service restart <tenant>
This proves that the problem isn't with the certificates themselves, but with MinIO not processing the renewed certs correctly: the command doesn't alter the certificate files at all; they are probably just re-read from disk when the service is restarted.
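A concrete invocation looks roughly like this (alias name, endpoint and credentials are placeholders):
$ mc alias set miniotenant https://minio.cld-1225.svc.cluster.local <ACCESS_KEY> <SECRET_KEY>
$ mc admin service restart miniotenant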
Possible Solution
n.a.
Steps to Reproduce (for bugs)
- Spin up a MinIO tenant using this YAML (apply it e.g. with kubectl; see the sketch after these steps):
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  labels:
    app.kubernetes.io/component: miniotenant
    app.kubernetes.io/instance: miniotenant
    app.kubernetes.io/name: miniotenant
  name: miniotenant
  namespace: cld-1225
scheduler:
  name: ''
spec:
  requestAutoCert: true
  exposeServices:
    console: false
    minio: false
  serviceAccountName: miniotenant-sa
  users:
    - name: miniotenant-user-1
  imagePullSecret: {}
  imagePullPolicy: IfNotPresent
  configuration:
    name: miniotenant-env-configuration
  pools:
    - affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: v1.min.io/tenant
                    operator: In
                    values:
                      - miniotenant
              topologyKey: kubernetes.io/hostname
      name: ss-0
      resources:
        limits:
          cpu: 50m
          memory: 400Mi
        requests:
          cpu: 2m
          memory: 200Mi
      servers: 2
      volumeClaimTemplate:
        apiVersion: v1
        kind: persistentvolumeclaims
        metadata: {}
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
          storageClassName: csi-rbd-sc
        status: {}
      volumesPerServer: 2
  podManagementPolicy: Parallel
  image: 'minio/minio:RELEASE.2024-02-17T01-15-57Z'
  features:
    domains:
      console: 'https://minio-console-cld-1225.apps.<openshift-cluster>.com'
      minio:
        - 'https://minio-cld-1225.apps.<openshift-cluster>.com'
    enableSFTP: false
  mountPath: /export
  externalCertSecret:
    - name: miniotenant-certificate-secret-tls
      type: kubernetes.io/tls
status:
  usage:
    capacity: 2055798784
    rawCapacity: 4294967296
    rawUsage: 70713344
    usage: 70713344
  availableReplicas: 2
  healthMessage: Service Unavailable
  healthStatus: red
  provisionedUsers: true
  pools:
    - legacySecurityContext: false
      ssName: miniotenant-ss-0
      state: PoolInitialized
  currentState: Initialized
  drivesOffline: 2
  revision: 0
  certificates:
    autoCertEnabled: true
    customCertificates:
      minio:
        - certName: miniotenant-certificate-secret-tls
          domains:
            - minio-cld-1225.apps.<openshift-cluster>.com
            - minio-cld-1225.apps.<openshift-cluster>.com
            - minio-console-cld-1225.apps.<openshift-cluster>.com
          expiresIn: '72 days, 22 hours, 55 minutes, 32 seconds'
          expiry: '2024-06-13T13:58:12Z'
          serialNo: '426110582089806621805261802254366324527450'
        - certName: miniotenant-certificate-secret-tls
          domains:
            - R3
          expiresIn: '532 days, 0 hours, 57 minutes, 20 seconds'
          expiry: '2025-09-15T16:00:00Z'
          serialNo: '192961496339968674994309121183282847578'
  drivesOnline: 2
  syncVersion: v5.0.0
  writeQuorum: 3
- Wait until the certificate and CA are renewed by the Operator
- Check the logs for requests to other cluster members; they should fail with the error message mentioned above. The Tenant object's /status/healthMessage is also Service Unavailable (see the kubectl sketch below).
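For steps 1 and 3, a rough kubectl sketch (the manifest file name is a placeholder; namespace, tenant and pod names match the manifest above):
# step 1: create the tenant (the status stanza in the manifest is ignored on apply)
$ kubectl apply -n cld-1225 -f tenant.yaml
# step 3: after the renewal, check the tenant health and grep the pod logs
$ kubectl -n cld-1225 get tenant miniotenant -o jsonpath='{.status.healthMessage}'
$ kubectl -n cld-1225 logs miniotenant-ss-0-0 | grep "certificate signed by unknown authority"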
Context
Regression
Your Environment
- Version used (minio-operator): minio-operator v5.0.13
- Environment name and version (e.g. kubernetes v1.17.2): Red Hat OpenShift v4.12
- Server type and version: minio/minio:RELEASE.2024-02-17T01-15-57Z
- Operating System and version (uname -a): n.a.
- Link to your deployment file: see "Steps to Reproduce (for bugs)" above
AR @cniackz please test this in OpenShift and see if you can reproduce; then talk to our expert @pjuarezd and get some advice/help.
Also, maybe these can be of help:
https://github.com/minio/operator/pull/1971 https://github.com/minio/operator/pull/1973
because https://github.com/minio/operator/blob/master/docs/cert-manager.md#create-operator-ca-tls-secret is, by design, a manual step we need to perform on every renewal...
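For context, that manual step boils down to recreating a secret named operator-ca-tls in the operator's namespace from the issuing CA. A rough sketch, assuming the CA certificate has already been exported to a local ca.crt and that the operator runs in the minio-operator namespace:
$ kubectl -n minio-operator delete secret operator-ca-tls --ignore-not-found
$ kubectl -n minio-operator create secret generic operator-ca-tls --from-file=ca.crt=./ca.crt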
Hey guys, I have an idea. Why are you using requestAutoCert: true here? Based on my testing, when using cert-manager, you should disable it:
Check this out:
spec:
  ## Disable default TLS certificates.
  requestAutoCert: false
Could you please try disabling it and using our example, or something similar? This will still require manual steps during the process, but at least you won't rely on Operator certificates anymore, only on cert-manager. Once we have a working solution for the rotation, this shouldn't cause any further problems.
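For reference, the docs example issues the cluster-internal names from a private CA issuer instead of ACME, so the .svc.cluster.local SANs are not rejected there. A minimal sketch (issuer and secret names are placeholders; the dnsNames mirror the SANs from the certificate dump above):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: miniotenant-internal-tls
  namespace: cld-1225
spec:
  secretName: miniotenant-internal-tls
  issuerRef:
    name: my-private-ca-issuer   # placeholder: a CA/ClusterIssuer, not an ACME issuer
    kind: ClusterIssuer
  dnsNames:
    - 'minio.cld-1225.svc.cluster.local'
    - '*.miniotenant-hl.cld-1225.svc.cluster.local'
    - '*.cld-1225.svc.cluster.local'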
Also, if you get a chance, please try the ideas from the following PRs and let us know if they work for you in OpenShift:
- https://github.com/minio/operator/pull/1971
- https://github.com/minio/operator/pull/1973
Hi @cniackz,
the reason we are merging the externalCertSecret (issued by a cert-manager instance) and requestAutoCert (issued by the MinIO Operator) is that you cannot request certificates for cluster-internal domain names (.svc.cluster.local) via cert-manager's Let's Encrypt (ACME) issuer:
"The certificate request has failed to complete and will be retried: Failed to wait for order resource "minio-console-cld-1225.apps.<openshift-cluster>.com-kwmf5-4021417717" to become ready: order is in "errored" state: Failed to create Order: 400 urn:ietf:params:acme:error:rejectedIdentifier: Error creating new order :: Cannot issue for "*.cld-1225.svc.cluster.local": Domain name does not end with a valid public suffix (TLD) (and 2 more problems. Refer to sub-problems for more information.); subproblems:\n\turn:ietf:params:acme:error:malformed: [dns: *.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)\n\turn:ietf:params:acme:error:malformed: [dns: *.minio.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)\n\turn:ietf:params:acme:error:malformed: [dns: *.miniotenant-hl.cld-1225.svc.cluster.local] Error creating new order :: Domain name does not end with a valid public suffix (TLD)
So in order to get valid certificates for inter-pod communication (meaning between multiple MinIO cluster member pods), we need requestAutoCert: true.