nifikop
nifikop copied to clipboard
[Feature/Certificate] Restart node when certificate updated
Q | A |
---|---|
Bug fix? | no |
New feature? | yes |
API breaks? | no |
Deprecations? | no |
License | Apache 2.0 |
What's in this PR?
This PR aims at creating a new field in the struct NodeState called CertificateExpireDate, and then automate the Nifi cluster rolling upgrade when the certificate is renewed (handle by the cert manager).
The idea is that we will give the field CertificateExpireDate the value of the current certificate renewal date, during each reconcile we will check if the CertificateExpireDate field has the same value as the current certificate renewal date, if no, we will do a rolling upgrade of the cluster and then assign the new renewal date to each node.
Why?
When the certificate handle by the cert manager was renewed it was necessary to restart the Nifi nodes manually to be able to access the UI.
Additional context
Checklist
- [x] Implementation tested
- [x] Error handling code meets the guideline
- [x] Logging code meets the guideline
- [x] User guide and development docs updated (if needed)
- [x] Append changelog with changes
ToDo
- [x] Test implementation after rebase
Have you tried out the certificate auto-reload feature that NiFi now supports out of the box?
Link
If you set nifi.security.autoreload.enabled
in nifi.properties
, any changes to the JKS keystore get auto-loaded without a restart.
Admittedly, I've not tried this in a k8s environment though - so might not be as slick as advertised here!
It might be worth testing out that workflow, before getting the operator to do the cert management via this PR?
Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535
I never heard about that so yes I'll try to use the feature before going further with this PR.
Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535
I never heard about that so yes I'll try to use the feature before going further with this PR.
If cert-manager automatically renews certificates and the configured Secrets get updated, then Kubernetes will automatically update the volumes mounted to the running Pods:
https://kubernetes.io/docs/concepts/configuration/secret/#mounted-secrets-are-updated-automatically
If this is the case, then we might be able to take advantage of that feature of NiFi automatically. There's one condition where Kubernetes will not automatically update Secrets mounted as volumes in Pods when a mount subPath
is used. I double checked and made sure nifikop doesn't set this:
https://github.com/konpyutaika/nifikop/blob/88743195e24bbf7d0c02ab3a238dd2dfb01a544b/pkg/resources/nifi/pod.go#L308-L342
After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled
is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877
After further investigations with the help of @juldrixx we are not sure that the option
nifi.security.autoreload.enabled
is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877
We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices
attached to controller services or processors. We think they could, but that's not how it's currently written.
However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService
in deployed NifiDataflow
. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.
After further investigations with the help of @juldrixx we are not sure that the option
nifi.security.autoreload.enabled
is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any
SSLContextServices
attached to controller services or processors. We think they could, but that's not how it's currently written.However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with
SSLContextService
in deployedNifiDataflow
. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.
The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.
After further investigations with the help of @juldrixx we are not sure that the option
nifi.security.autoreload.enabled
is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any
SSLContextServices
attached to controller services or processors. We think they could, but that's not how it's currently written. However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors withSSLContextService
in deployedNifiDataflow
. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.
If i'm not mistaken, accessing the UI is what the nifi.security.autoreload.enabled
feature allows for when certs are rotated. NiFI's Jetty web server uses the FrameworkServerConnectorFactory
, which reloads certs from disk, to create connections. It uses Jetty's KeyStoreScanner
and TrustStoreScanner
to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.
I have recreated the problem. I set a Duration
of 1h and a RenewBefore
of 5m, so the certificates expire after 1h. When the certificates have expired, the user interface is not accessible and the operator cannot communicate with NiFi. Whether the nifi.security.autoreload.enabled
function is enabled or not makes no difference.
After further investigations with the help of @juldrixx we are not sure that the option
nifi.security.autoreload.enabled
is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any
SSLContextServices
attached to controller services or processors. We think they could, but that's not how it's currently written. However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors withSSLContextService
in deployedNifiDataflow
. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.
If i'm not mistaken, accessing the UI is what the
nifi.security.autoreload.enabled
feature allows for when certs are rotated. NiFI's Jetty web server uses theFrameworkServerConnectorFactory
, which reloads certs from disk, to create connections. It uses Jetty'sKeyStoreScanner
andTrustStoreScanner
to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.
I dug around trying to figure out why it wasn't working.
When you mount the secret in the pod via a volume, it generates a symlink pyramid where the first symlink truststore.jks
points to a second symlink ..data/truststore.jks
which points to ..2022_08_31_19_45_44.079808492/truststore.jks
. In the nifi.properties
, it is told to look at truststore.jks
but in reality the scan that tracks the state of the file monitors ..2022_08_31_19_45_44.079808492/truststore.jks
. Now when the Certificate is updated, the secret is also updated and then reinjected into the pods but not under ..2022_08_31_19_45_44.079808492/truststore.jks
but under another timestamp ..<NEW_TIMESTAMP>/truststore.jks
. From the point of view of the symlink we specify in the nifi.properties
this update is transparent but not to the scan that monitors the file. For him it has not been updated but deleted and stops monitoring it.
So it seems that the function can't be used unless you know a method to delete this timestamp directory (I didn't find it).
Thank you for spending the time to track this down. That makes sense. It looks like Jetty has a unit test that specifically handles symlink targets. I wonder if it's in a newer version of Jetty?
https://github.com/eclipse/jetty.project/blob/cb127793e5d8b5c5730b964392a9a905ba49191d/tests/test-integration/src/test/java/org/eclipse/jetty/test/KeyStoreScannerTest.java#L208-L238
More details here https://github.com/eclipse/jetty.project/pull/5042
I've talked to the folks at NiFi and @michael81877, and it doesn't seem like this is expected behavior (link to discussion). I have opened a ticket at NiFi to have this fixed. Now the question is what to do with the PR. We can keep the feature but make it optional to have a system that would work in NiFi versions without the fix, or we drop it and hope that on the NiFi side the fix is done for the next release and that it fixes the problem.
Yeah perhaps we can still pursue node restarts and make that a flag as a part of the SSL config. Then it wouldn't matter what NiFi version the operator is controlling. And then starting from whatever version of NiFi has the fix, the restarts are no longer necessary.
@erdrix @michael81877 ready for another review 😄