nifikop icon indicating copy to clipboard operation
nifikop copied to clipboard

[Feature/Certificate] Restart node when certificate updated

Open PLsergent opened this issue 2 years ago • 13 comments

Q A
Bug fix? no
New feature? yes
API breaks? no
Deprecations? no
License Apache 2.0

What's in this PR?

This PR aims at creating a new field in the struct NodeState called CertificateExpireDate, and then automate the Nifi cluster rolling upgrade when the certificate is renewed (handle by the cert manager).

The idea is that we will give the field CertificateExpireDate the value of the current certificate renewal date, during each reconcile we will check if the CertificateExpireDate field has the same value as the current certificate renewal date, if no, we will do a rolling upgrade of the cluster and then assign the new renewal date to each node.

Why?

When the certificate handle by the cert manager was renewed it was necessary to restart the Nifi nodes manually to be able to access the UI.

Additional context

Checklist

  • [x] Implementation tested
  • [x] Error handling code meets the guideline
  • [x] Logging code meets the guideline
  • [x] User guide and development docs updated (if needed)
  • [x] Append changelog with changes

ToDo

  • [x] Test implementation after rebase

PLsergent avatar Aug 24 '22 15:08 PLsergent

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? Link If you set nifi.security.autoreload.enabled in nifi.properties, any changes to the JKS keystore get auto-loaded without a restart. Admittedly, I've not tried this in a k8s environment though - so might not be as slick as advertised here!

It might be worth testing out that workflow, before getting the operator to do the cert management via this PR?

r65535 avatar Aug 24 '22 16:08 r65535

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535

I never heard about that so yes I'll try to use the feature before going further with this PR.

PLsergent avatar Aug 26 '22 07:08 PLsergent

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535

I never heard about that so yes I'll try to use the feature before going further with this PR.

If cert-manager automatically renews certificates and the configured Secrets get updated, then Kubernetes will automatically update the volumes mounted to the running Pods:

https://kubernetes.io/docs/concepts/configuration/secret/#mounted-secrets-are-updated-automatically

If this is the case, then we might be able to take advantage of that feature of NiFi automatically. There's one condition where Kubernetes will not automatically update Secrets mounted as volumes in Pods when a mount subPath is used. I double checked and made sure nifikop doesn't set this:

https://github.com/konpyutaika/nifikop/blob/88743195e24bbf7d0c02ab3a238dd2dfb01a544b/pkg/resources/nifi/pod.go#L308-L342

mh013370 avatar Aug 26 '22 10:08 mh013370

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

PLsergent avatar Aug 30 '22 15:08 PLsergent

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.

However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

mh013370 avatar Aug 31 '22 09:08 mh013370

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.

However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

juldrixx avatar Aug 31 '22 10:08 juldrixx

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written. However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

If i'm not mistaken, accessing the UI is what the nifi.security.autoreload.enabled feature allows for when certs are rotated. NiFI's Jetty web server uses the FrameworkServerConnectorFactory, which reloads certs from disk, to create connections. It uses Jetty's KeyStoreScanner and TrustStoreScanner to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.

mh013370 avatar Aug 31 '22 10:08 mh013370

I have recreated the problem. I set a Duration of 1h and a RenewBefore of 5m, so the certificates expire after 1h. When the certificates have expired, the user interface is not accessible and the operator cannot communicate with NiFi. Whether the nifi.security.autoreload.enabled function is enabled or not makes no difference.

juldrixx avatar Aug 31 '22 16:08 juldrixx

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written. However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

If i'm not mistaken, accessing the UI is what the nifi.security.autoreload.enabled feature allows for when certs are rotated. NiFI's Jetty web server uses the FrameworkServerConnectorFactory, which reloads certs from disk, to create connections. It uses Jetty's KeyStoreScanner and TrustStoreScanner to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.

I dug around trying to figure out why it wasn't working. When you mount the secret in the pod via a volume, it generates a symlink pyramid where the first symlink truststore.jks points to a second symlink ..data/truststore.jks which points to ..2022_08_31_19_45_44.079808492/truststore.jks. In the nifi.properties, it is told to look at truststore.jks but in reality the scan that tracks the state of the file monitors ..2022_08_31_19_45_44.079808492/truststore.jks. Now when the Certificate is updated, the secret is also updated and then reinjected into the pods but not under ..2022_08_31_19_45_44.079808492/truststore.jks but under another timestamp ..<NEW_TIMESTAMP>/truststore.jks. From the point of view of the symlink we specify in the nifi.properties this update is transparent but not to the scan that monitors the file. For him it has not been updated but deleted and stops monitoring it.

So it seems that the function can't be used unless you know a method to delete this timestamp directory (I didn't find it).

juldrixx avatar Aug 31 '22 21:08 juldrixx

Thank you for spending the time to track this down. That makes sense. It looks like Jetty has a unit test that specifically handles symlink targets. I wonder if it's in a newer version of Jetty?

https://github.com/eclipse/jetty.project/blob/cb127793e5d8b5c5730b964392a9a905ba49191d/tests/test-integration/src/test/java/org/eclipse/jetty/test/KeyStoreScannerTest.java#L208-L238

More details here https://github.com/eclipse/jetty.project/pull/5042

mh013370 avatar Sep 01 '22 11:09 mh013370

I've talked to the folks at NiFi and @michael81877, and it doesn't seem like this is expected behavior (link to discussion). I have opened a ticket at NiFi to have this fixed. Now the question is what to do with the PR. We can keep the feature but make it optional to have a system that would work in NiFi versions without the fix, or we drop it and hope that on the NiFi side the fix is done for the next release and that it fixes the problem.

juldrixx avatar Sep 01 '22 14:09 juldrixx

Yeah perhaps we can still pursue node restarts and make that a flag as a part of the SSL config. Then it wouldn't matter what NiFi version the operator is controlling. And then starting from whatever version of NiFi has the fix, the restarts are no longer necessary.

mh013370 avatar Sep 01 '22 14:09 mh013370

@erdrix @michael81877 ready for another review 😄

PLsergent avatar Dec 15 '22 14:12 PLsergent