Zookeeper SSL Failures When Certificate Is Rolled
Describe the bug
Zookeeper doesn't handle SSL certificate rolling gracefully. Specifically, if a certificate is rolled, Zookeeper will continue to use the old, expired cert until it is restarted, which can lead to an outage because other components will be unable to communicate with it.
I'm not sure if this is an issue with the Pulsar Helm chart, or with Pulsar itself. If the latter, please let me know and I'll raise the issue there.
To Reproduce
Steps to reproduce the behavior (valid for Pulsar 2.9.2 using Helm chart 2.9.2):
- Deploy Pulsar into a K8s cluster using the Helm chart, with TLS enabled for Zookeeper and certificates managed by cert-manager
- Wait for the certificate to be rolled
- See connections to Zookeeper fail with "io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed ... Caused by: java.security.cert.CertificateExpiredException" (a diagnostic sketch after this list shows one way to confirm which certificate Zookeeper is actually serving)
- Restart the Pulsar pods and see that the errors go away
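To confirm that Zookeeper is still serving the old certificate while the renewed one is already on disk, you can open a TLS connection to the secure client port and print the peer certificate's notAfter date. The following is a minimal diagnostic sketch only; the hostname, port 2281, and the trust-all manager are assumptions for inspection (and it may also need a client keystore if mutual TLS is enforced), not something to use as a real client:

import javax.net.ssl.*;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

// Diagnostic only: prints the certificate Zookeeper presents on its TLS client port.
public class ShowZkServerCert {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "pulsar-zookeeper"; // assumed service name
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2281; // assumed secure client port

        // Trust-all manager so the handshake completes even though the cert is expired.
        TrustManager[] trustAll = { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        } };

        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());

        try (SSLSocket socket = (SSLSocket) ctx.getSocketFactory().createSocket(host, port)) {
            socket.startHandshake();
            X509Certificate cert = (X509Certificate) socket.getSession().getPeerCertificates()[0];
            System.out.println("Subject:  " + cert.getSubjectX500Principal());
            System.out.println("NotAfter: " + cert.getNotAfter()); // in the past => the expired cert is still loaded
        }
    }
}

If notAfter is in the past while the tls.crt mounted from the cert-manager secret has already been renewed, the keystore was never rebuilt after the roll.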
Expected behavior
Pulsar should continue to operate normally when a certificate is rolled.
I have the same issue, and Zookeeper keeps failing with the errors below:
2024-01-10T11:30:44,178+0000 [epollEventLoopGroup-7-1] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
2024-01-10T11:30:44,178+0000 [epollEventLoopGroup-7-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Exception caught
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_expired
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499) ~[io.netty-netty-codec-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[io.netty-netty-codec-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty-netty-transport-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[io.netty-netty-transport-classes-epoll-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_expired
    at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
    at sun.security.ssl.Alert.createSSLException(Alert.java:117) ~[?:?]
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:365) ~[?:?]
    at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293) ~[?:?]
    at sun.security.ssl.TransportContext.dispatch(TransportContext.java:204) ~[?:?]
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:172) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:736) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:691) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482) ~[?:?]
    at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:679) ~[?:?]
    at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:297) ~[io.netty-netty-handler-4.1.93.Final.jar:4.1.93.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353) ~[io.netty-netty-handler-4.1.93.Final.jar:4.1.93.Final]
I think the issue here is that although the Pulsar Helm chart sets the zookeeper.client.certReload property, this isn't enough. All that property does is make Zookeeper reload the certificates when the truststore or keystore files change. When cert-manager renews the certificates, the cert files in /pulsar/certs/zookeeper/ are updated, but nothing regenerates the keystore from them, so Zookeeper never sees a change.
The other Pulsar components (e.g. the bookie) solve this by having code inside them that watches the files under /pulsar/certs/ and then updates the keystore accordingly. Zookeeper has no such code, so as far as I can tell the certs will never be refreshed; a rough sketch of that file-watching approach is below.
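For illustration, here is what such a watcher could look like, written as a standalone helper rather than a change inside Zookeeper: it watches the PEM files that cert-manager writes and rebuilds a PKCS12 keystore from them, which the certReload property should then pick up. The paths, keystore location, password, and alias are assumptions, and it assumes an RSA key in PKCS#8 PEM form (cert-manager's privateKey.encoding: PKCS8); it is only meant to show the shape of the approach, not a fix the project has adopted:

import java.nio.file.*;
import java.security.KeyFactory;
import java.security.KeyStore;
import java.security.PrivateKey;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.security.spec.PKCS8EncodedKeySpec;
import java.util.Base64;
import static java.nio.file.StandardWatchEventKinds.*;

// Sketch: rebuild a PKCS12 keystore whenever cert-manager rotates the PEM files.
public class ZkCertWatcher {
    static final Path CERT_DIR = Paths.get("/pulsar/certs/zookeeper");        // assumed mount path
    static final Path KEYSTORE = Paths.get("/pulsar/zookeeper.keystore.p12"); // assumed keystore path
    static final char[] PASSWORD = "changeit".toCharArray();                  // assumed password

    public static void main(String[] args) throws Exception {
        rebuildKeystore();
        WatchService watcher = FileSystems.getDefault().newWatchService();
        // cert-manager swaps symlinks when it renews, so watch the whole directory.
        CERT_DIR.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take();
            key.pollEvents();   // drain events; any change triggers a rebuild
            rebuildKeystore();  // certReload should notice the keystore file changing
            if (!key.reset()) break;
        }
    }

    static void rebuildKeystore() throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate[] chain;
        try (var in = Files.newInputStream(CERT_DIR.resolve("tls.crt"))) {
            chain = cf.generateCertificates(in).toArray(new Certificate[0]);
        }
        PrivateKey key = loadPkcs8Key(CERT_DIR.resolve("tls.key"));

        KeyStore ks = KeyStore.getInstance("PKCS12");
        ks.load(null, null);
        ks.setKeyEntry("zookeeper", key, PASSWORD, chain);
        try (var out = Files.newOutputStream(KEYSTORE)) {
            ks.store(out, PASSWORD);
        }
        System.out.println("Keystore rebuilt from " + CERT_DIR);
    }

    // Assumes an RSA key in PKCS#8 PEM form ("BEGIN PRIVATE KEY").
    static PrivateKey loadPkcs8Key(Path pemFile) throws Exception {
        String pem = Files.readString(pemFile)
                .replace("-----BEGIN PRIVATE KEY-----", "")
                .replace("-----END PRIVATE KEY-----", "")
                .replaceAll("\\s", "");
        byte[] der = Base64.getDecoder().decode(pem);
        return KeyFactory.getInstance("RSA").generatePrivate(new PKCS8EncodedKeySpec(der));
    }
}

In practice it would probably make more sense to fix this inside Zookeeper itself (or run something like this as a sidecar sharing the cert volume), but the keystore-rebuild step is the piece that is currently missing.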
I am encountering the same issue with version 3.3.0 of the Helm chart. The Pulsar pods threw an SSL exception ("notAfter: 15.04.2024").
Restarting the pods solved the issue.