[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15

Open blueish-eyez opened this issue 1 year ago • 28 comments

Describe the bug

I had a working cluster free of errors; however, after upgrading to 2.15 (also tested with 2.16) I'm getting a large number of the following error on worker nodes:

[2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [opensearch-0] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
	at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

This message is always identical and I cannot pinpoint it to a specific action performed by OpenSearch. The only thing that's also repeatedly logged is:

[INFO ][o.o.j.s.JobSweeper       ] [opensearch-0] Running full sweep

But I feel it's a long shot.

I did not see this problem on 2.11; it only started appearing after upgrading. Has anyone else experienced this? Is there any direction you could recommend other than verifying certificates (already done: none are expired, and the CAs and keys are all in place)? Is there perhaps a way to connect an SSL exception with the specific task that caused it? Was it communication from a specific node, maybe the master node? Was it related to snapshots, or something else?

Any help is highly appreciated!

Related component

Other

To Reproduce

  1. Set up a working cluster on 2.11 with the Security plugin
  2. Upgrade the cluster to 2.15
  3. Check worker node logs for SSL exceptions

Expected behavior

No SSL exceptions post upgrade

Additional Details

Plugins: Security plugin

Host/Environment (please complete the following information):

  • Docker image based OpenSearch 2.15

blueish-eyez avatar Aug 16 '24 10:08 blueish-eyez

@blueish-eyez I tried this locally but couldn't reproduce it. Here is what I did:

  1. Create a 2.11 OpenSearch cluster with a docker-compose file and index several documents into an index.
  2. Change the OpenSearch version in the docker-compose file to 2.16 and restart (a minimal sketch of such a compose file is below). Can you share steps that consistently reproduce the issue?
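
For reference, a minimal single-node sketch of the kind of compose file this assumes (an approximation, not the exact file used); the "upgrade" step is simply bumping the image tag and re-running docker-compose up -d:

# docker-compose.yml (sketch; tags, password, and volume name are placeholders)
services:
  opensearch:
    image: opensearchproject/opensearch:2.11.1   # change to 2.16.0 for the upgrade step
    environment:
      - discovery.type=single-node
      # Required once the image is 2.12.0 or later:
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=<strong-admin-password>
    volumes:
      - opensearch-data:/usr/share/opensearch/data   # keep the indexed documents across the version bump
    ports:
      - "9200:9200"
volumes:
  opensearch-data: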

zane-neo avatar Aug 19 '24 03:08 zane-neo

Thinking about the larger picture, there are plenty of additional points of influence here: it's a production-grade cluster (a sandbox, but with the same principles of access and usage). There are Java applications writing directly to the cluster. I tried setting up docker-compose and upgrading, but I could not reproduce it in that scenario either. I also cannot reproduce it on another regular production-grade cluster that uses a traditional filebeat-logstash-opensearch write pipeline.

Is it possible to decode [2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport], specifically the "o.o.h.n.s" bit, as mentioned in https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/? For example, o.o.i.r stands for logger.org.opensearch.index.reindex. Perhaps if I knew where the call was made from, I could pinpoint the faulty module.

blueish-eyez avatar Aug 19 '24 09:08 blueish-eyez

o.o.h.n.s stands for org.opensearch.http.netty4.ssl. You can try changing the log level of this package, or of the class org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport, to see if more details can be found.
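
For example, something along these lines (a sketch assuming the default admin user, the self-signed demo certificates, and port 9200; adjust credentials and host to your setup):

# Raise verbosity for the secure HTTP transport package via the cluster settings API
# (transient, so it is reset on a full cluster restart)
curl -k -u admin:<admin-password> -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"logger.org.opensearch.http.netty4.ssl":"DEBUG"}}'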

zane-neo avatar Aug 20 '24 03:08 zane-neo

Hi. I'm experiencing the same issue. In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

The strange thing is that it's happening exactly once every 10 hours, and almost exclusively on the coordinating nodes.

[screenshot from the original issue showing the recurring error log entries]

This happens even when no indexing or query requests are being sent.

I've created and tested clusters with different versions, and the same issue doesn't happen with 2.13.0 and earlier, but only with 2.14.0 and later.

  • 2.12.0: not occurred
  • 2.13.0: not occurred
  • 2.14.0: occurred
  • 2.16.0: occurred

So, I'm not sure, but I suspect it's related to this commit.

machenity avatar Sep 02 '24 04:09 machenity

Hi. The same phenomenon occurs for me, and the issue seems to exist from v2.14.0 onward.

10000-ki avatar Sep 02 '24 05:09 10000-ki

We still need to narrow this down and get an easy repro; maybe @stephen-crawford can help us here?

dblock avatar Sep 02 '24 14:09 dblock

@dblock I will also take a look

reta avatar Sep 03 '24 12:09 reta

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

@blueish-eyez do you see a periodic pattern as well, or is it completely random?

reta avatar Sep 03 '24 17:09 reta

OK folks, I think the mystery is resolved. TL;DR: no functional regressions have been introduced.

So, pre-2.14.0, the secure HTTP transport didn't log any exceptions (please see https://github.com/opensearch-project/security/blob/2.13.0.0/src/main/java/org/opensearch/security/ssl/OpenSearchSecuritySSLPlugin.java#L270, where the error handler was set to NOOP).

In 2.14.0 and later, the handler was switched from NOOP to one that logs (please see https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L2125), which is why the exceptions are appearing now.

The takeaway is that these exceptions were present before but were silently swallowed.
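
Purely as a conceptual sketch (these are not the actual plugin classes or signatures), the behavioral difference amounts to swapping a no-op exception consumer for one that logs:

import java.util.function.Consumer;

public class SslErrorHandlerSketch {
    public static void main(String[] args) {
        Throwable error = new java.net.SocketException("Connection reset");

        // Pre-2.14.0 (conceptually): the handler swallowed the exception.
        Consumer<Throwable> noopHandler = t -> { /* intentionally ignored */ };

        // 2.14.0 and later (conceptually): the very same exception is now logged.
        Consumer<Throwable> loggingHandler =
            t -> System.err.println("Exception during establishing a SSL connection: " + t);

        noopHandler.accept(error);     // nothing shows up in the log
        loggingHandler.accept(error);  // the familiar ERROR line shows up
    }
}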

reta avatar Sep 03 '24 17:09 reta

@reta Thanks! There's still something causing these errors that's not supposed to, no?

dblock avatar Sep 04 '24 13:09 dblock

@reta Thanks! There's still something causing these errors that's not supposed to, no?

Thanks @dblock, correct: these errors are caused by clients closing the connection abruptly. It is not possible to pinpoint the exact reason, but a few common ones are (a minimal simulation sketch follows the list):

  • proxy / gateways in the middle may close the connection
  • application / service crash that abruptly closes the connection
  • ...
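
As an illustration only, here is a minimal, hypothetical sketch that forces such an abrupt client-side close (a TCP RST via SO_LINGER set to zero) against a node's HTTPS port; the host and port are placeholders, and this is just one way such resets can be produced, not necessarily what is happening in your environments:

import java.net.Socket;

public class AbruptCloseSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point it at any node's HTTPS port.
        try (Socket socket = new Socket("opensearch-host", 9200)) {
            // SO_LINGER with a zero timeout makes close() send a TCP RST
            // instead of a normal FIN, i.e. an abrupt client-side reset.
            socket.setSoLinger(true, 0);
        } // the implicit close() here resets the connection
    }
}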

reta avatar Sep 04 '24 13:09 reta

@reta thanks for the debug!

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

My clusters were provisioned on top of our private cloud, in-house Kubernetes, so all coordinating nodes were behind the LoadBalancer service. I asked our infrastructure team if there is any monitoring job involved.

Thanks again 🙇

machenity avatar Sep 04 '24 14:09 machenity

I checked it again, and the 10-hour pattern appears only on lightly used clusters. On heavily used clusters, these logs occur randomly.

machenity avatar Sep 05 '24 06:09 machenity

Hi @reta, I restarted one of the worker nodes and collected sample stdout for roughly 30 minutes. The SSL exception appears roughly every 5-30 seconds, unfortunately to the point where the logs become unusable after days of uptime. Is there a way to suppress this? Please also note that while others mentioned they are seeing this on master nodes, I'm seeing it only on worker nodes.

blueish-eyez avatar Sep 05 '24 10:09 blueish-eyez

The SSL exception appears roughly every 5-30 seconds, unfortunately to the point where the logs become unusable after days of uptime. Is there a way to suppress this?

Thanks @blueish-eyez, yes, I will be working on restoring the previous behavior in which such exceptions were swallowed. @dblock could you please transfer this ticket to the security plugin repo?

reta avatar Sep 05 '24 12:09 reta

[Catch All Triage - 1, 2, 3, 4, 5]

dblock avatar Sep 09 '24 16:09 dblock

@blueish-eyez the problem with sslExceptionHandler should be fixed in 2.18.0, but it may not get rid of these exceptions in the logs; e.g. I could reproduce (by simulation) those on 2.11.0:

[2024-09-10T19:45:25,650][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [21140679ffc1] Exception during establishing a SSL connection: java.net.SocketException: Connection reset                                                                                                                                                           
java.net.SocketException: Connection reset                                                                                                                                                                                                                                                                                                            
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]                                                                                                                                                                                                                                                       
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]                                                                                                                                                                                                                                                                       
        at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]

By any chance, did you (re)configure logging on your clusters? One notable difference between 2.11.x and 2.15.x is that the transports are now in a different package.

reta avatar Sep 10 '24 19:09 reta

@reta thanks for the update. Logging is configured at the INFO level, and I believe this was the baseline. No specific packages have elevated or suppressed logging (except for the test that @zane-neo suggested in the first response, which, however, did not show any relevant messages around the SSL exceptions). For the time being I can set org.opensearch.http.netty4.ssl to FATAL until I can get the 2.18 image to test with (roughly as sketched below).
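
For anyone needing the same stopgap, this is roughly what I mean; a sketch for the node's log4j2.properties (the logger id is arbitrary):

# config/log4j2.properties (sketch): only emit FATAL and above for the secure HTTP transport package
logger.suppresshttpssl.name = org.opensearch.http.netty4.ssl
logger.suppresshttpssl.level = fatal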

blueish-eyez avatar Sep 11 '24 13:09 blueish-eyez

FYI: we see the same exceptions (currently running 2.17, didn't check when the exceptions were introduced in the logs).

A very simple way to reproduce it (and how we noticed the exceptions in the first place): try the Nagios check_http plugin.

# /usr/lib/nagios/plugins/check_http -H opensearch-host01.int.lan --port=9200 --ssl --certificate=15
OK - Certificate 'opensearch-host01.int.lan' will expire on Sun 17 Nov 2024 06:05:05 AM GMT +0000.

Running the check will immediately trigger an exception. We didn't change anything in our monitoring and the check_http script wasn't updated for a long time either.

rlueckl avatar Oct 16 '24 11:10 rlueckl

@rlueckl thanks, could you reproduce it with a docker setup? I cannot; the log is clean:

$ docker run -it -p 9200:9200 -p 9600:9600 -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=_ad0m#Ns_ -e "discovery.type=single-node"  opensearchproject/opensearch:2.17.1
$ /usr/local/nagios/libexec/check_http  -H localhost --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 12:03 -0500/EST.

reta avatar Oct 16 '24 19:10 reta

Hi @reta, which version of check_http are you using? I've tested the docker image on my workstation but could not reproduce the issue; however, my Linux Mint has a slightly older version of check_http than the servers do.

  • check_http 2.3.1 (workstation) -> docker 2.17.0 (workstation) -> no exception
  • check_http 2.3.1 (workstation) -> docker 2.17.1 (workstation) -> no exception
  • check_http 2.3.1 (workstation) -> deb install 2.17.0 (server) -> no exception
  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
  • check_http 2.3.3 (server) -> docker 2.17.x (workstation) -> can't test, firewall between the server and my workstation

The last test would be interesting if you could test with check_http 2.3.3 and the docker image to see if there's an exception. Unfortunately I'm somewhat limited because of our company firewalls and can't test this case.

EDIT: also tested with check_http 2.4.9 -> no exception. So it seems that the exception is (only) triggered by check_http 2.3.3 (and maybe by other versions; I want to test 2.3.5 as soon as I have some time).

rlueckl avatar Oct 17 '24 06:10 rlueckl

Hi @reta, which version of check_http are you using?

Thanks @rlueckl

$ /usr/local/nagios/libexec/check_http  --version
check_http v2.4.11 (nagios-plugins 2.4.11)

I will try this configuration now, will update you shortly:

  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception

UPD: Debian Bookworm + OpenSearch 2.17.1 (on Docker):

root@c2d1f85d6cd6:/tmp/nagios-plugins-2.3.3# /usr/local/nagios/libexec/check_http -H c2d1f85d6cd6 --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 17:03 +0000/UTC.

Log is clean :(

reta avatar Oct 17 '24 13:10 reta

Interesting. I can't reproduce the exception with 2.3.3 from my workstation, but running the check locally on the server I get the exception.

  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
  • check_http 2.3.3 (workstation) -> deb install 2.17.0 (server) -> no exception

rlueckl avatar Oct 18 '24 07:10 rlueckl

I encountered the same error, and after checking, I found the following:

  • The error occurs sporadically 0 to 3 times per hour, but only on the master node.
  • The error happens regardless of the version (OpenSearch 2.13.0, 2.14.0, 2.16.0).
  • The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

https://forum.opensearch.org/t/ssl-exception-connection-reset-error-on-master-nodes/22281/5

kkoki-ctrl avatar Nov 08 '24 02:11 kkoki-ctrl

@reta I see that you are assigned to this issue and that the relevant PR (#4725) has been merged. Can we close this issue?

DarshitChanpura avatar Feb 17 '25 16:02 DarshitChanpura

@reta I see that you are assigned to this issue and that the relevant PR (#4725) has been merged. Can we close this issue?

Yeah, I think we could close this issue for now (no more reports), thank you @DarshitChanpura

reta avatar Feb 18 '25 00:02 reta

I still get this error in 2.19.0 if I, e.g., use curl to GET / on OpenSearch running within Kubernetes (specifically OpenShift). Namely, I do get it when I use curl from my PC (calling a k8s Ingress / OpenShift Route), but I do not get it when I use curl within the same pod or from another pod in the same namespace in k8s.

What's weird is that the error appears to be about the transport layer while I'm calling the normal REST layer. Also, I get the error even when running with a single node; why would that even trigger a transport-layer call?

log entry:

{"type": "server", "timestamp": "2025-02-18T13:56:12,003Z", "level": "ERROR", "component": "o.o.h.n.s.SecureNetty4HttpServerTransport", "cluster.name": "test-opensearch", "node.name": "test-opensearch-0", "message": "Exception during establishing a SSL connection: java.net.SocketException: Connection reset", "cluster.uuid": "PtisTOKQRbWzJ2jTdNGsQQ", "node.id": "t4q2XqESQgmbMvBinYxBFQ" ,
"stacktrace": ["java.net.SocketException: Connection reset",
"at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]",
"at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]",
"at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.0.jar:2.19.0]",
"at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.0.jar:2.19.0]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.117.Final.jar:4.1.117.Final]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$PrivilegedOpenSearchThreadFactory$1.lambda$run$0(OpenSearchExecutors.java:454) [opensearch-2.19.0.jar:2.19.0]",
"at java.base/java.security.AccessController.doPrivileged(AccessController.java:319) [?:?]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$PrivilegedOpenSearchThreadFactory$1.run(OpenSearchExecutors.java:453) [opensearch-2.19.0.jar:2.19.0]",
"at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]"] }

see also this slack conversation

rursprung avatar Feb 18 '25 16:02 rursprung

Hi there. Unfortunately, same problem here :(.

  • I can observe it on our previous setup with version 2.15.0 and on the new version 2.19.1
  • It occurs randomly; most of the cases are on coordinator-type nodes
  • It also occurs on a single multi-role node
  • The OpenSearch instances are in Kubernetes clusters (I have not tested this version in pure Docker on localhost)

I'm afraid that all the interesting information has already been provided in the previous comments, but if you need anything specific, please let me know.

EDIT: It seems that this assumption from @kkoki-ctrl is correct:

The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

I can confirm that when I stop the exporter (prometheuscommunity/elasticsearch-exporter:v1.8.0), the errors stop as well. Unfortunately, it is not possible to run an unmonitored OpenSearch cluster in the real world; not using the exporter isn't a solution for us.

LHozzan avatar Mar 06 '25 12:03 LHozzan