[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15

Open blueish-eyez opened this issue 1 year ago • 28 comments

Describe the bug

I had a working cluster free of errors; however, after upgrading to 2.15 (also tested with 2.16) I'm getting a large number of the following error on worker nodes:

[2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [opensearch-0] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
	at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

This message is always identical and I cannot pinpoint it to a specific action performed by OpenSearch. The only thing that's also repeatedly logged is:

[INFO ][o.o.j.s.JobSweeper       ] [opensearch-0] Running full sweep

But I feel it's a long shot.

I did not see this problem on 2.11; it only started appearing after upgrading. Has anyone else experienced this? Is there any direction you could recommend other than verifying certificates (already done: none are expired, and the CAs and keys are all in place)? Is there perhaps a way to connect an SSL exception with the specific task that caused it? Was it communication from a specific node, maybe the master node? Was it related to snapshots, or something else?

Any help is highly appreciated!

Related component

Other

To Reproduce

  1. Set up a working cluster on 2.11 with the Security plugin
  2. Upgrade the cluster to 2.15
  3. Check worker node logs for SSL exceptions

Expected behavior

No SSL exceptions post upgrade

Additional Details

Plugins: Security plugin

Host/Environment (please complete the following information):

  • Docker image based OpenSearch 2.15

blueish-eyez avatar Aug 16 '24 10:08 blueish-eyez

@blueish-eyez I tried this locally but couldn't reproduce it. Here is what I did:

  1. Create a 2.11 OpenSearch cluster with a docker-compose file and index several documents into an index.
  2. Change the OpenSearch version in the docker-compose file to 2.16 and restart (a minimal sketch of such a compose file is below). Can you share steps that consistently reproduce the issue?
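
For reference, a minimal single-node sketch of the kind of compose file this assumes (an approximation, not the exact file used); the "upgrade" step is simply bumping the image tag and re-running docker-compose up -d:

# docker-compose.yml (sketch; tags, password, and volume name are placeholders)
services:
  opensearch:
    image: opensearchproject/opensearch:2.11.1   # change to 2.16.0 for the upgrade step
    environment:
      - discovery.type=single-node
      # Required once the image is 2.12.0 or later:
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=<strong-admin-password>
    volumes:
      - opensearch-data:/usr/share/opensearch/data   # keep the indexed documents across the version bump
    ports:
      - "9200:9200"
volumes:
  opensearch-data: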

zane-neo avatar Aug 19 '24 03:08 zane-neo

Thinking about the larger picture, there are plenty of additional points of influence here: it's a production-grade cluster (a sandbox, but with the same principles of access and usage). There are Java applications writing directly to the cluster. I tried setting up docker-compose and upgrading, but I could not reproduce it in that scenario either. I also cannot reproduce it on another regular production-grade cluster that uses a traditional filebeat-logstash-opensearch write pipeline.

Is it possible to decode [2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport], specifically the "o.o.h.n.s" bit, as mentioned in https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/? For example, o.o.i.r stands for logger.org.opensearch.index.reindex. Perhaps if I knew where the call was made from, I could pinpoint the faulty module.

blueish-eyez avatar Aug 19 '24 09:08 blueish-eyez

o.o.h.n.s stands for org.opensearch.http.netty4.ssl. You can try changing the log level of this package, or of the class org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport, to see if more details can be found.
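
For example, something along these lines (a sketch assuming the default admin user, the self-signed demo certificates, and port 9200; adjust credentials and host to your setup):

# Raise verbosity for the secure HTTP transport package via the cluster settings API
# (transient, so it is reset on a full cluster restart)
curl -k -u admin:<admin-password> -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"logger.org.opensearch.http.netty4.ssl":"DEBUG"}}'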

zane-neo avatar Aug 20 '24 03:08 zane-neo

Hi. I'm experiencing the same issue. In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

The strange thing is that it's happening exactly once every 10 hours, and almost exclusively on the coordinating nodes.

[screenshot from the original issue showing the recurring error log entries]

This happens even when no indexing or query requests are being sent.

I've created and tested clusters with different versions, and the same issue doesn't happen with 2.13.0 and earlier, but only with 2.14.0 and later.

  • 2.12.0: not occurred
  • 2.13.0: not occurred
  • 2.14.0: occurred
  • 2.16.0: occurred

So, I'm not sure, but I suspect it's related to this commit.

machenity avatar Sep 02 '24 04:09 machenity

Hi. The same phenomenon occurs for me, and the issue seems to exist from v2.14.0 onward.

10000-ki avatar Sep 02 '24 05:09 10000-ki

We still need to narrow this down and get an easy repro; maybe @stephen-crawford can help us here?

dblock avatar Sep 02 '24 14:09 dblock

@dblock I will also take a look

reta avatar Sep 03 '24 12:09 reta

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

@blueish-eyez do you see a periodic pattern as well, or is it completely random?

reta avatar Sep 03 '24 17:09 reta

OK folks, I think the mystery is resolved. TL;DR: no functional regressions have been introduced.

So, pre-2.14.0, the secure HTTP transport didn't log any exceptions (please see https://github.com/opensearch-project/security/blob/2.13.0.0/src/main/java/org/opensearch/security/ssl/OpenSearchSecuritySSLPlugin.java#L270, where the error handler was set to NOOP).

In 2.14.0 and later, the handler was switched from NOOP to one that logs (please see https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L2125), which is why the exceptions are appearing now.

The takeaway is that these exceptions were present before but were silently swallowed.
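
Purely as a conceptual sketch (these are not the actual plugin classes or signatures), the behavioral difference amounts to swapping a no-op exception consumer for one that logs:

import java.util.function.Consumer;

public class SslErrorHandlerSketch {
    public static void main(String[] args) {
        Throwable error = new java.net.SocketException("Connection reset");

        // Pre-2.14.0 (conceptually): the handler swallowed the exception.
        Consumer<Throwable> noopHandler = t -> { /* intentionally ignored */ };

        // 2.14.0 and later (conceptually): the very same exception is now logged.
        Consumer<Throwable> loggingHandler =
            t -> System.err.println("Exception during establishing a SSL connection: " + t);

        noopHandler.accept(error);     // nothing shows up in the log
        loggingHandler.accept(error);  // the familiar ERROR line shows up
    }
}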

reta avatar Sep 03 '24 17:09 reta

@reta Thanks! There's still something causing these errors that's not supposed to, no?

dblock avatar Sep 04 '24 13:09 dblock

@reta Thanks! There's still something causing these errors that's not supposed to, no?

Thanks @dblock, correct: these errors are caused by clients closing the connection abruptly. It is not possible to pinpoint the exact reason, but a few common ones are (a minimal simulation sketch follows the list):

  • proxy / gateways in the middle may close the connection
  • application / service crash that abruptly closes the connection
  • ...
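
As an illustration only, here is a minimal, hypothetical sketch that forces such an abrupt client-side close (a TCP RST via SO_LINGER set to zero) against a node's HTTPS port; the host and port are placeholders, and this is just one way such resets can be produced, not necessarily what is happening in your environments:

import java.net.Socket;

public class AbruptCloseSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point it at any node's HTTPS port.
        try (Socket socket = new Socket("opensearch-host", 9200)) {
            // SO_LINGER with a zero timeout makes close() send a TCP RST
            // instead of a normal FIN, i.e. an abrupt client-side reset.
            socket.setSoLinger(true, 0);
        } // the implicit close() here resets the connection
    }
}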

reta avatar Sep 04 '24 13:09 reta

@reta thanks for the debug!

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

My clusters were provisioned on top of our private cloud, in-house Kubernetes, so all coordinating nodes were behind the LoadBalancer service. I asked our infrastructure team if there is any monitoring job involved.

Thanks again 🙇

machenity avatar Sep 04 '24 14:09 machenity

I checked it again, and the 10-hour pattern appears only on lightly used clusters. On heavily used clusters, these logs occur randomly.

machenity avatar Sep 05 '24 06:09 machenity

Hi @reta, I restarted one of the worker nodes and collected sample stdout for roughly 30 minutes. The SSL exception appears roughly every 5-30 seconds, unfortunately to the point where the logs become unusable after days of uptime. Is there a way to suppress this? Please also note that while others mentioned they are seeing this on master nodes, I'm seeing it only on worker nodes.

blueish-eyez avatar Sep 05 '24 10:09 blueish-eyez

The SSL exception appears roughly every 5-30 seconds, unfortunately to the point where the logs become unusable after days of uptime. Is there a way to suppress this?

Thanks @blueish-eyez, yes, I will be working on restoring the previous behavior in which such exceptions were swallowed. @dblock could you please transfer this ticket to the security plugin repo?

reta avatar Sep 05 '24 12:09 reta

[Catch All Triage - 1, 2, 3, 4, 5]

dblock avatar Sep 09 '24 16:09 dblock

@blueish-eyez the problem with sslExceptionHandler should be fixed in 2.18.0, but it may not get rid of these exceptions in the logs; e.g. I could reproduce (by simulation) those on 2.11.0:

[2024-09-10T19:45:25,650][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [21140679ffc1] Exception during establishing a SSL connection: java.net.SocketException: Connection reset                                                                                                                                                           
java.net.SocketException: Connection reset                                                                                                                                                                                                                                                                                                            
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]                                                                                                                                                                                                                                                       
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]                                                                                                                                                                                                                                                                       
        at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]

By any chance, did you (re)configure logging on your clusters? One notable difference between 2.11.x and 2.15.x is that the transports are now in a different package.

reta avatar Sep 10 '24 19:09 reta

@reta thanks for the update. Logging is configured at the INFO level, and I believe this was the baseline. No specific packages have elevated or suppressed logging (except for the test that @zane-neo suggested in the first response, which, however, did not show any relevant messages around the SSL exceptions). For the time being I can set org.opensearch.http.netty4.ssl to FATAL until I can get the 2.18 image to test with (roughly as sketched below).
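
For anyone needing the same stopgap, this is roughly what I mean; a sketch for the node's log4j2.properties (the logger id is arbitrary):

# config/log4j2.properties (sketch): only emit FATAL and above for the secure HTTP transport package
logger.suppresshttpssl.name = org.opensearch.http.netty4.ssl
logger.suppresshttpssl.level = fatal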

blueish-eyez avatar Sep 11 '24 13:09 blueish-eyez

FYI: we see the same exceptions (currently running 2.17, didn't check when the exceptions were introduced in the logs).

A very simple way to reproduce it (and how we noticed the exceptions in the first place): try the Nagios check_http plugin.

# /usr/lib/nagios/plugins/check_http -H opensearch-host01.int.lan --port=9200 --ssl --certificate=15
OK - Certificate 'opensearch-host01.int.lan' will expire on Sun 17 Nov 2024 06:05:05 AM GMT +0000.

Running the check will immediately trigger an exception. We didn't change anything in our monitoring and the check_http script wasn't updated for a long time either.

rlueckl avatar Oct 16 '24 11:10 rlueckl

@rlueckl thanks, could you reproduce it with a docker setup? I cannot; the log is clean:

$ docker run -it -p 9200:9200 -p 9600:9600 -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=_ad0m#Ns_ -e "discovery.type=single-node"  opensearchproject/opensearch:2.17.1
$ /usr/local/nagios/libexec/check_http  -H localhost --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 12:03 -0500/EST.

reta avatar Oct 16 '24 19:10 reta

Hi @reta, which version of check_http are you using? I've tested the docker image on my workstation but could not reproduce the issue; however, my Linux Mint has a slightly older version of check_http than the servers do.

  • check_http 2.3.1 (workstation) -> docker 2.17.0 (workstation) -> no exception
  • check_http 2.3.1 (workstation) -> docker 2.17.1 (workstation) -> no exception
  • check_http 2.3.1 (workstation) -> deb install 2.17.0 (server) -> no exception
  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
  • check_http 2.3.3 (server) -> docker 2.17.x (workstation) -> can't test, firewall between the server and my workstation

The last test would be interesting if you could test with check_http 2.3.3 and the docker image to see if there's an exception. Unfortunately I'm somewhat limited because of our company firewalls and can't test this case.

EDIT: also tested with check_http 2.4.9 -> no exception. So it seems that the exception is (only) triggered by check_http 2.3.3 (and maybe by other versions; I want to test 2.3.5 as soon as I have some time).

rlueckl avatar Oct 17 '24 06:10 rlueckl

Hi @reta, which version of check_http are you using?

Thanks @rlueckl

$ /usr/local/nagios/libexec/check_http  --version
check_http v2.4.11 (nagios-plugins 2.4.11)

I will try this configuration now, will update you shortly:

  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception

UPD: Debian Bookworm + OpenSearch 2.17.1 (on Docker):

root@c2d1f85d6cd6:/tmp/nagios-plugins-2.3.3# /usr/local/nagios/libexec/check_http -H c2d1f85d6cd6 --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 17:03 +0000/UTC.

Log is clean :(

reta avatar Oct 17 '24 13:10 reta

Interesting. I can't reproduce the exception with 2.3.3 from my workstation, but running the check locally on the server I get the exception.

  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
  • check_http 2.3.3 (workstation) -> deb install 2.17.0 (server) -> no exception

rlueckl avatar Oct 18 '24 07:10 rlueckl

I encountered the same error, and after checking, I found the following:

  • The error occurs sporadically 0 to 3 times per hour, but only on the master node.
  • The error happens regardless of the version (OpenSearch 2.13.0, 2.14.0, 2.16.0).
  • The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

https://forum.opensearch.org/t/ssl-exception-connection-reset-error-on-master-nodes/22281/5

kkoki-ctrl avatar Nov 08 '24 02:11 kkoki-ctrl

@reta I see that you are assigned to this issue and that the relevant PR (#4725) has been merged. Can we close this issue?

DarshitChanpura avatar Feb 17 '25 16:02 DarshitChanpura

@reta I see that you are assigned to this issue and that the relevant PR (#4725) has been merged. Can we close this issue?

Yeah, I think we could close this issue for now (no more reports), thank you @DarshitChanpura

reta avatar Feb 18 '25 00:02 reta

I still get this error in 2.19.0 if I, e.g., use curl to GET / on OpenSearch running within Kubernetes (specifically OpenShift). Namely, I do get it when I use curl from my PC (calling a k8s Ingress / OpenShift Route), but I do not get it when I use curl within the same pod or from another pod in the same namespace in k8s.

What's weird is that the error appears to be about the transport layer while I'm calling the normal REST layer. Also, I get the error even when running with a single node; why would that even trigger a transport-layer call?

log entry:

{"type": "server", "timestamp": "2025-02-18T13:56:12,003Z", "level": "ERROR", "component": "o.o.h.n.s.SecureNetty4HttpServerTransport", "cluster.name": "test-opensearch", "node.name": "test-opensearch-0", "message": "Exception during establishing a SSL connection: java.net.SocketException: Connection reset", "cluster.uuid": "PtisTOKQRbWzJ2jTdNGsQQ", "node.id": "t4q2XqESQgmbMvBinYxBFQ" ,
"stacktrace": ["java.net.SocketException: Connection reset",
"at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]",
"at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]",
"at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.0.jar:2.19.0]",
"at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.0.jar:2.19.0]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.117.Final.jar:4.1.117.Final]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.117.Final.jar:4.1.117.Final]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$PrivilegedOpenSearchThreadFactory$1.lambda$run$0(OpenSearchExecutors.java:454) [opensearch-2.19.0.jar:2.19.0]",
"at java.base/java.security.AccessController.doPrivileged(AccessController.java:319) [?:?]",
"at org.opensearch.common.util.concurrent.OpenSearchExecutors$PrivilegedOpenSearchThreadFactory$1.run(OpenSearchExecutors.java:453) [opensearch-2.19.0.jar:2.19.0]",
"at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]"] }

see also this slack conversation

rursprung avatar Feb 18 '25 16:02 rursprung

Hi there. Unfortunately, same problem here :(.

  • I can observe it on our previous setup with version 2.15.0 and on the new version 2.19.1
  • It occurs randomly; most of the cases are on coordinator-type nodes
  • It also occurs on a single multi-role node
  • The OpenSearch instances are in Kubernetes clusters (I have not tested this version in pure Docker on localhost)

I'm afraid that all the interesting information has already been provided in the previous comments, but if you need anything specific, please let me know.

EDIT: It seems that this assumption from @kkoki-ctrl is correct:

The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

I can confirm that when I stop the exporter (prometheuscommunity/elasticsearch-exporter:v1.8.0), the errors stop as well. Unfortunately, it is not possible to run an unmonitored OpenSearch cluster in the real world; not using the exporter isn't a solution for us.

LHozzan avatar Mar 06 '25 12:03 LHozzan