
Error in `node': corrupted double-linked list (TLS / SSL crashes)

Open theogravity opened this issue 5 years ago • 14 comments

Problem description

App crashes with a core trace. Unsure what the cause is. I've attached the trace log.

dump.txt

Excerpt from it:

*** Error in `node': corrupted double-linked list: 0x000000002b02cca0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f894cfb4bfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7f894cfbafc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7733d)[0x7f894cfbb33d]
/lib/x86_64-linux-gnu/libc.so.6(+0x78dfa)[0x7f894cfbcdfa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f894cfbef64]
node(CRYPTO_zalloc+0x61)[0x14460b1]
node(SSL_new+0x39)[0x13519f9]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x6c845)[0x7f894c495845]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x51076)[0x7f894c47a076]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x2b357)[0x7f894c454357]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x6d7d5)[0x7f894c4967d5]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x7c2f5)[0x7f894c4a52f5]
node[0xa55559]
node[0xa5a5c8]
node(uv_run+0x14b)[0xa4a21b]
node(_ZN4node5StartEPN2v87IsolateEPNS_11IsolateDataERKSt6vectorISsSaISsEES9_+0x565)[0x8e6f45]
node(_ZN4node5StartEiPPc+0x469)[0x8e5239]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f894cf642e1]
node[0x89ed85]

...
  • This may have started happening when a client implemented managed channels on its side, but we're unsure how that would lead to crashes on the server.
    • The client may hold a channel open for 5-30s before closing it entirely (which I understand contributes to memory consumption by keeping the socket open)
  • Our grpc servers get millions of grpc connections a day, and we generally don't have control over the client implementations (there are multiple clients/devices) or how they connect to the server
  • We use a cluster of docker containers with 3GB of memory on multiple host boxes - horizontal / vertical scaling only delays the problem
  • Our grpc connection requires TLS
  • We're unable to reproduce the problem locally at the moment (when I do memory profiling, I see GC happening properly, but I'm only launching 5-10 of our own test clients at once to exercise our various flows); it only seems to happen in our high-traffic production environments

Questions:

  • (might not be one for you to answer) - If node runs out of memory, we wouldn't see a corrupted double-linked list type error, but an error specific to running out of memory, correct?
  • Any potential ideas of where / how to troubleshoot based on looking at the trace log?

Environment

grpc-node versions we've tried:

  • 1.20.3
  • 1.16.1

Node Version: 10.15.2 (we've also upgraded node along the way in the 10.x series)

OS: Debian-based (using the docker node:10.15.2 image); we do not do any package updates (e.g. apt update, installing the latest devtools, etc.)

Additional info

We've noticed that we get a lot of these entries in our logs - it could be misconfigured clients; we're unsure if it eventually leads to crashing:

E0702 22:22:08.625087997       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:1417A0C1:SSL routines:tls_post_process_client_hello:no shared cipher.
E0702 22:23:23.517829514       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106203.517802318","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:23:54.952124948       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106234.952105465","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:13.495415849       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown.
E0702 22:27:20.835176028       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106440.835163957","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:26.764072732       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106446.764040019","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:40.745594638       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown.
E0702 22:28:19.820591172       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106499.820576136","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}

We also get these in our logs:

E0703 00:14:17.961441781       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112857.961430955","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:14:51.348432608       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112891.348418529","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:15:32.933723158       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112932.933711032","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:16:14.605935643       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112974.605925035","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:16:27.814974144       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112987.814961950","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}

theogravity avatar Jul 03 '19 01:07 theogravity

First, I think these are separate errors. I think the first "Handshake failed" error indicates that a client tried to connect with a TLS cipher set that the server does not support, and I'm less sure about the second error but my guess is that you're using mutual TLS and the client does not have a recognized certificate. The getpeername error is probably a relatively minor bug; I would bet that the error messages you're seeing are the main impact.

The "corrupted double-linked list" error is one I have also seen very occasionally but I also do not know how to reproduce it. I'm pretty sure it's actually not a memory usage problem; it looks like memory corruption. The first thing that would be useful to know is how frequently you see that error on your servers, in error rate over time and/or related to the number of calls handled. That will help figure out how to proceed.

One solution may be to use the @grpc/grpc-js library. The most recent release of that library added the server APIs. It's still in beta and may not have all of the functionality you need, but it definitely won't have this bug. And if you do try it out it would be very helpful to know what if anything is missing for your use case.
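
For reference, a minimal sketch of what a server looks like on @grpc/grpc-js (the proto path, package/service names, handler, and port below are placeholder assumptions, not taken from this issue, and the beta API surface may differ slightly):

// Minimal @grpc/grpc-js server sketch; names and file paths are hypothetical.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');
const fs = require('fs');

const packageDefinition = protoLoader.loadSync('example.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

const server = new grpc.Server();
server.addService(proto.example.ExampleService.service, {
  // Unary handler that just echoes the request back.
  echo: (call, callback) => callback(null, call.request),
});

const creds = grpc.ServerCredentials.createSsl(null, [{
  private_key: fs.readFileSync('server-key.pem'),
  cert_chain: fs.readFileSync('server-cert.pem'),
}]);

server.bindAsync('0.0.0.0:50051', creds, (err) => {
  if (err) throw err;
  server.start();
});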

murgatroid99 avatar Jul 03 '19 21:07 murgatroid99

Thanks for the response. Unfortunately, we do not have control over the implementation of clients in many cases. We do not use mTLS for our handshaking.

I am interested in the grpc-js library, but we're unsure how much we'd have to change in our own codebase to make it work. We also do a lot of bi-directional streaming.

Given that the core dump references SSL, we're going to put a proxy in front of our service and have it perform TLS termination instead of grpc to remove TLS being a factor.

Will update when I'm able to.

theogravity avatar Jul 09 '19 01:07 theogravity

An update - putting a proxy in front definitely eliminated our SSL error messages, as well as this specific issue (we don't get the corrupted double-linked list crashes anymore). Our setup is:

Proxy does TLS termination -> grpc service running without TLS enabled (eg ServerCredentials.createInsecure() instead of ServerCredentials.createSsl()).
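
A rough sketch of that change on the service side, using the legacy grpc package (the port and certificate filenames below are placeholders; the relevant difference is only which ServerCredentials object is passed to bind):

const grpc = require('grpc');
const fs = require('fs');

const server = new grpc.Server();
// ... addService(...) calls elided ...

// Before: the gRPC server terminated TLS itself.
// server.bind('0.0.0.0:443', grpc.ServerCredentials.createSsl(
//   null,
//   [{ private_key: fs.readFileSync('key.pem'), cert_chain: fs.readFileSync('cert.pem') }]
// ));

// After: plaintext gRPC behind the TLS-terminating proxy.
server.bind('0.0.0.0:50051', grpc.ServerCredentials.createInsecure());
server.start();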

Given that, there has to be something in the TLS/SSL implementation here that's causing it.

Sorry the update took so long; it had to go through our lengthy internal vetting/QA process before it was staged against production-level traffic.

theogravity avatar Aug 27 '19 00:08 theogravity

The "corrupted double-linked list" error is one I have also seen very occasionally but I also do not know how to reproduce it.

This is a reproducible bug in the gRPC server. It relates to https://github.com/grpc/grpc/issues/19430, and I have been able to reproduce it myself. Simply spamming a grpc server with nmap -p $port causes this to occur, both with and without TLS and regardless of whether the server is listening on 0.0.0.0.
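
A minimal repro sketch along those lines (the port and the bare server with no registered services are my own choices for illustration):

// Bare legacy grpc server; the failure happens while accepting connections,
// so no services need to be registered.
const grpc = require('grpc');

const server = new grpc.Server();
server.bind('0.0.0.0:50051', grpc.ServerCredentials.createInsecure());
server.start();

// Then, from another shell, repeatedly scan the port, e.g.:
//   while true; do nmap -p 50051 127.0.0.1; done
// The scanner opens and immediately drops connections, which hits the
// getpeername failure path seen in the logs above.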

The getpeername failure is related, but I would imagine that the issue occurred before this failure. It feels like a race condition in the socket lifecycle, but that is an unsubstantiated guess.

Matt-Esch avatar Jan 12 '22 10:01 Matt-Esch

Running this repro with valgrind

E0113 14:50:31.035581800   24820 tcp_server_custom.cc:220]   getpeername error: {"created":"@1642085431.026475500","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":72,"grpc_status":14,"os_error":"socket is not connected"}
==24820== Invalid read of size 8
==24820==    at 0xBC3ED57: gpr_unref (sync.cc:104)
==24820==    by 0xBB51644: grpc_error_do_unref(grpc_error*) (error.cc:186)
==24820==    by 0xBBC9110: grpc_error_unref (error.h:196)
==24820==    by 0xBBC9110: finish_accept (tcp_server_custom.cc:221)
==24820==    by 0xBBC9110: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==  Address 0x91c2e60 is 0 bytes inside a block of size 232 free'd
==24820==    at 0x4D42D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24820==    by 0xBB52DA4: grpc_error_unref (error.h:196)
==24820==    by 0xBB52DA4: grpc_log_error(char const*, grpc_error*, char const*, int) (error.cc:813)
==24820==    by 0xBBC8F9F: grpc_log_if_error (error.h:271)
==24820==    by 0xBBC8F9F: finish_accept (tcp_server_custom.cc:220)
==24820==    by 0xBBC8F9F: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==  Block was alloc'd at
==24820==    at 0x4D43D2F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24820==    by 0xBC3CE53: gpr_realloc (alloc.cc:83)
==24820==    by 0xBB52477: get_placement (error.cc:205)
==24820==    by 0xBB52477: internal_set_str (error.cc:238)
==24820==    by 0xBB52477: grpc_error_set_str(grpc_error*, grpc_error_strs, grpc_slice const&) (error.cc:484)
==24820==    by 0xBB55559: tcp_error_create (tcp_uv.cc:78)
==24820==    by 0xBB55559: uv_socket_getpeername(grpc_custom_socket*, sockaddr const*, int*) (tcp_uv.cc:232)
==24820==    by 0xBBC8F71: finish_accept (tcp_server_custom.cc:216)
==24820==    by 0xBBC8F71: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==
E0113 14:50:31.052687900   24820 sync.cc:105]                assertion failed: prior > 0

Matt-Esch avatar Jan 13 '22 14:01 Matt-Esch

This appears to be a double-unref bug in tcp_server_custom.cc: https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_server_custom.cc#L210

The call to GRPC_LOG_IF_ERROR takes ownership of the error and unrefs it, as per the guidelines (https://grpc.github.io/grpc/core/md_doc_core_grpc-error.html), so the extra unref call is unnecessary and causes a double-free.

Matt-Esch avatar Jan 13 '22 17:01 Matt-Esch

If that is the problem, it's fixed in the latest version of the library. The grpc package is on version 1.24 of the core, and that bug was fixed there in grpc/grpc#26750. That update was picked up in #1861 and published in version 1.24.11 of the library.

murgatroid99 avatar Jan 13 '22 18:01 murgatroid99

I no longer work at the company in question, but thank you for eventually fixing it / integrating the fix!

theogravity avatar Jan 13 '22 19:01 theogravity

After upgrading and replaying the repro, I can confirm that this is indeed fixed in 1.24.11. There may be some scope for improving the peer-read error handling (should an endpoint be created at all if we know the socket is already closed?), but I'm satisfied that this resolves the issue.

Matt-Esch avatar Jan 13 '22 21:01 Matt-Esch

This library has been deprecated for almost 9 months. We will still make security fixes like this one, but not broader improvements like that.

murgatroid99 avatar Jan 13 '22 21:01 murgatroid99

This library has been deprecated for almost 9 months. We will still make security fixes like this one, but not broader improvements like that.

I would put a notice on the project README that it's in security-fix-only mode and to use the JavaScript-based client instead.

theogravity avatar Jan 13 '22 22:01 theogravity

The notice is there right at the top of the library's README, and it is marked as deprecated in npm. If you're talking about this repository as a whole, both packages live here so that would not be the appropriate place for that notice.

murgatroid99 avatar Jan 13 '22 23:01 murgatroid99

The notice is there right at the top of the library's README, and it is marked as deprecated in npm. If you're talking about this repository as a whole, both packages live here so that would not be the appropriate place for that notice.

Got it. Yeah, I was referring to the main repo for this lib. Following the link does go to the readme with the notice.

theogravity avatar Jan 13 '22 23:01 theogravity

We attempted an upgrade to grpc-js, but we had performance issues, so we're stuck with this for now. I suspect it's not that grpc-js is particularly slow on its own; rather, it's just an issue of piling extra work onto the main thread.

Matt-Esch avatar Jan 14 '22 03:01 Matt-Esch