
OpenSSL TLS provider not working correctly

Open tripleslash opened this issue 2 years ago • 25 comments

When an HTTPS response with a large content body is sent, not all bytes are transferred and the connection hangs indefinitely until it reaches a timeout.

Example request:

curl -X 'GET' -k \
  'https://192.168.200.232/v2/endpoint' \
  -H 'accept: application/json' \
  -H 'Authorization: Basic dGVzdDo=' --trace-time -v
Note: Unnecessary use of -X or --request, GET is already inferred.
15:44:55.711265 *   Trying 192.168.200.232:443...
15:44:55.717252 * Connected to 192.168.200.232 (192.168.200.232) port 443 (#0)
15:44:55.719318 * ALPN, offering h2
15:44:55.719346 * ALPN, offering http/1.1
15:44:55.719472 * TLSv1.0 (OUT), TLS header, Certificate Status (22):
15:44:55.719503 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
15:44:55.761049 * TLSv1.2 (IN), TLS header, Certificate Status (22):
15:44:55.761226 * TLSv1.3 (IN), TLS handshake, Server hello (2):
15:44:55.761775 * TLSv1.2 (IN), TLS header, Finished (20):
15:44:55.761849 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.761886 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
15:44:55.761913 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.761999 * TLSv1.3 (IN), TLS handshake, Certificate (11):
15:44:55.763293 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.763344 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
15:44:55.763410 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.763449 * TLSv1.3 (IN), TLS handshake, Finished (20):
15:44:55.763489 * TLSv1.2 (OUT), TLS header, Finished (20):
15:44:55.763513 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
15:44:55.763558 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
15:44:55.763591 * TLSv1.3 (OUT), TLS handshake, Finished (20):
15:44:55.763633 * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
15:44:55.763656 * ALPN, server did not agree to a protocol
15:44:55.763675 * Server certificate:
15:44:55.763690 *  subject:
15:44:55.763699 *  start date: Oct 4 08:46:14 2023 GMT
15:44:55.763707 *  expire date: Sep 10 08:46:14 2123 GMT
15:44:55.763722 *  issuer:
15:44:55.763752 *  SSL certificate verify result: self-signed certificate in certificate chain (19), continuing anyway.
15:44:55.763795 * TLSv1.2 (OUT), TLS header, Supplemental data (23):
15:44:55.763831 > GET /v2/endpoint HTTP/1.1
15:44:55.763831 > Host: 192.168.200.232
15:44:55.763831 > User-Agent: curl/7.81.0
15:44:55.763831 > accept: application/json
15:44:55.763831 > Authorization: Basic dGVzdDo=
15:44:55.763831 >
15:44:55.776094 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.776234 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
15:44:55.776353 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:55.776497 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
15:44:55.776742 * old SSL session ID is stale, removing
15:44:56.901714 * TLSv1.2 (IN), TLS header, Supplemental data (23):
15:44:56.910324 * Mark bundle as not supporting multiuse
15:44:56.910420 < HTTP/1.1 200 OK
15:44:56.910559 < content-length: 56531
15:44:56.910639 < content-type: application/json; charset=utf-8
15:44:56.910772 < server: backend-api/0.1.0.6822
15:44:56.910938 < access-control-allow-origin: *
15:44:56.911114 < date: Mon, 27 Nov 2023 14:44:52 GMT
15:44:56.911246 < {"json_content_that_will_be_cut_off
15:44:56.912609 * TLSv1.2 (IN), TLS header, Supplemental data (23):
^C   ////// <<<< timeout

Is it possible that the last chunk of data is not flushed to the output?

tripleslash avatar Nov 27 '23 14:11 tripleslash

Which version of trantor or drogon did you use?

an-tao avatar Nov 27 '23 15:11 an-tao

My trantor lib is built from commit https://github.com/an-tao/trantor/tree/26a599857a2fa5d61823ef85012e3d9123d5a675 as of right now, and drogon is built from the 1.9.0 tag.

I saw there were no relevant changes to the OpenSSL handling in either drogon or trantor since those tags.

tripleslash avatar Nov 27 '23 15:11 tripleslash

If I keep the connection open and don't cancel it I get the following curl error after 30 seconds:

[screenshot: curl error message]

tripleslash avatar Nov 27 '23 15:11 tripleslash

Is it possible that the last chunk of data is not properly flushed to the TCP socket when the HTTP connection is closed? Maybe there are still some bytes pending and trantor does not wait until they are fully transferred to the client before closing the connection?

tripleslash avatar Nov 27 '23 15:11 tripleslash

I can't reproduce this error in my environment.

curl -X 'GET' -k https://10.0.0.100:1100/test -v -s -o /dev/null
*   Trying 10.0.0.100:1100...
* Connected to 10.0.0.100 (10.0.0.100) port 1100 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [6 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [1470 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [520 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  start date: Feb  9 12:37:34 2023 GMT
*  expire date: Jan 16 12:37:34 2123 GMT
*  issuer: C=TW; ST=Taipei; L=Taipei; O=Martin; OU=Martin; CN=Martin
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /test HTTP/1.1
> Host: 10.0.0.100:1100
> User-Agent: curl/7.81.0
> Accept: */*
> 
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [233 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 56531
< content-type: text/html; charset=utf-8
< server: drogon/1.9.1
< date: Mon, 27 Nov 2023 15:49:03 GMT
< 
{ [16243 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection #0 to host 10.0.0.100 left intact

@marty1885 What do you think might be causing this issue?

an-tao avatar Nov 27 '23 15:11 an-tao

I had a similar issue once in an implementation of a TCP SSL socket using asio. It turned out I had to call asio::ssl_stream::shutdown before calling the socket close method. Maybe it is something similar here, where some bytes remain unsent on the socket until it is gracefully shut down (which would cause a flush)?

tripleslash avatar Nov 27 '23 17:11 tripleslash

Also happens with latest trantor (v1.5.15) and drogon (v1.9.1)

tripleslash avatar Nov 27 '23 20:11 tripleslash

I can't replicate the issue whatsoever either. This is what is blocking me from fixing it. :(

@tripleslash Can you share details about the environment you are using? OpenSSL version, compiler, distro, etc. And if you have Botan 3, can you try it and see if the same thing happens?

marty1885 avatar Nov 28 '23 04:11 marty1885

Custom-built Linux BSP with kernel 6.5.12 and OpenSSL 3.1.0. I can't change my TLS provider because many other applications on my system depend on OpenSSL. The compiler is GCC 12.2.1 (armv7).

tripleslash avatar Nov 28 '23 06:11 tripleslash

20231128 08:24:30.062488 UTC 4347 TRACE [newFileResponse] send http file:/var/www//ui/assets/images/loadingd80ae7c176ae41d8797c.png offset 0 length 0 - HttpResponseImpl.cc:296
20231128 08:24:30.063095 UTC 4347 TRACE [sendStaticFileResponse] Save in cache for 5 seconds - StaticFileRouter.cc:562
20231128 08:24:30.063734 UTC 4347 TRACE [renderToBuffer] reponse(no body):HTTP/1.1 200 OK
content-length: 51581
content-type: image/png
server: backend-api/0.1.0.6824
access-control-allow-origin: *
accept-range: bytes
expires: Thu, 01 Jan 1970 00:00:00 GMT
last-modified: Mon, 27 Nov 2023 15:39:30 GMT
date: Tue, 28 Nov 2023 08:24:30 GMT

 - HttpResponseImpl.cc:704
20231128 08:24:30.069186 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.086815 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.086909 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.093514 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.107780 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.107874 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.124704 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.155241 UTC 4347 TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
20231128 08:24:30.155333 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:30.181652 UTC 4347 TRACE [sendFileInLoop] send file in loop - TcpConnectionImpl.cc:1133
20231128 08:24:30.223903 UTC 4347 TRACE [sendFileInLoop] send file in loop: return on loop exit - TcpConnectionImpl.cc:1234
20231128 08:24:33.879093 UTC 4347 TRACE [operator()] Erase cache - StaticFileRouter.cc:568
20231128 08:24:34.320739 UTC 4347 TRACE [recvData] Received 473 bytes from lower layer - OpenSSLProvider.cc:533
20231128 08:24:34.321107 UTC 4347 TRACE [processApplicationData] Received 451 bytes from SSL - OpenSSLProvider.cc:739

This is what drogon.log shows at the highest log level when an incomplete file is transferred. Maybe this can help you find the problem?

tripleslash avatar Nov 28 '23 08:11 tripleslash

Can you test your application on a normal Linux x86 server?

an-tao avatar Nov 28 '23 08:11 an-tao

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
This line always seems to appear when the issue occurs.

tripleslash avatar Nov 28 '23 08:11 tripleslash

TRACE [sendFileInLoop] send file in loop: break on socket buffer full (?) - TcpConnectionImpl.cc:1215
This line always seems to appear when the issue occurs.

You could modify that line to output the error number.

an-tao avatar Nov 28 '23 08:11 an-tao

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm... I agree this could be solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL; trantor will simply use Botan 3 underneath, unless you are setting OpenSSL-specific configuration commands.

marty1885 avatar Nov 28 '23 08:11 marty1885

Resource temporarily unavailable (errno=11) send file in loop: break on socket buffer full (?) - 11

errno = 11 = EAGAIN

tripleslash avatar Nov 28 '23 08:11 tripleslash

That is normal and expected. That sounds like the async socket is full and we are waiting for the OS to make the socket writable again. Hmm... I agree this could be solved by your proposal. But I really need to replicate the bug locally to fix it.

@tripleslash You should be able to switch the TLS provider. Other code can keep using OpenSSL; trantor will simply use Botan 3 underneath, unless you are setting OpenSSL-specific configuration commands.

It is quite some effort to add a new BSP build rule to build a completely new TLS library. Apart from that, I really would like to use the same TLS implementation across the entire system.

tripleslash avatar Nov 28 '23 08:11 tripleslash

The issue seems to happen more frequently when you connect to the device from outside the intranet or over SSH proxies, for example. Within the intranet it is hard to reproduce, I agree.

Try to establish a connection from some remote network (maybe over your phone via LTE) and see if it happens for you. If not, then try an SSH SOCKS5 proxy.

The file also needs to be quite large; 100 KB should be enough.

tripleslash avatar Nov 28 '23 09:11 tripleslash

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed: the HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to close. Thus it does not look like something got terminated early. Maybe this has something to do with threading...?

marty1885 avatar Dec 04 '23 13:12 marty1885

@tripleslash I've looked into the issue but it's baffling me. Is it possible for me to get access to a system that can replicate the issue?

Another weird thing that I noticed: the HTTP/1.1 response should not (and does not, according to the logs) cause the TCP connection to close. Thus it does not look like something got terminated early. Maybe this has something to do with threading...?

What type of access are you looking for? I could forward an HTTPS port but cannot give you SSH access to the device. If that's good enough for you to see the issue in action, then you can add me on Discord under xerabtw.

tripleslash avatar Dec 04 '23 14:12 tripleslash

@tripleslash Sorry getting back to you late. Can you try the latest master? We fixed a bug that may be related to this bug report.

marty1885 avatar Dec 17 '23 03:12 marty1885

@marty1885 This issue still persists with the current master of trantor+drogon

tripleslash avatar Dec 18 '23 08:12 tripleslash

I think the PR may resolve this problem. @tripleslash would you please check it out?

an-tao avatar Dec 21 '23 06:12 an-tao

I think the PR may resolve this problem. @tripleslash would you please check it out?

I can confirm that the issue seems to be gone with the refactored trantor API! @an-tao

tripleslash avatar Dec 21 '23 15:12 tripleslash

Another update on this: I was able to still reproduce the error, but with the trantor refactoring it seems to happen much more rarely now.

tripleslash avatar Dec 21 '23 19:12 tripleslash

@tripleslash Thanks so much for your testing on this. I added some debug logs to the destructor of the TcpConnectionImpl class (https://github.com/an-tao/trantor/pull/312/commits/6f854779ecaa96a43a249265045b263328d49e3f). Please check it out. When the hanging occurs, disconnect from the client and look for those logs on the server; let's figure it out.

an-tao avatar Dec 22 '23 02:12 an-tao