SSLproxy icon indicating copy to clipboard operation
SSLproxy copied to clipboard

SSLproxy exits with error after some time, fd leak?

Open victorjulien opened this issue 2 months ago • 19 comments

I see SSLproxy (bd43af7a1017fb9ba509e339003c39106cd14004) run well for a while, when exit with error. Messages suggest it runs out of file descriptors suggesting they get leaked somewhere.

sudo ./src/sslproxy -D -l connections.log -j /tmp/sslsplit/ -S logdir/ -k ../sslsplit/ca.key -c ../sslsplit/ca.crt -Y pcaps/ -oVerifyPeer=no autossl 192.168.0.1 8465 up:8080 autossl 192.168.0.1 8143 up:8080 autossl 192.168.0.1 8110 up:8080
[...]
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
Error 24 on listener: Too many open files
SSL_free() in state 00000001 = 0001 = SSLOK (SSL negotiation finished successfully) [accept socket]      
SSL_free() in state 00000001 = 0001 = SSLOK (SSL negotiation finished successfully) [connect socket]     
Garbage collecting caches started.       
Garbage collecting caches done.          
Connecting to [142.250.102.109]:995      
CONN: autossl 192.168.0.5 36392 142.250.102.109 995 user:-                                               
AUTOSSL connected to [142.250.102.109]:995
AUTOSSL connected from [192.168.0.5]:36392                                                               
Received privsep req type 01 sz 94 on srvsock 14                                                         
Checking for a client hello              
Peek found ClientHello                   
Received privsep req type 01 sz 94 on srvsock 16
Attempt reuse dst SSL session                                                                            
SSL_free() in state 00000001 = 0001 = SSLOK (SSL negotiation finished successfully) [connect socket]     
Garbage collecting caches started.        
Garbage collecting caches done.                 
Connecting to [142.250.102.108]:995      
CONN: autossl 192.168.0.5 42568 142.250.102.108 995 user:-                                               
AUTOSSL connected to [142.250.102.108]:995      
AUTOSSL connected from [192.168.0.5]:42568
Received privsep req type 01 sz 94 on srvsock 14                                                         
Checking for a client hello              
Peek found ClientHello                   
Received privsep req type 01 sz 94 on srvsock 16
Attempt reuse dst SSL session                                                                            
Opening pcapdir file '/home/victor/build/SSLproxy/pcaps/20251004T152451Z-192.168.0.5,42568-142.250.102.108,995.pcap' failed: Success (0)                                                                           
Error 0 on logger: Success                
Main event loop stopped (reason=-1).            
Error building tcp header: (null)Error building packet                                                   
Failed to write to pcap log: Success (0) 
Error 0 on logger: Success                      
Received privsep req type 00 sz 1 on srvsock 8
Received privsep req type 00 sz 1 on srvsock 14                                                                                                                                                                    
Received privsep req type 00 sz 1 on srvsock 12
Received privsep req type 00 sz 1 on srvsock 16
Child pid 75442 exited with status 1                                                                     

victorjulien avatar Oct 05 '25 07:10 victorjulien

The error message strongly suggests resource exhaustion (running out of available FDs) rather than a simple crash. We try to mitigate this specific type of crash during connection acceptance by comparing the descriptor table count against the available limit, as detailed in this TODO task. Note the difficulties I specifically mention there regarding the logging subsystem.

I suspect the crash you saw is possible with the -Y option, especially if SSLproxy was already very close to the system's open files limit (1024 by default). I've frankly never run the -Y option under actual production load. We clearly need to implement the same resource check mechanism in the logging subsystem that the proxy core uses.

While we've used static analysis tools and Valgrind, I wouldn't entirely rule out the possibility of a genuine fd leak as you suspect, particularly in the logging subsystem.

To help narrow this down, could you provide some context on the crash?

  1. Load: Was SSLproxy under heavy load (too many connections relative to the fd limit) when this happened?
  2. Limit: Do you know if SSLproxy was already very close to the fd limit?
  3. Reproducibility: Does this only happen when the -Y option is enabled?

For better diagnostics, I'd appreciate it if you could:

  • Enable the DEBUG_PROXY switch in Mk/main.mk, recompile, and start SSLproxy with the -D4 option. This verbose logging should include current fd usage too.
  • Provide a coredump, if available (but sslproxy simply exits here, doesn't it?).

In the meantime, I'll review the logging subsystem source code yet again to see if we can prevent these fd-related crashes, similar to the proxy core logic. Until we find a definitive fix, please increase the fd limit on your system using the OpenFilesLimit option.

Thanks for reporting.

sonertari avatar Oct 05 '25 08:10 sonertari

@victorjulien I have just pushed a change on the check_fd_usage_before_content_logging branch, which is expected to prevent the issue you have reported. I hope this helps you continue with your work. Can you test and report please?

Btw, this is yet another issue with using the PCAP options in SSLproxy to feed decrypted traffic to an IDS, page 5 of my presentation, which I should have mentioned in the webinar.

sonertari avatar Oct 05 '25 16:10 sonertari

Haven't seen the issue in the last 2 days while running the check_fd_usage_before_content_logging branch.

My load is very low. Just a single pop3 poll job every minute or so. I haven't tried your other suggestions yet. Can look into those if that is useful.

victorjulien avatar Oct 08 '25 07:10 victorjulien

Great, if that change helps you continue your work, no need to debug any further. But it's strange that you had this issue with very low load. Btw, I ran the e2e tests with the -Y option enabled and using valgrind, it does not report any fd leak. Still strange though.

sonertari avatar Oct 08 '25 07:10 sonertari

I just saw the issue again, while running the check_fd_usage_before_content_logging branch. Will look at your other suggestions when I find some time.

victorjulien avatar Oct 09 '25 11:10 victorjulien

Another possibility for this out-of-fds issue is persistent connections, but I don't think your pop3 connections are persistent (or are they for some reason?). In any case, to prevent such resource exhaustion, SSLproxy terminates persistent connections if they are idle for 120 seconds by default. But if they break their silence even once during that 120 seconds period, they do not expire but remain in the connection list, hence continue consuming fds, which may lead to the out-of-fds issue after a few days.

Yet another possibility I have seen with new users is related with how they divert packets to sslproxy, which causes infinite redirection loops. I don't remember the exact details, but it usually happens almost immediately after sslproxy starts. So I don't think this is the case with you.

Have you increased the OpenFilesLimit value in SSLproxy, or is it still 1024? Do you still use the -Y option?

Also, is this the same issue where sslproxy exits, or does sslproxy reports that it's out of fds, but continues? The later is the intended outcome of the change in the check_fd_usage_before_content_logging branch: It is supposed to mitigate the out-of-fds issue, not prevent it.

Yes, please do check the verbose logs with the -D4 option. You are welcome to send them to me, if possible (I'd like to solve this issue, if it's due to sslproxy).

sonertari avatar Oct 09 '25 12:10 sonertari

It's effectively the same output as above, with exactly the same config/commandline. Just running the check_fd_usage_before_content_logging branch. Will try the other suggestions next.

victorjulien avatar Oct 09 '25 12:10 victorjulien

Saw the same thing again w/o the -Y pcaps option. Trying w/o -S next.

victorjulien avatar Oct 15 '25 08:10 victorjulien

I haven't seen this anymore after dropping the -S option.

victorjulien avatar Oct 27 '25 18:10 victorjulien

I think I have found the root cause of the fd leak. And I think it's not the logging subsystem (unless there is another fd leak there).

If you run the following command (but make sure you use the pid of the privsep client running as the regular user victor):

sudo lsof -p <PID of sslproxy privsep client>

You will see at least a couple of stale connection like this:

sslproxy 771707 nobody  326u     IPv4           22742197      0t0       TCP localhost:51082->localhost:http-alt (ESTABLISHED)

These are orphaned connections from sslproxy to the listening program (lp), and the source of the fd leak.

Now, if you terminate the lp program, you should see that sslproxy crashes too, with signal 11 or an assertion failure.

Can you please confirm the above for me?

sonertari avatar Oct 28 '25 05:10 sonertari

Indeed. There are 157 such entries currently.

Also correct on the crash. If I close lp, sslproxy crashes. Since I have asan enabled it shows the same asan error I shared privately before (sadly my attempt of getting symbols failed).

victorjulien avatar Oct 28 '25 06:10 victorjulien

@victorjulien, can you test the last change I have just pushed please?

I am positive that it fixes the fd leak, because I was able to reliably reproduce the issue here as well.

sonertari avatar Oct 28 '25 09:10 sonertari

Just pulled in your latest branch and did a quick test with killing lp. I can confirm that it doesn't crash sslproxy. I will let this run for a while now and let you know how it goes.

victorjulien avatar Oct 29 '25 21:10 victorjulien

I closed my last client ~8 hours ago, but still seeing

Every 60.0s: lsof -p 24598|grep localhost                                                                                                                                           c2758: Thu Oct 30 07:10:00 2025

sslproxy 24598 nobody   48u     IPv4             988306      0t0      TCP localhost:45382->localhost:http-alt (CLOSE_WAIT)
sslproxy 24598 nobody   49u     IPv4             989229      0t0      TCP localhost:39324->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   50u     IPv4             987520      0t0      TCP localhost:35170->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   51u     IPv4             986965      0t0      TCP localhost:39356->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   52u     IPv4             990253      0t0      TCP localhost:57308->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   53u     IPv4             991335      0t0      TCP localhost:42868->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   54u     IPv4             992385      0t0      TCP localhost:58932->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   55u     IPv4             990765      0t0      TCP localhost:40636->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   56u     IPv4             993745      0t0      TCP localhost:48242->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   57u     IPv4             990984      0t0      TCP localhost:48258->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   58u     IPv4             991228      0t0      TCP localhost:46272->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   59u     IPv4             993107      0t0      TCP localhost:38912->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   60u     IPv4             997435      0t0      TCP localhost:38448->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   61u     IPv4             994954      0t0      TCP localhost:34914->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   62u     IPv4             997900      0t0      TCP localhost:42788->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   63u     IPv4             999749      0t0      TCP localhost:55892->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   64u     IPv4             999981      0t0      TCP localhost:44068->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   65u     IPv4            1000103      0t0      TCP localhost:47062->localhost:http-alt (ESTABLISHED)
sslproxy 24598 nobody   66u     IPv4            1002031      0t0      TCP localhost:54732->localhost:http-alt (ESTABLISHED)

If I understand correctly these are connections between sslproxy and lp. I had expected to see 0 here within minutes after closing my email client.

victorjulien avatar Oct 30 '25 06:10 victorjulien

This was not supposed to happen. So:

  1. The output you have posted shows that sslproxy runs as nobody, but your commandline in your first post does not configure for that. Aren't you supposed to have victor as the sslproxy user, or have you switched to nobody as sslproxy user?

  2. You did not say if closing lp crashes sslproxy too, and if that really doesn't happen anymore, then those orphaned connections do not belong to the current instance of sslproxy. Is it possible that you have an old instance of sslproxy running?

  3. Are you sure you are using my last commit? (Sanity check, sorry.)

Note that closing lp should definitely crash sslproxy with socket-based fd leaks, because lp sends final RST/FIN packets to sslproxy, which triggers eof events for the orphaned bufferevents of those sockets, but all memory of the connections who created those bufferevents have been freed already, which causes all sorts of memory access violations.

Since I am pretty sure my last commit fixes that, I would say there is perhaps another similar issue. But I am confused due to my second question above.

sonertari avatar Oct 30 '25 07:10 sonertari

Apologies, I realized now that I did pull your code but forgot to build it ;-(

victorjulien avatar Oct 30 '25 11:10 victorjulien

Quick update: SSLproxy+lp+my suricata branch have been running great in the last week with check_fd_usage_before_content_logging (SSLproxy v0.9.8-18-gc43ce62 (built 2025-10-30)). No crashes or other issues.

victorjulien avatar Nov 05 '25 20:11 victorjulien

Quick update: SSLproxy+lp+my suricata branch have been running great in the last week with check_fd_usage_before_content_logging (SSLproxy v0.9.8-18-gc43ce62 (built 2025-10-30)). No crashes or other issues.

Excuse me, is it convenient for you to share more information about the solution of running Suricata and sslproxy together?

delsadan avatar Nov 20 '25 04:11 delsadan

I know the question is directed to @victorjulien, but here are a couple of links to relevant info, for anyone interested:

Note that Suricon 2025 is ongoing at the moment.

sonertari avatar Nov 20 '25 07:11 sonertari