shadowsocks-android

'Too many open files' on Android 6.0-8.1

Open imReker opened this issue 2 years ago • 8 comments

LocalDnsWorker.accept throws Broken pipe when UDP is filtered or the network is disconnected. If DNS queries keep coming in after that, the number of Unix socket handles held by the VpnService process exceeds the handle limit of 1024 (32768 on Android 9.0 and newer), and the VpnService process ends up throwing Too many open files and Bad file descriptor exceptions everywhere. Meanwhile, because the UDP DNS query on the Java side times out, sslocal sends a TCP DNS query through a 'java protected' socket, which creates the same number of socket handles in sslocal. (Why does sslocal make a TCP query again?) As a result, both VpnService and sslocal crash at random times.

Logs: org.shadowsocks.xx_issue_19bc73ad993aad4d5fe278892d584231_error_session_61279182004D00013C85A04AC568A81B_DNE_5_v2.log org.shadowsocks.xx_issue_274bc2d242720049275714683d3d4cc5_error_session_6127B31401C900010D2CF9C39D05D8E2_DNE_0_v2.log org.shadowsocks.xx_issue_2d5e1ddcbf72ff6f25953b540bd48ff5_error_session_6127C4D9026F00011AC0A04AC568A81B_DNE_0_v2.log

imReker, Sep 03 '21 14:09

I dumped the file descriptor info. A single DNS query from an app makes shadowsocks create at least 2 handles in VpnService: the first is the local_dns_path socket for the UDP query, and the second is the protect_path socket for the TCP query.

fd list size = 928

fd list- 1bf: SOCK: socket:[23488393] UNIX / -- /
fd list- 1c0: SOCK: socket:[23472492] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1c2: SOCK: socket:[23490568] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1c3: SOCK: socket:[23478133] UNIX / -- /
fd list- 1c4: SOCK: socket:[23466596] UNIX / -- /
fd list- 1c6: SOCK: socket:[23478364] UNIX / -- /
fd list- 1c7: SOCK: socket:[23466600] UNIX / -- /
fd list- 1c8: SOCK: socket:[23468862] UNIX / -- /
fd list- 1c9: SOCK: socket:[23490569] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1ca: SOCK: socket:[23466603] UNIX / -- /
fd list- 1cb: SOCK: socket:[23478366] UNIX / -- /
fd list- 1cc: SOCK: socket:[23488399] UNIX / -- /
fd list- 1ce: SOCK: socket:[23466607] UNIX / -- /
fd list- 1cf: SOCK: socket:[23484823] UNIX / -- /
fd list- 1d1: SOCK: socket:[23476731] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d2: SOCK: socket:[23490570] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1d3: SOCK: socket:[23467117] UNIX / -- /
fd list- 1d5: SOCK: socket:[23476736] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d6: SOCK: socket:[23488418] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd
fd list- 1d7: SOCK: socket:[23480411] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1d8: SOCK: socket:[23479978] UNIX / -- /
fd list- 1d9: SOCK: socket:[23461630] UNIX / -- /
fd list- 1da: SOCK: socket:[23467746] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1db: SOCK: socket:[23479988] UNIX / -- /
fd list- 1dc: SOCK: socket:[23467121] UNIX / -- /
fd list- 1dd: SOCK: socket:[23467126] UNIX / -- /
fd list- 1de: SOCK: socket:[23467748] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1df: SOCK: socket:[23480413] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e0: SOCK: socket:[23466611] UNIX / -- /
fd list- 1e1: SOCK: socket:[23479119] UNIX / -- /
fd list- 1e2: SOCK: socket:[23480797] UNIX / -- /
fd list- 1e3: SOCK: socket:[23478626] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e4: SOCK: socket:[23480440] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e5: SOCK: socket:[23466617] UNIX / -- /
fd list- 1e6: SOCK: socket:[23477117] UNIX / -- /
fd list- 1e7: SOCK: socket:[23432212] UNIX / -- /
fd list- 1e8: SOCK: socket:[23480445] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
fd list- 1e9: SOCK: socket:[23480447] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd
.................
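To see how the descriptors split between the two socket paths, a dump like the one above can be tallied mechanically. The following is a minimal sketch, not part of shadowsocks-android: the class name `FdTally` is hypothetical, and it assumes each entry has the form `UNIX <path> -- <path>` exactly as printed in the dump, with a bare `/` meaning no path was reported.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: tallies "fd list-" dump lines by the socket path shown after "UNIX". */
final class FdTally {
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int unix = line.indexOf(" UNIX ");
            int sep = line.indexOf(" -- ");
            if (unix < 0 || sep < unix + 6) continue; // not a Unix socket entry in the assumed format
            String path = line.substring(unix + 6, sep).trim();
            // A bare "/" carries no path in the dump; group those entries under one key.
            String key = path.equals("/") ? "(no path)" : path.substring(path.lastIndexOf('/') + 1);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "fd list- 1bf: SOCK: socket:[23488393] UNIX / -- /",
            "fd list- 1c0: SOCK: socket:[23472492] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/local_dns_path -- /dev/socket/dnsproxyd",
            "fd list- 1c2: SOCK: socket:[23490568] UNIX /data/user_de/0/org.shadowsocks.xx/no_backup/protect_path -- /dev/socket/dnsproxyd");
        System.out.println(tally(sample));
    }
}
```

Running the tally over the full dump would show whether local_dns_path and protect_path entries grow together, which is what the "at least 2 handles per query" observation predicts.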

imReker, Sep 03 '21 15:09

These seem normal. These fds are not closed? Does your server connection work properly?

Mygod, Sep 05 '21 04:09

> These seem normal. These fds are not closed? Does your server connection work properly?

Most of them will be closed eventually, unless the process crashes first with Too many open files. In poor network environments the UDP packet loss rate can be very high, and that triggers this issue.

The key point of this issue is the different timeouts on the Java and Rust sides. LocalDnsWorker uses Java's getAllByName, whose timeout is defined by the system, usually 90 seconds. On the Rust side the timeout is 5 seconds. So when the network is very slow or UDP is filtered, a DNS query sent to the Java side waits on system I/O until the 90s timeout, while the Rust side fails after 5s and returns the failure to the app that made the DNS request. The app then requests again, and if it retries with no interval and no retry limit, LocalDnsWorker.accept creates thousands of socket FDs within those 90 seconds.
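One way to align the two timeouts would be to stop waiting on the blocking lookup after a fixed deadline, so the caller can close its socket and release the FD. The sketch below is only an illustration under assumptions: `BoundedLookup` and its methods are hypothetical names, not code from LocalDnsWorker, and the lookup is not bound to any Network.

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: puts a deadline on a blocking lookup such as InetAddress.getAllByName. */
final class BoundedLookup {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-lookup");
        t.setDaemon(true); // stuck lookups must not keep the process alive
        return t;
    });

    /** Returns the lookup result, or null if it fails or takes longer than timeoutMs. */
    static <T> T callWithTimeout(Callable<T> lookup, long timeoutMs) {
        Future<T> future = POOL.submit(lookup);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: a blocked resolver call may ignore the interrupt
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null; // the lookup itself failed, e.g. UnknownHostException
        }
    }

    /** Resolves host with the system resolver, waiting at most timeoutMs. */
    static InetAddress[] resolve(String host, long timeoutMs) {
        return callWithTimeout(() -> InetAddress.getAllByName(host), timeoutMs);
    }
}
```

A call such as `BoundedLookup.resolve("example.com", 5000)` would give up after the same 5 seconds as the Rust side. Note the trade-off: the resolver thread may keep running in the background until the system timeout, so this caps how long the socket FD is held, not how many threads are stuck.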

But I still don't know why the protect_path sockets are leaked too.

imReker, Sep 06 '21 02:09

It is technically not a leak if they are eventually closed? Although I am down to tweak timeouts. Where did you find the 90s timeout?

Mygod, Sep 06 '21 16:09

The 90s timeout is an empirical value read from the logs, so it's not accurate. It's not a traditional 'leak', but I think it is still an issue because of the mismatched timeouts and the thousands of DNS retries they cause. Maybe 'denial of service' is more accurate? Currently, to work around this issue, I set a counter in LocalDnsWorker.accept: when there are more than 200 pending DNS queries, accept just returns an empty response to sslocal (this limit could be enforced in sslocal as well). I think the correct fix is to replace getAllByName with dnsjava, which can set a timeout on a query. But we would need to modify it so that the caller can set a Network for it to create sockets on.
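The pending-query limit described above could look roughly like the following. This is a minimal sketch of the idea only, not the actual patch: `PendingQueryGate` is a hypothetical name, and it uses a semaphore where the comment mentions a counter.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical sketch: rejects new queries once too many are already pending. */
final class PendingQueryGate {
    private final Semaphore permits;

    PendingQueryGate(int maxPending) {
        this.permits = new Semaphore(maxPending);
    }

    /** Runs the query if a slot is free; otherwise returns fallbackWhenFull immediately. */
    <T> T run(Supplier<T> query, T fallbackWhenFull) {
        if (!permits.tryAcquire()) return fallbackWhenFull; // over the limit: answer at once
        try {
            return query.get();
        } finally {
            permits.release(); // free the slot even if the query throws
        }
    }

    /** Number of free slots, mainly useful for logging and tests. */
    int available() {
        return permits.availablePermits();
    }
}
```

With `new PendingQueryGate(200)`, the 201st concurrent query would get the fallback (for example an empty response, as in the workaround) without waiting, so the number of sockets held open by slow lookups stays bounded.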

imReker, Sep 06 '21 17:09

Sounds good. I will take a look sometime.

Does this issue go away if you use the "All" Route?

Mygod, Sep 07 '21 03:09

Currently I use ACL with Bypass LAN. I think this issue doesn't exist with the 'All' route, since DNS queries are not passed to the Java side (so no extra FDs are created) and they have the 5s timeout.

And maybe Unix socket connection reuse (ref #2751) is still needed? Because Rust makes 2-3 DNS queries per connection, there is still a small chance of creating over 1000 FDs before the 5s timeout.

imReker, Sep 07 '21 06:09

@Mygod I modified a little of dnsjava's code (mainly Network-related work and Java 8/Android adaptation) and it works! The only downside is that dnsjava 3.4.1 doesn't support Android 6.x because of Java NIO. (Older versions support Android 6.x but use blocking sockets, so they may still result in the same issue.) I'll run a stress test again tomorrow.

imReker, Sep 09 '21 16:09

Closing as Android versions too old.

Mygod, Dec 20 '22 16:12