[SIP] Authentication-based multi-user-single-port
Background
Previous discussions suggest performing authentication (SIP004) on the first chunk with each user's key, identifying the user by whichever key authenticates successfully.
Implementation Consideration
Performing GCM/Poly1305 on the first chunk should be very fast. It's expected that even a naive implementation would support thousands of users without any notable overhead.
Still, we can cache the successful keys for each source IP, which would save most of the computation. To prevent potential DDoS attacks, any IP that fails authentication too many times should be blocked.
Since this SIP doesn't involve any protocol change, only server code needs to be modified. The only limitation is that AEAD ciphers are required.
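To make the trial-decryption idea concrete, here is a minimal Go sketch (hypothetical names such as `findUser`; not code from any actual server). It follows the shadowsocks AEAD construction, where a per-session subkey is derived from each user's master key and the connection salt via HKDF-SHA1, and the sealed length header of the first chunk serves as the authentication probe:

```go
package trialauth

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha1"
	"errors"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveAEAD derives the per-session AES-GCM AEAD from a user's master
// key and the connection salt, as in the shadowsocks AEAD spec
// (HKDF-SHA1 with the fixed info string "ss-subkey").
func deriveAEAD(masterKey, salt []byte) (cipher.AEAD, error) {
	subkey := make([]byte, len(masterKey))
	kdf := hkdf.New(sha1.New, masterKey, salt, []byte("ss-subkey"))
	if _, err := io.ReadFull(kdf, subkey); err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(subkey)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// findUser tries every user's key on the sealed length header of the
// first chunk; whichever key authenticates identifies the user.
func findUser(users map[string][]byte, salt, sealedLen []byte) (string, cipher.AEAD, error) {
	zeroNonce := make([]byte, 12) // the first chunk uses nonce 0
	for name, masterKey := range users {
		aead, err := deriveAEAD(masterKey, salt)
		if err != nil {
			continue
		}
		if _, err := aead.Open(nil, zeroNonce, sealedLen, nil); err == nil {
			return name, aead, nil // authentication succeeded
		}
	}
	return "", nil, errors.New("no key authenticated the first chunk")
}
```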
Example
Jigsaw has implemented a go-ss2-based server here: https://github.com/Jigsaw-Code/outline-ss-server. An early report shows that it works quite well with 100 users: https://github.com/shadowsocks/shadowsocks-org/issues/128#issuecomment-415810597
Have you considered the possibility that NAT might mess with your cache? Namely, if two clients behind the same NAT router try to connect to the same server with different credentials, god bless you, because they will have the same source IP address from the server's point of view.
Maybe that's what we call THE COST :)
Things cannot be perfect. It depends on a BALANCE.
- Not supporting many users on a single port (I mean really many, e.g. 100 users) means many ports have to be opened, which is abnormal server-side behavior.
- Many users packed into a single port, oh yes, it's cool! And it looks kind of "clean" from the server side. The operator of an SSP (shadowsocks service provider) ought to buy you a beer. Also, ss-manager could maybe be retired.
That's just a personal comment. This SIP needs more balancing in any case.
More words: if a shadowsocks server supports only a handful of users, that's abnormal behavior too.
--
Following on from this idea, maybe a later SIP should be about sharing shadowsocks within a kind of circle (known friends? trusted servers?)
Hmm, only if you're okay with the COST of users one day complaining to you that it's not working, because of the NAT-vs-cache issue.
I suggest either not taking the cache approach, or using another protocol that already supports multi-user, like VMess (I haven't looked at the protocol yet, but it seems to support this use case).
Different people prefer a different balance between things. I don't think Shadowsocks is intended to cover every balance you might wish for.
Hmmm… I think if we're gonna officially support multiuser per port, we might as well address the problem cleanly? #54 is still open ^_^
But I agree this hack is neat in that it does not require any changes in the clients. 👍
Also, I should point out that this problem might occur more frequently than you imagine, thanks to the exhausted IPv4 pool and widely deployed CGN. It's likely that one will run into such frustration despite having taken precautions.
CGN is a major concern. We might need to run some tests to determine the rough size of NAT pools used by ISPs doing massive CGN.
NAT should not be a problem, as long as not all of the users are behind the same NAT address.
Say five users are behind the same NAT IP address; then at most five keys are cached for that IP.
This SIP just suggests a kind of multi-user-single-port solution for shadowsocks without modifying the protocol.
But as mentioned by @Mygod, shadowsocks is not designed for this purpose.
I listed this SIP here since it's already implemented in third-party software. If anyone else is interested in it, please follow this SIP and apply the suggested optimizations.
My worry is that people will eventually abuse this hack to run commercial services. It's not gonna scale well when users are mostly behind CGN with a small pool of public IPs, e.g. mobile networks in China.
CGN also applies to ADSL. One also shouldn't forget NAT routers in enterprises, schools, etc. A good way to combat this is to enlarge the cache size and always do a fallback lookup.
A fallback lookup is always needed. Even when a key is cached, authentication is still required; if it fails, a fallback lookup over all keys is performed.
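For illustration, here's a minimal Go sketch of such a cache with the mandatory fallback (hypothetical types and names like `ipKeyCache` and `tryKey`; not the actual outline-ss-server code). Note that each IP maps to a small set of keys, since several clients may share one NAT address:

```go
package keycache

import "sync"

// ipKeyCache maps a source IP to the names of keys that recently
// authenticated from it. Several clients may sit behind one NAT address,
// so each IP caches a small set of keys rather than a single key.
type ipKeyCache struct {
	mu   sync.Mutex
	byIP map[string][]string
}

// candidates returns the cached key names for ip, to be tried first.
func (c *ipKeyCache) candidates(ip string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.byIP[ip]...)
}

// remember records that the key named user authenticated from ip.
func (c *ipKeyCache) remember(ip, user string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, k := range c.byIP[ip] {
		if k == user {
			return
		}
	}
	c.byIP[ip] = append(c.byIP[ip], user)
}

// authenticate tries cached keys first; a cache hit must still pass
// authentication (tryKey). On a miss, it falls back to scanning all keys.
// tryKey is a stand-in for the AEAD trial decryption of the first chunk.
func authenticate(c *ipKeyCache, ip string, allKeys []string,
	tryKey func(name string) bool) (string, bool) {
	for _, name := range c.candidates(ip) {
		if tryKey(name) {
			return name, true
		}
	}
	for _, name := range allKeys { // the fallback lookup
		if tryKey(name) {
			c.remember(ip, name) // cache for the next connection from this IP
			return name, true
		}
	}
	return "", false
}
```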
I don't expect millions of users on one single port. A reasonable assumption is thousands of users per server, hundreds per port.
And of course, it cannot scale for commercial usage.
In some places, the ISP may do NAT for an entire neighborhood, which may include 10,000 end users, by assigning IP addresses with the 100.64 prefix. It is also a kind of NAT.
https://tools.ietf.org/html/rfc6598
From the IANA Considerations section: "IANA has recorded the allocation of an IPv4 /10 for use as Shared Address Space. The Shared Address Space address range is 100.64.0.0/10."
@celeron533 This is the CGN mentioned above.
Hmm, why not use an ElGamal-like method to identify users?
Compatibility.
FYI, Outline Servers have all been migrated to outline-ss-server this week. They don't yet use the single port feature, but we intend to enable it in a few weeks, after I implement the IP->cipher cache.
We can roll that out gradually and see how it performs in the wild. In my own tests, the added latency for 100 users without any optimization on a crappy $5 VPS can be significant (tens of milliseconds), but it varies wildly, and I believe the optimizations will help significantly. Also, outline-ss-server has Prometheus metrics, so we will be able to expose latency metrics, and admins will be able to monitor them.
BTW, outline-ss-server still allows multiple ports, and you can have multiple keys per port and multiple ports per key. You can always start a new port if one becomes overloaded. One nice feature is that you can do that without creating a new process for each port or stopping the running one.
It's worth mentioning that the single-port feature has some very good motivation:
- It makes it a lot easier and safer to configure your server firewall. No need to open all the ports.
- It allows all servers to run on ports 443, 80 or any other usually unblocked port. We found multiple cases of users not being able to use Outline on strict networks that don't allow traffic to high port numbers, or to anything outside a small subset of ports.
- It allows Outline Servers to run in a Docker container without needing --net=host (you can expose the single port instead)
- In the future, we'll be able to run the Outline Server management API and the Shadowsocks service on the same port, by falling back to the HTTPS management API if all keys fail. This will make the servers even harder to detect (you'll get a standard 404).
I now have a benchmark for my single-port implementation: https://github.com/Jigsaw-Code/outline-ss-server/pull/7
These are the results on a $5 Frankfurt DigitalOcean machine that is idle:
```
BenchmarkTCPFindCipher    1000    1304879 ns/op    2015027 B/op    3107 allocs/op
BenchmarkUDPUnpack        3000     615077 ns/op     115427 B/op    1801 allocs/op
```
That's 1.3 ms to go over 100 ciphers for a TCP connection, and 0.6 ms for a UDP datagram. That will probably be worse under load, but it gives an idea of the kind of latency we'd be adding.
There's 2 MB of allocations for one TCP connection. I believe that can be significantly reduced by sharing buffers, but it gets a little tricky with the code structure and different ciphers needing different buffer sizes (I guess I need to find the max buffer size).
@fortuna That's a lot of allocs/op. Is that normal?
PR https://github.com/Jigsaw-Code/outline-ss-server/pull/8 makes the TCP performance on par with UDP. We no longer allocate so much memory:
```
BenchmarkTCPFindCipher-12    1000    1349922 ns/op    125278 B/op    1705 allocs/op
BenchmarkUDPUnpack-12        2000     881121 ns/op    125030 B/op    1701 allocs/op
```
The ~2 MB of allocations were because I was allocating a buffer for an entire encrypted chunk (~16 KB) for each of the 100 ciphers I tried. Now I allocate only one buffer for all the ciphers.
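Roughly, the fix could look like this (a minimal sketch under my reading of the comment, not the actual PR code; `findCipher` is a hypothetical name):

```go
package trialbuf

import "crypto/cipher"

// findCipher trial-decrypts the sealed first chunk with each candidate
// AEAD, reusing one buffer for every attempt instead of allocating a
// fresh ~16 KB buffer per cipher. Plaintext is never longer than the
// ciphertext, so a buffer with len(sealed) capacity fits every cipher.
func findCipher(aeads []cipher.AEAD, sealed []byte) (int, []byte, bool) {
	buf := make([]byte, 0, len(sealed)) // the single shared buffer
	zeroNonce := make([]byte, 32)       // long enough for any AEAD's nonce
	for i, a := range aeads {
		plain, err := a.Open(buf[:0], zeroNonce[:a.NonceSize()], sealed, nil)
		if err == nil {
			return i, plain, true
		}
	}
	return -1, nil, false
}
```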
As for the number of allocations, it's just that I'm doing the operation 100 times. For 1 cipher only, I get these numbers:
```
BenchmarkTCPFindCipher-12    30000    52329 ns/op    1408 B/op    22 allocs/op
BenchmarkUDPUnpack-12       200000     8989 ns/op    1266 B/op    18 allocs/op
```
With the new `findAccessKey` optimization, the allocations and CPU are dominated by the low-level crypto, so I'm not sure there's much room left to improve.
This is without the IP -> cipher cache. I'm trying to make the cipher finding as efficient as possible, to reduce the need for the cache.
FYI, I've added an optimization to outline-ss-server that keeps the most recently used cipher at the front of the list. This way, the time to find the cipher is proportional to the number of ciphers actively in use, rather than the total number of ciphers.
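A minimal sketch of such a move-to-front list (hypothetical `cipherEntry`/`findAndPromote` names; the real outline-ss-server code differs, and a concurrent server would guard this with a lock):

```go
package mrulist

import "container/list"

// cipherEntry pairs an access-key name with a trial-decryption function
// for the first chunk (a stand-in for the real AEAD attempt).
type cipherEntry struct {
	name string
	try  func(sealed []byte) bool
}

// findAndPromote scans the list front to back and moves the matching
// cipher to the front, so the next lookup for an active key is cheap:
// scan time then tracks the number of keys in active use rather than
// the total number of keys configured on the port.
func findAndPromote(l *list.List, sealed []byte) (string, bool) {
	for e := l.Front(); e != nil; e = e.Next() {
		entry := e.Value.(*cipherEntry)
		if entry.try(sealed) {
			l.MoveToFront(e)
			return entry.name, true
		}
	}
	return "", false
}
```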
Furthermore, I've added the `shadowsocks_time_to_cipher_ms` metric, which reports the 50th, 90th, and 99th percentile times to find the cipher for each access key.
This should be enough to tell us whether the performance is good enough. It would be great if people here gave it a try and reported back. The latest binary with the changes is v1.0.3 and can be found in the releases: https://github.com/Jigsaw-Code/outline-ss-server/releases
Update: Outline has been running servers with multi-user support on a single port for a few months now. Some organizations have 300 keys on a server, with over 100 active on any given day. Median latency due to cipher finding is around 10ms and CPU usage is minimal (bandwidth is the bottleneck).
At the 90th percentile you see occasional cases close to 1 second, but that's not common and may be due to other factors, such as a burst in CPU usage (maybe expensive Prometheus queries).
Has anyone here tried the single port feature? How was your experience?
Average 10ms latency looks too slow to me.
Assuming 300 users and the worst case of 300 authentications performed per connection, a single authentication takes 33 µs. That means more than 33k cycles on a 1 GHz CPU, which is too long for authenticating a small packet.
Can you elaborate more about the measurement of latency?
2998 light-kilometers (i.e. 10 ms at the speed of light) might or might not be acceptable depending on the use case; e.g. it's probably not acceptable for game streaming, but probably OK for downloading/video streaming. :smile:
This site says that 20ms is excellent RTT. So 10ms shouldn't be perceptible.
Also, this is latency added per connection, not per packet.
How about UDP connections/packets (which are mostly used in latency-sensitive applications)?
I have a benchmark above: https://github.com/shadowsocks/shadowsocks-org/issues/130#issuecomment-447063760
UDP takes about 9 microseconds per cipher.