
High CPU load & event loop stall on massive socket close

Open allright opened this issue 6 years ago • 63 comments

Expected behavior

No stalls on socket close.

Actual behavior

Stalls for 10-30 seconds, sometimes up to a disconnect by timeout.

Steps to reproduce

Video: https://yadi.sk/i/ZmAu8La5zLWfSg (download the file to watch in full quality). Sources: https://github.com/allright/swift-nio-load-testing/tree/master/swift-nio-echo-server. Commit: https://github.com/allright/swift-nio-load-testing/commit/a461c72f2adce2e6fabbb981307166178ac2e397

VPS: 1 CPU, 512 MB RAM, Ubuntu 16.04

  1. Download and compile swift-nio-echo-server (based on NIOEchoServer from swift-nio; a minimal sketch of the handler follows after this list).
  2. Compile in release mode & run.
  3. Connect to the server manually with a telnet client.
  4. Run tcpkali -c 20000 --connect-rate=3000 --duration=10000s --latency-connect -r 1 -m 1 echo-server.url:8888
  5. Wait until the server has > 15000 connections.
  6. While waiting, type in telnet & see the echo response immediately.
  7. Stop tcpkali with Ctrl+C.
  8. Type in telnet & DO NOT RECEIVE ANY RESPONSE!
  9. Wait 10...30 seconds, until all connections have been closed by timeout.
  10. Type in telnet & get the echo response immediately (sometimes telnet may be disconnected by the 30-second timeout in the code).
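
For reference, the handler in the echo server is essentially just this (a minimal sketch, close to NIOEchoServer's handler; NIO 1 uses the ctx: label, newer NIO renamed it to context:):

import NIO

// Echo handler used by the test server (NIO 1 style naming).
final class EchoHandler: ChannelInboundHandler {
    typealias InboundIn = ByteBuffer
    typealias OutboundOut = ByteBuffer

    func channelRead(ctx: ChannelHandlerContext, data: NIOAny) {
        // Echo the bytes straight back; a nil promise avoids an extra allocation.
        ctx.write(data, promise: nil)
    }

    func channelReadComplete(ctx: ChannelHandlerContext) {
        // Flush once per read burst rather than per message.
        ctx.flush()
    }
}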

root@us-san-gate0:~/swift-nio-load-testing/swift-nio-echo-server# cat Package.resolved
{
  "object": {
    "pins": [
      {
        "package": "swift-nio",
        "repositoryURL": "https://github.com/apple/swift-nio.git",
        "state": {
          "branch": "nio-1.13",
          "revision": "29a9f2aca71c8afb07e291336f1789337ce235dd",
          "version": null
        }
      },
      {
        "package": "swift-nio-zlib-support",
        "repositoryURL": "https://github.com/apple/swift-nio-zlib-support.git",
        "state": {
          "branch": null,
          "revision": "37760e9a52030bb9011972c5213c3350fa9d41fd",
          "version": "1.0.0"
        }
      }
    ]
  },
  "version": 1
}

Swift version 4.2.3 (swift-4.2.3-RELEASE), Target: x86_64-unknown-linux-gnu
Linux us-san-gate0 4.14.91.mptcp #12 SMP Wed Jan 2 17:51:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

P.S. The same echo server implemented with C++ ASIO does not have this problem. I can supply the C++ sources & a video if needed.

allright avatar Mar 10 '19 20:03 allright

I have just profiled what is happening at the "stall moment".

You can open perf-kernel.svg in any browser to look at the performance graph: Perf.zip

Too many objects being released at the same moment blocks the event loop. Can we fix it? Possible workarounds:

  1. Is it possible to schedule 50% of the event loop time for handling all events except releasing objects, and 50% for the other tasks? Maybe we need something like a managed garbage collector (or a "smooth object release manager", maybe something like DisposeBag in RxSwift?).

  2. Schedule the channel release at a random time after the client closes the socket?

  3. Closing 25000 connections on one thread causes a 30-second hang, but if I create 4 EventLoops, telnet hangs only for 7.5 seconds. So no more than ~6000 connections per event loop seems workable (a sketch of this follows after the list).
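
A rough sketch of workaround 3, just spreading the connections over more event loop threads (labels shown in NIO 1 style; newer NIO renamed numThreads: to numberOfThreads: and add(handler:) to addHandler(_:)):

import NIO

// More event loops so each one has fewer channels to tear down at once.
let group = MultiThreadedEventLoopGroup(numThreads: 4)
let bootstrap = ServerBootstrap(group: group)
    .serverChannelOption(ChannelOptions.backlog, value: 256)
    .childChannelInitializer { channel in
        // EchoHandler as sketched earlier in this issue
        channel.pipeline.add(handler: EchoHandler())
    }

let serverChannel = try bootstrap.bind(host: "0.0.0.0", port: 8888).wait()
try serverChannel.closeFuture.wait()   // run until the server channel closes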

Tools used for perf monitoring: http://www.brendangregg.com/perf.html http://www.brendangregg.com/perf.html#TimedProfiling

(screenshot: 2019-03-11_10-12-10)

allright avatar Mar 11 '19 07:03 allright

ouch, thanks @allright , we'll look into that

weissi avatar Mar 11 '19 10:03 weissi

One more possible design is to provide a FAST custom allocator/deallocator (like in C++'s std) for promises, one that has preallocated memory and does not actually call malloc/free every time an object is deallocated, or calls them once for a big group of objects. So my idea is to group allocations/deallocations: 1 alloc for 1000 promises, or 1 alloc/dealloc per second. We could attach this custom allocator/deallocator to each EventLoop.

Another possible design is an object reuse pool. It can preallocate many of the needed objects at app start & deallocate them only at app stop, or manage that automatically. A real server application is usually tuned in place for the maximum possible connections/speed, so we do not need real alloc/dealloc during the app's lifetime (only at start/stop).
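
A very rough sketch of what I mean by a reuse pool (every name here is made up, nothing from NIO):

// Per-event-loop reuse pool sketch; entirely hypothetical.
final class ReusePool<T> {
    private var free: [T] = []
    private let makeNew: () -> T
    private let reset: (T) -> Void

    init(capacity: Int, make: @escaping () -> T, reset: @escaping (T) -> Void) {
        self.makeNew = make
        self.reset = reset
        self.free.reserveCapacity(capacity)
    }

    // Hand out a recycled object if we have one, otherwise allocate for real.
    func acquire() -> T {
        return free.popLast() ?? makeNew()
    }

    // Reinitialise the object and keep it around instead of letting it deallocate.
    func release(_ object: T) {
        reset(object)
        free.append(object)
    }
}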

@weissi What do you think?

allright avatar Mar 12 '19 03:03 allright

@allright Swift unfortunately doesn’t let you choose the allocator. It will always use malloc. Also from your profile it seems to be the reference counting rather than the allocations, right?

weissi avatar Mar 12 '19 09:03 weissi

@weissi Yes, I think it is reference counting. We can still use special factories (object reuse pools) for lightweight create/free of objects, attached to each EventLoop. In these pools an object does not really have to be deallocated, just reinitialised before its next use.

allright avatar Mar 12 '19 09:03 allright

@weissi Yes, I think it is reference counting. We can still use special factories (object reuse pools) for lightweight create/free of objects, attached to each EventLoop. In these pools an object does not really have to be deallocated, just reinitialised before its next use.

but the reference counting operations are inserted automatically by the Swift compiler. They happen when something is used. Let's say you write this

func someFunction(_ foo: MyClass) { ... }

let object = MyClass()
someFunction(object)
object.doSomething()

then the Swift compiler might emit code like this:

let object = MyClass() // allocates it with reference count 1
object.retain() // ref count + 1, to pass it to someFunction
someFunction(object)
object.retain() // ref count + 1, for the .doSomething call
object.doSomething()
object.release() // ref count - 1, because we're out of someFunction again
object.release() // ref count - 1, because we're done with .doSomething
object.release() // ref count - 1, because we no longer need `object`

certain reference counting operations can be optimised away, but generally Swift is very noisy with ref counting operations and we can't remove them with object pools.

weissi avatar Mar 12 '19 10:03 weissi

Yes, not all of them. But, for example, channel handlers could be allocated/deallocated using a factory:

let chFactory = currentEventLoop().getFactory() // or createFactoryWithCapacity(1000)
let channelHandler = chFactory.createChannelHandler() // real allocation here (or taken from the preallocated ones)

// use channelHandler
chFactory.release(channelHandler) // tell chFactory that this channelHandler may be reused

// no release or retain here!
let reusedChannelHandler = chFactory.createChannelHandler() // reinitialised channel handler

So the same approach could be used for every object like Promise, etc.

allright avatar Mar 12 '19 12:03 allright

@allright sure, you could even implement this today for ChannelHandlers. The problem is the number of reference count operations will be the same.

weissi avatar Mar 12 '19 12:03 weissi

Yes, the number of operations is the same, but the moment they happen is not. We could do these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already established connections).

Really, how should the swift-nio framework be used on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.

allright avatar Mar 12 '19 12:03 allright

Yes, the number of operations is the same, but the moment they happen is not.

That's not totally accurate. If you take a handler out of a pipeline, reference counts will change whether that handler will be re-used or not. Sure, if they are re-used, then you don't need to deallocate, which would cause even more reference count decreases.

We could do these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already established connections).

Really, how should the swift-nio framework be used on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.

Totally agreed. I'm just saying that caching your handlers (which you can do today, you don't need anything from NIO) won't remove all reference count traffic when tearing down the pipeline.

weissi avatar Mar 12 '19 12:03 weissi

I see. Let's try to fix what we can and test!) Even preventing massive deallocations will improve performance.

allright avatar Mar 12 '19 12:03 allright

Also, judging by the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject - I think that means allocation, not just a reference count increment.

So let's optimise alloc/dealloc speed with reuse pools or in some other way.

allright avatar Mar 12 '19 12:03 allright

I see. Let's try to fix what we can and test!)

What you could do is store a thread local NIOThreadLocal<CircularBuffer<MyHandler>> on every event loop. Then you can

let threadLocalMyHandlers = NIOThreadLocal<CircularBuffer<MyHandler>>(value: .init(capacity: 32))
extension EventLoop {
    func makeMyHandler() -> MyHandler {
        if threadLocalMyHandlers.value.count > 0 {
            return threadLocalMyHandlers.value.removeFirst()
        } else {
            return MyHandler()
        }
    }
}

and in MyHandler:

func handlerRemoved(context: ChannelHandlerContext) {
    self.resetMyState()
    threadLocalMyHandlers.value.append(self)
}

(code not tested or compiled, just as an idea)
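
To wire that into the pipeline it could look roughly like this (again untested; NIO 1 pipeline naming shown, newer NIO renamed add(handler:) to addHandler(_:)):

import NIO

let group = MultiThreadedEventLoopGroup(numThreads: System.coreCount)
let bootstrap = ServerBootstrap(group: group)
    .childChannelInitializer { channel in
        // take a pooled handler from the channel's own event loop instead of allocating a fresh one
        channel.pipeline.add(handler: channel.eventLoop.makeMyHandler())
    }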

Even preventing massive deallocations will improve performance.

agreed

weissi avatar Mar 12 '19 12:03 weissi

good idea) will test later)

allright avatar Mar 12 '19 12:03 allright

Also, judging by the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject - I think that means allocation, not just a reference count increment.

__swift_retain_...HeapObject is also just an increment of the reference count. Allocation is swift_alloc, swift_allocObject and swift_slowAlloc.

The reason HeapObject is in the symbol name of __swift_retain...HeapObject is because it's written in C++ and in C++ the parameter types are name-mangled into the symbol name.

weissi avatar Mar 12 '19 13:03 weissi

Also, judging by the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject - I think that means allocation, not just a reference count increment.

__swift_retain_...HeapObject is also just an increment of the reference count. Allocation is swift_alloc, swift_allocObject and swift_slowAlloc.

hm ...

allright avatar Mar 12 '19 13:03 allright

CircularBuffer<MyHandler>

I have just tested this, but it is not enough (a lot of Promises cause retain/release), and those promises would have to be reused too. But I figured out that the stalls happen while handlerRemoved is called massively. So I think the best solution would be to automatically spread the invokeHandlerRemoved() ... calls out over time. For example, no more than 100 invokeHandlerRemoved() invocations per second, depending on CPU performance. Maybe add a special deferred queue for the invokeHandlerRemoved() calls? It would be a smart garbage collector per EventLoop. @weissi is it possible to apply this workaround?
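
Something like this sketch is what I have in mind; only EventLoop.scheduleTask(in:_:) is real NIO API, everything else (names, numbers) is made up:

import NIO

// Spread deferred teardown work over time on one event loop.
// Meant to be used from the event loop's own thread only, so no locking here.
final class DeferredTeardownQueue {
    private var pending: [() -> Void] = []
    private var drainScheduled = false
    private let eventLoop: EventLoop
    private let batchSize: Int

    init(eventLoop: EventLoop, batchSize: Int = 100) {
        self.eventLoop = eventLoop
        self.batchSize = batchSize
    }

    func enqueue(_ work: @escaping () -> Void) {
        pending.append(work)
        scheduleDrainIfNeeded()
    }

    private func scheduleDrainIfNeeded() {
        guard !drainScheduled && !pending.isEmpty else { return }
        drainScheduled = true
        // Run at most `batchSize` pieces of teardown work per slice, then hand the
        // event loop back to I/O before draining the rest.
        _ = eventLoop.scheduleTask(in: .milliseconds(10)) {
            self.drainScheduled = false
            let batch = self.pending.prefix(self.batchSize)
            self.pending.removeFirst(batch.count)
            for work in batch {
                work()
            }
            self.scheduleDrainIfNeeded()
        }
    }
}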

allright avatar Mar 13 '19 10:03 allright

handlerRemoved is invoked in a separate event loop tick by way of a deferred event loop execute call. Netty provides a hook to limit the number of outstanding tasks that execute in any event loop tick. While I don't know if we want the exact same API, we may want to investigate whether we should provide tools to prevent scheduled tasks from starving I/O operations.

Lukasa avatar Mar 13 '19 10:03 Lukasa

"limit the number of outstanding tasks that execute in any event loop tick" Yes, EventLoop mechanics means that every operation is very small. And only prioritisation can help in this case. I think it is good Idea. Two not dependent ways for optimise:

  1. Reuse objects (all promises and channel handlers must be reused to prevent massive alloc/dealloc).
  2. Prioritisation (one possible implementation is limiting the number of low-priority tasks per tick).

allright avatar Mar 13 '19 10:03 allright

In the real world we have limited resources on a server. A simple example: 1 CPU core + 1 GB RAM (which may cover up to 100000 TCP connections or 20000 SSL ones). So a real server will be tuned for and limited to a maximum number of connections due to RAM & CPU limitations. And.....

The server does not need dynamic memory allocation/deallocation during processing. The swift-nio pipeline is:

EchoHandler() -> BackPressureHandler() -> IdleStateHandler() -> ... some other low-level handlers like TCP etc. We can preallocate and reuse 100000 pipelines with everything they need, not only the handlers but all the promises too:

EchoHandler: 100000
BackPressureHandler: 100000
IdleStateHandler: 100000
Promise: 10 * 100000 = 1000000

This completely solves our problem - no massive allocations/deallocations during processing.
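
As a sketch, using the hypothetical ReusePool from earlier in this thread, start-up preallocation would look roughly like this (all names except EchoHandler are made up):

// Fill the pools once at boot so steady-state traffic never hits malloc/free.
let echoHandlerPool = ReusePool<EchoHandler>(capacity: 100_000,
                                             make: { EchoHandler() },
                                             reset: { _ in /* reinitialise handler state here */ })
for _ in 0..<100_000 {
    echoHandlerPool.release(EchoHandler())   // prefill; later acquire() just reuses these
}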

Possible steps to implement:

  1. Move the ownership of all Promises into a common base ChannelHandler class.
  2. Make a factory interface for creating & reinitialising ChannelHandlers in a ReusePool. It may be easier to reuse the whole ChannelPipeline object (I have not taken a deep dive into the source code yet).

P.S. I have also run into slower acceptance of incoming TCP connections compared with C++ boost::asio. I think the reason is slow memory allocation.

allright avatar Mar 14 '19 06:03 allright

I have hit an issue using Vapor, which is based on SwiftNIO (https://github.com/vapor/vapor/issues/1963). I guess it belongs to this issue. Does any workaround exist?

AnyCPU avatar Apr 23 '19 15:04 AnyCPU

@AnyCPU your issue isn't related to this.

weissi avatar Apr 23 '19 15:04 weissi

@weissi is it related to SwiftNIO?

AnyCPU avatar Apr 23 '19 16:04 AnyCPU

is it related to SwiftNIO?

I don't think so but we'd need more information to be 100% sure. Let's discuss this on the Vapor issue tracker.

weissi avatar Apr 23 '19 17:04 weissi

I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way out is to preallocate and defer the deallocation of resources during processing. I think we also have a problem with accepting a lot of new incoming TCP connections at once, compared with ASIO (a C++ library).

allright avatar Apr 23 '19 17:04 allright

I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way out is to preallocate and defer the deallocation of resources during processing.

Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.

weissi avatar Apr 23 '19 17:04 weissi

@AnyCPU your issue isn't related to this.

I recommend you try a workaround: create more threads (approximately no more than 5000 connections per thread).

allright avatar Apr 23 '19 17:04 allright

I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way out is to preallocate and defer the deallocation of resources during processing.

Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.

  1. malloc, retainCount++: 0 -> 1 (a lot of time)
  2. retainCount++: 1 -> 2
  3. retainCount++: 2 -> 3 ....
  4. retainCount--: 3 -> 2
  5. retainCount--: 2 -> 1
  6. retainCount--, free(): 1 -> 0 (a lot of time)

I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free do take a lot of time.

Could you test this hypothesis?

allright avatar Apr 23 '19 17:04 allright

I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free do take a lot of time.

It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++

weissi avatar Apr 23 '19 17:04 weissi

I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free do take a lot of time.

It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++

Could you provide a link to the implementation?

allright avatar Apr 23 '19 17:04 allright