High cpu_system with many open connections
Reincarnation of https://github.com/grobian/carbon-c-relay/issues/216
On heavy-load relays (15 000 persistent connections, 25 million metrics/minute) we have some problems: timeouts when establishing new connections and very high CPU usage.
The current dispatcher code has some problems:
- it reads from all sockets (without poll)
- it does a full scan of the connections table whenever a new connection is established
- locks are held for a long time when the connections table is resized (the memory buffers need a realloc)

I did some refactoring of the carbon-c-relay dispatcher code (switched the dispatcher to libevent and refactored the connections table). This reduces CPU and memory usage, so we can process more connections on the same hardware.
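For context, below is a minimal sketch of what an event-driven dispatcher looks like with libevent. This is not the actual code from my branch, only an illustration of the general shape: the kernel reports which sockets are readable instead of the dispatcher scanning every connection (the port 2003 and the callbacks are placeholders):

```c
/* Minimal libevent dispatcher sketch (illustration only, not the patch):
 * each accepted connection gets a bufferevent, and read_cb fires only for
 * sockets that actually have data, so idle connections cost nothing. */
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/bufferevent.h>
#include <netinet/in.h>
#include <string.h>
#include <stdio.h>

static void read_cb(struct bufferevent *bev, void *ctx)
{
    char buf[4096];
    size_t n;
    /* a real relay would parse metric lines here */
    while ((n = bufferevent_read(bev, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, n, stdout);
}

static void event_cb(struct bufferevent *bev, short events, void *ctx)
{
    if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
        bufferevent_free(bev);  /* drop the connection on EOF/error */
}

static void accept_cb(struct evconnlistener *lst, evutil_socket_t fd,
                      struct sockaddr *addr, int socklen, void *ctx)
{
    struct event_base *base = ctx;
    struct bufferevent *bev =
        bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
    bufferevent_setcb(bev, read_cb, NULL, event_cb, NULL);
    bufferevent_enable(bev, EV_READ);
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(2003);  /* placeholder: carbon plaintext port */
    struct evconnlistener *lst = evconnlistener_new_bind(
        base, accept_cb, base,
        LEV_OPT_REUSEABLE | LEV_OPT_CLOSE_ON_FREE, -1,
        (struct sockaddr *)&sin, sizeof(sin));
    event_base_dispatch(base);   /* driven by epoll/kqueue under the hood */
    evconnlistener_free(lst);
    event_base_free(base);
    return 0;
}
```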
Some tests from my workstation (2000 connections, random delay 50-500 ms):

- master: TCP.CONNECT.OK 14807 (242/s), TCP.CONNECT.TIMEOUT 498811 (8177/s), TCP.SEND.OK 678226 (11118/s), TCP.SEND.RESET 7772 (127/s)
- libevent branch (https://github.com/msaf1980/carbon-c-relay/tree/libevent_pthread): TCP.CONNECT.OK 21049 (350/s), TCP.CONNECT.TIMEOUT 4261 (71/s), TCP.SEND.OK 4066794 (67779/s), TCP.SEND.RESET 1049 (17/s)

Some profiling output:
This is with libevent, right?
Yes, with libevent, with pthread locks enabled. It's simpler than communicating via pipes or sockets.
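For reference, enabling the pthread locks in libevent is essentially a one-time call before creating the event base (a minimal sketch, not the branch's actual code; it needs linking with -levent and -levent_pthreads):

```c
/* Sketch: turn on libevent's pthread-based locking so several threads
 * can safely touch the same event_base and its bufferevents. */
#include <event2/event.h>
#include <event2/thread.h>

int main(void)
{
    /* must be called before any event_base is created */
    if (evthread_use_pthreads() != 0)
        return 1;

    struct event_base *base = event_base_new();
    /* ... register events, start worker threads, run the loop ... */
    event_base_free(base);
    return 0;
}
```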
Hi, we plan to use carbon-c-relay. @grobian, is there any plan to merge the libevent patch soon?
@msaf1980 Can we just build a package from your work and use it?
I am a bit hesitant to take the patch(es) because I don't understand them. Mostly don't get why it performs better. That is, I see there is more than just libevent. Since I don't have the time to bring this all in, perhaps it's not a bad idea to let someone else take over and steer the direction for the relay?
what tool/utility produces measurements like this? Thanks
I use a simple stress test from https://github.com/msaf1980/carbontest It's not ideal, but it works for me.
But the main target is not raw performance; we need to reduce CPU usage and stably handle more than 10000 connections per server.
A poll-like model (libevent etc.) is effective with many connections (5000 and more), when not all connections are active at the same time. That is the real working mode for carbon-c-relay. Direct reads from all connections produce high CPU usage and are not stable at 10000 or more connections.
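To make the CPU point concrete, this is roughly what the read-everything pattern looks like (an illustration only, not carbon-c-relay code): with mostly idle connections, every pass spends one syscall per connection just to learn there is nothing to read, which shows up as cpu_system.

```c
/* Anti-pattern sketch: reading every non-blocking socket in a loop.
 * Each idle connection still costs a read() that returns EAGAIN. */
#include <errno.h>
#include <unistd.h>

void drain_all(int *fds, int nfds, void (*handle)(int fd, char *buf, ssize_t n))
{
    char buf[4096];
    for (;;) {                                       /* dispatcher loop */
        for (int i = 0; i < nfds; i++) {
            ssize_t n = read(fds[i], buf, sizeof(buf));
            if (n > 0)
                handle(fds[i], buf, n);              /* real work */
            else if (n < 0 && errno == EAGAIN)
                continue;                            /* idle: wasted syscall */
        }
        /* a micro-sleep here only trades CPU for latency */
    }
}
```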
In our environment the libevent version has been working well for 2 months without problems.
Yes, you can build a package from my branch and use it. But it's only needed on really busy relays.
No, I don't have plans to merge this. I switched from a micro-sleep spin-lock approach to a semaphore-based approach (basically using notifications), which doesn't require an external dependency (libevent). I need to find time to benchmark the throughput somehow and see if there are obvious blocks. I cannot support the libevent code.
How do you plan to detect the socket state (ready for read, idle, or connection hangup) with a semaphore? The traditional way to do this is to use event-driven pollers like poll, epoll (Linux), or kqueue (BSD). libevent is just a library and uses the platform-specific poller internally, without the need to maintain low-level platform-specific code.
poll() is already used to check which socket has work to do; the semaphore is used to wake up the worker threads.
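A rough sketch of that pattern, as I understand the description (my own illustration, not the relay's actual code): poll() stays responsible for readiness detection, and the worker threads sleep on a semaphore instead of spinning with a micro-sleep.

```c
/* Sketch: poll() detects readiness, a semaphore wakes the workers. */
#include <poll.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

#define QSIZE 1024

static int ready_q[QSIZE];            /* ring buffer of ready descriptors */
static int q_head, q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t q_sem;                   /* counts queued descriptors */

static void push_ready(int fd)
{
    pthread_mutex_lock(&q_lock);
    ready_q[q_tail] = fd;
    q_tail = (q_tail + 1) % QSIZE;
    pthread_mutex_unlock(&q_lock);
    sem_post(&q_sem);                 /* wake exactly one sleeping worker */
}

static int pop_ready(void)
{
    sem_wait(&q_sem);                 /* sleep until there is work */
    pthread_mutex_lock(&q_lock);
    int fd = ready_q[q_head];
    q_head = (q_head + 1) % QSIZE;
    pthread_mutex_unlock(&q_lock);
    return fd;
}

static void poll_once(struct pollfd *pfds, int nfds)
{
    if (poll(pfds, nfds, 1000) <= 0)  /* readiness detection stays here */
        return;
    for (int i = 0; i < nfds; i++)
        if (pfds[i].revents & (POLLIN | POLLHUP))
            push_ready(pfds[i].fd);
}

static void *worker(void *arg)
{
    (void)arg;
    char buf[4096];
    for (;;) {
        int fd = pop_ready();
        ssize_t n = read(fd, buf, sizeof(buf));  /* process metrics here */
        (void)n;
    }
    return NULL;
}

int main(void)
{
    sem_init(&q_sem, 0, 0);
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    /* in a real dispatcher, pfds holds every accepted client socket */
    struct pollfd pfds[1] = { { .fd = 0, .events = POLLIN } };
    for (;;)
        poll_once(pfds, 1);
}
```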
I might be wrong, but as far as I remember, the problem with the poll syscall under Linux is that it takes O(N) in the number of sockets your application has opened, while epoll() is O(1). On thousands of connections epoll would be much faster (which is kind of proven by the fork by @msaf1980, but at the cost of code complexity and portability if you decide to use it directly).
UPD: https://developpaper.com/in-depth-analysis-of-epoll/ - something like that; there were more detailed articles about it.
Or another article: https://idndx.com/2014/09/01/the-implementation-of-epoll-1/
Basically, it's highly discouraged to use poll, even on a single socket, when you have thousands of open sockets on Linux.
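For illustration, a minimal epoll loop looks like this (a Linux-only sketch, not relay code). The key point is that interest is registered once with epoll_ctl() and each epoll_wait() returns only the descriptors that actually became ready, so the per-wakeup cost doesn't grow with the total number of open sockets:

```c
/* Minimal epoll sketch: register interest once, then only ready fds
 * are returned, independent of how many sockets are registered. */
#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>

#define MAX_EVENTS 64

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd == -1)
        return 1;

    /* stdin stands in for a client socket here; a relay would add
     * every accepted connection exactly once */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = 0 };
    epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            char buf[4096];
            ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
            if (r <= 0)
                epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
            else
                fwrite(buf, 1, r, stdout);
        }
    }
}
```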
right, could look at using epoll()
Yes, that's right. epoll is Linux-specific; for BSD it's kqueue, for Solaris /dev/poll. If we need to be portable, all of them must be supported. That's not trivial and is more complex than using one library (which uses the platform-specific mechanism internally).
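As a small illustration of that point (a sketch, not relay code): the same libevent program picks the best backend available on the platform (epoll on Linux, kqueue on BSD/macOS, event ports or /dev/poll on Solaris) and can report which one it chose:

```c
/* Sketch: libevent selects the platform poller internally. */
#include <event2/event.h>
#include <stdio.h>

int main(void)
{
    struct event_base *base = event_base_new();
    if (base == NULL)
        return 1;
    /* prints e.g. "epoll" on Linux, "kqueue" on FreeBSD/macOS */
    printf("libevent %s, backend: %s\n",
           event_get_version(), event_base_get_method(base));
    event_base_free(base);
    return 0;
}
```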
this is a good point, I wasn't aware of that