100% CPU usage using cluster and mariasql, when a cluster worker exits and is restarted
Hi @mscdex,
Sorry for the long post but I'm trying to give you as much information as I possibly can. I have done some snooping around, so I'm including everything that I've discovered.
In my Node.js app I'm using the cluster module to utilize my multi-core CPU. I'm using the node-mariasql library to communicate with my database. Since node-mariasql does not support pooling, we are using the third-party generic-pool library to maintain a pool of connections.
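For context, a minimal sketch of the cluster setup described above (simplified, not the exact code from node-simplex):

```js
var cluster = require('cluster');
var os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core.
  os.cpus().forEach(function () {
    cluster.fork();
  });

  // When a worker dies (e.g. after an uncaught exception), fork a replacement.
  cluster.on('exit', function (worker, code, signal) {
    console.log('Worker ' + worker.process.pid + ' exited, forking a replacement');
    cluster.fork();
  });
} else {
  // Worker: start the HTTP server and its own MySQL connection pool here.
}
```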
We've noticed that our CPU usage goes up to 100% whenever a connection in the master thread is closed after an uncaught exception causes one of the child workers to restart.
Whenever a child worker restarts, we destroy all MySQL connections.
- Node version - v4.2.2
- MariaDB version - v10.0.15
- node-mariasql version - v0.2.5
Reproducible code - https://github.com/bsurendrakumar/node-simplex/
The minimum number of connections in the pool is set to 5. All of its configuration can be found here. So when the server starts, generic-pool will kick off 5 connections to MySQL and keep them in its pool.
The idleTimeout for an object in the pool has been set to 120 seconds. This means that if there are more than 5 objects in the pool (5 being the minimum) and one of them has not been used for the last 120 seconds, it will be destroyed.
At server startup, we're making a simple call to our country model to fetch the list of countries. This code is here. This establishes a new connection to the database, so now there'll be 6 SQL connections in the pool, one of which will get cleaned up after 120 seconds.
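For reference, a rough sketch of the pool configuration described above, assuming generic-pool v2's callback API and mariasql 0.2.x's event-based interface (connection parameters and the max value here are illustrative, not the repo's actual config):

```js
var poolModule = require('generic-pool');
var Client = require('mariasql');

var pool = poolModule.Pool({
  name: 'mariasql',
  min: 5,                     // always keep at least 5 connections open
  max: 20,                    // illustrative upper bound
  idleTimeoutMillis: 120000,  // idle connections above `min` are destroyed after 120s
  create: function (callback) {
    var client = new Client();
    client.connect({ host: '127.0.0.1', user: 'user', password: 'pass', db: 'mydb' });
    client
      .on('connect', function () { callback(null, client); })
      .on('error', function (err) { callback(err, null); });
  },
  destroy: function (client) {
    client.end();
  }
});
```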
Following is the step-by-step process that leads us to believe the issue lies in our usage of the mariasql library -
- When the server is started, we are logging the process ID to the console. Grab the master process ID, for example - 20584.
- Look at the file descriptors being used by the process with ls -l /proc/20584/fd. Make a note of the socket connections. The output will look something like this -
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 12 -> socket:[2469914]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 13 -> socket:[2469917]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 14 -> socket:[2468106]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 15 -> socket:[2468109]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 17 -> socket:[2467206]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 18 -> socket:[2467208]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 19 -> socket:[2467210]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 2 -> /dev/tty
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 20 -> socket:[2467212]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 21 -> socket:[2467214]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 22 -> socket:[2467306]
- Copy a few of the socket numbers, for example 2467212, and run lsof | grep 2467212. You'll notice that these are connections to the MySQL server. The output should look something like this -
node 20584 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20585 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20586 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20587 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20588 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
- Crash the server by going to http://127.0.0.1:3002/api/v1/country/list. This will crash one of the child processes. Whenever an uncaught exception occurs, we do some cleanup and exit, and then fork another process to take the place of the one that was just killed (a sketch of this handler is included after these steps). Cleanup includes -
- Closing http server
- Closing MySQL connections in generic pool
- Closing winston logger streams.
- Wait for the MySQL connection in the master thread to be closed. When this happens, we are writing a log to the console -
Destroying / ending master thread ID - 4984
- Check your CPU usage, you'll notice that one of the CPUs has shot up to 100%.
- Next run strace -o log.txt -eepoll_ctl,epoll_wait -p 20584. Note that you might need to install strace. This command logs all the epoll_ctl and epoll_wait system calls made by the Node.js process and puts them in a file named log.txt in the current working directory.
- Open the log.txt file, and you'll notice logs similar to these -
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 847) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 845) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 843) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
- The file descriptor here is 16, and if you correlate it with your earlier ls -l /proc/20584/fd and lsof | grep 2467212 output, you'll realize that this belongs to the MySQL connection that was just closed.
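As mentioned in the crash step above, here is a rough sketch of the worker-side cleanup we run on an uncaught exception (the actual handler lives in server.js in the repo; httpServer, pool and logger are stand-ins for our real objects, and the pool shutdown assumes generic-pool v2's drain/destroyAllNow API):

```js
process.on('uncaughtException', function (err) {
  console.error('Uncaught exception, shutting down worker', err);

  // 1. Stop accepting new HTTP requests.
  httpServer.close(function () {
    // 2. Drain the generic-pool and close every MySQL connection
    //    (client.end() is called via the pool's destroy handler).
    pool.drain(function () {
      pool.destroyAllNow();

      // 3. Close the winston logger streams, then exit so the master
      //    can fork a replacement worker.
      logger.close();
      process.exit(1);
    });
  });
});
```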
This leads us to believe that even after the connection to MySQL is released, a file descriptor is left hanging around that is still being polled. We've found various related threads across forums -
- https://github.com/joyent/libuv/issues/1099
- https://github.com/nodejs/node-v0.x-archive/issues/6271
- https://github.com/joyent/libuv/issues/826
@mscdex - I've updated the reproducible sample to have all the code in a single server.js file. It should now be much easier to trace and debug. Sorry, I should have done that earlier.
The repository is still the same - https://github.com/bsurendrakumar/node-simplex/
The 100% CPU usage issue does not happen with the node-mysql module. I've updated the code to use this module for comparison.
Please find this code here.
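For comparison, the node-mysql variant relies on that module's built-in pooling instead of generic-pool; roughly like this (connection parameters and the query are illustrative):

```js
var mysql = require('mysql');

var pool = mysql.createPool({
  host: '127.0.0.1',
  user: 'user',
  password: 'pass',
  database: 'mydb',
  connectionLimit: 10
});

// The pool handles acquiring and releasing the underlying connection.
pool.query('SELECT 1 + 1 AS two', function (err, rows) {
  if (err) throw err;
  console.log(rows);
});
```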
Sorry all, I haven't had time to look into this just yet.
Hi @mscdex,
Did you get a chance to look into this issue?
Can you verify that this still happens with mariasql v0.2.6?
Hi @mscdex
I updated the version to 0.2.6, but I'm still seeing 100% CPU usage.
Thanks Surendra Kumar B
Hi @mscdex
We are still facing this issue with 0.2.6.
Could you please look into this issue when you have some time? Sorry for troubling you repeatedly about this over such a long period.
Thanks Surendra Kumar B
@mscdex - Did you get a chance to look at this? Do you need anything else from us on this one?