100% CPU usage using cluster and mariasql, when a cluster worker exits and is restarted
Hi @mscdex,
Sorry for the long post but I'm trying to give you as much information as I possibly can. I have done some snooping around, so I'm including everything that I've discovered.
In my Node.js app I'm using the cluster module to utilize my multi-core CPU. I'm using the node-mariasql library to communicate with my database. Since node-mariasql does not support pooling, we are using the third-party generic-pool library to maintain a pool of connections.
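For context, a minimal sketch of the cluster setup described above (simplified, not the exact code from node-simplex):

```js
var cluster = require('cluster');
var os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core.
  os.cpus().forEach(function () {
    cluster.fork();
  });

  // When a worker dies (e.g. after an uncaught exception), fork a replacement.
  cluster.on('exit', function (worker, code, signal) {
    console.log('Worker ' + worker.process.pid + ' exited, forking a replacement');
    cluster.fork();
  });
} else {
  // Worker: start the HTTP server and its own MySQL connection pool here.
}
```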
We've noticed that our CPU usage goes up to 100% whenever a connection in the master thread is closed after an uncaught exception causes one of the child workers to restart.
Whenever a child worker restarts, we destroy all MySQL connections.
- Node version - v4.2.2
- MariaDB version - v10.0.15
- node-mariasql version - v0.2.5
Reproducible code - https://github.com/bsurendrakumar/node-simplex/
The minimum number of connections in the pool is set to 5. All of its configuration can be found here. So when the server starts, generic-pool will kick off 5 connections to MySQL and keep them in its pool.
The idleTimeout for an object in the pool has been set to 120 seconds. This means that if there are more than 5 objects in the pool (5 being the minimum) and one of them has not been used for the last 120 seconds, it will be destroyed.
At server startup, we're making a simple call to our country model to fetch the list of countries. This code is here. This establishes a new connection to the database, so now there'll be 6 SQL connections in the pool, one of which will get cleaned up after 120 seconds.
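For reference, a rough sketch of the pool configuration described above, assuming generic-pool v2's callback API and mariasql 0.2.x's event-based interface (connection parameters and the max value here are illustrative, not the repo's actual config):

```js
var poolModule = require('generic-pool');
var Client = require('mariasql');

var pool = poolModule.Pool({
  name: 'mariasql',
  min: 5,                     // always keep at least 5 connections open
  max: 20,                    // illustrative upper bound
  idleTimeoutMillis: 120000,  // idle connections above `min` are destroyed after 120s
  create: function (callback) {
    var client = new Client();
    client.connect({ host: '127.0.0.1', user: 'user', password: 'pass', db: 'mydb' });
    client
      .on('connect', function () { callback(null, client); })
      .on('error', function (err) { callback(err, null); });
  },
  destroy: function (client) {
    client.end();
  }
});
```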
Following is the step-by-step process that leads us to believe the issue lies in our usage of the mariasql library -
- When the server is started, we are logging the process ID to the console. Grab the master process ID, for example - 20584.
- Look at the file descriptors being used by the process with ls -l /proc/20584/fd. Make a note of the socket connections. The output will look something like this -
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 12 -> socket:[2469914]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 13 -> socket:[2469917]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 14 -> socket:[2468106]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 15 -> socket:[2468109]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 17 -> socket:[2467206]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 18 -> socket:[2467208]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 19 -> socket:[2467210]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 2 -> /dev/tty
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 20 -> socket:[2467212]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 21 -> socket:[2467214]
lrwx------ 1 abijeet abijeet 64 Jun 9 19:24 22 -> socket:[2467306]
- Copy a few of the socket numbers, for example 2467212, and run lsof | grep 2467212. You'll notice that these are connections to the MySQL server. The output should look something like this -
node 20584 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20585 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20586 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20587 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
V8 20584 20588 abijeet 20u IPv4 2467212 0t0 TCP localhost:57092->localhost:mysql (ESTABLISHED)
- Crash the server by going to http://127.0.0.1:3002/api/v1/country/list. This will crash one of the child processes. Whenever an uncaught exception occurs, we do some cleanup and exit, and then fork another process to take the place of the one that was just killed (a sketch of this handler is included after these steps). Cleanup includes -
- Closing http server
- Closing MySQL connections in generic pool
- Closing winston logger streams.
- Wait for the MySQL connection in the master thread to be closed. When this happens, we are writing a log to the console -
Destroying / ending master thread ID - 4984
- Check your CPU usage, you'll notice that one of the CPUs has shot up to 100%.
- Next run strace -o log.txt -eepoll_ctl,epoll_wait -p 20584. Note that you might need to install strace. This command logs all the epoll_ctl and epoll_wait system calls made by the Node.js process and puts them in a file named log.txt in the current working directory.
- Open the log.txt file, and you'll notice logs similar to these -
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 847) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 845) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
epoll_wait(5, {{EPOLLIN|EPOLLHUP, {u32=16, u64=16}}}, 1024, 843) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 16, 7ffe441aa850) = -1 EBADF (Bad file descriptor)
- The file descriptor here is 16, and if you correlate it with your earlier ls -l /proc/20584/fd and lsof | grep 2467212 output, you'll realize that this belongs to the MySQL connection that was just closed.
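As mentioned in the crash step above, here is a rough sketch of the worker-side cleanup we run on an uncaught exception (the actual handler lives in server.js in the repo; httpServer, pool and logger are stand-ins for our real objects, and the pool shutdown assumes generic-pool v2's drain/destroyAllNow API):

```js
process.on('uncaughtException', function (err) {
  console.error('Uncaught exception, shutting down worker', err);

  // 1. Stop accepting new HTTP requests.
  httpServer.close(function () {
    // 2. Drain the generic-pool and close every MySQL connection
    //    (client.end() is called via the pool's destroy handler).
    pool.drain(function () {
      pool.destroyAllNow();

      // 3. Close the winston logger streams, then exit so the master
      //    can fork a replacement worker.
      logger.close();
      process.exit(1);
    });
  });
});
```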
This leads us to believe that even after the connection to MySQL is released, a file descriptor is left hanging around that is still being polled. We've found various related threads across forums -
- https://github.com/joyent/libuv/issues/1099
- https://github.com/nodejs/node-v0.x-archive/issues/6271
- https://github.com/joyent/libuv/issues/826
@mscdex - I've updated the reproducible sample to have all the code in a single server.js file. It should now be much easier to trace and debug. Sorry, I should have done that earlier.
The repository is still the same - https://github.com/bsurendrakumar/node-simplex/
The 100% CPU usage issue does not happen with the node-mysql module. I've updated the code to use this module for comparison.
Please find this code here.
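For comparison, the node-mysql variant relies on that module's built-in pooling instead of generic-pool; roughly like this (connection parameters and the query are illustrative):

```js
var mysql = require('mysql');

var pool = mysql.createPool({
  host: '127.0.0.1',
  user: 'user',
  password: 'pass',
  database: 'mydb',
  connectionLimit: 10
});

// The pool handles acquiring and releasing the underlying connection.
pool.query('SELECT 1 + 1 AS two', function (err, rows) {
  if (err) throw err;
  console.log(rows);
});
```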
Sorry all, I haven't had time to look into this just yet.
Hi @mscdex,
Did you get a chance to look into this issue?
Can you verify that this still happens with mariasql v0.2.6?
Hi @mscdex
I updated the version to 0.2.6, but I'm still seeing 100% CPU usage.
Thanks Surendra Kumar B
Hi @mscdex
We are still facing this issue with 0.2.6.
Could you please look into this issue when you have some time? Sorry for troubling you repeatedly about this over such a long period.
Thanks Surendra Kumar B
@mscdex - Did you get a chance to look at this? Do you need anything else from us on this one?