mimeograph
nodejs process seems to hang on coffee-resque
I am not certain what is causing this, but mimeo seems to hang at some point after coffee-resque has rewritten the process title.
I can't get the complete process title (even when I cat /proc/<pid>/stat or /proc/<pid>/cmdline). The truncated title I get from top & /proc/*/stat is "resque-0.1.4: S". This appears to correspond with the following line in coffee-resque:
@procline "Sleeping for #{@conn.timeout/1000}s"
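For context, here is a minimal sketch of what a procline-style helper presumably does (illustrative only, not coffee-resque's actual source; the identifiers are made up):

// Illustrative sketch -- not coffee-resque's actual source.
// A procline-style helper prefixes a status message with the worker
// name/version and assigns it to process.title.
function procline(status) {
  // process.title is backed by the process's argv memory, so the OS may
  // truncate long titles -- which would explain only ever seeing
  // "resque-0.1.4: S" in top and /proc/*/stat.
  process.title = 'resque-0.1.4: ' + status;
}

procline('Sleeping for 5s');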
Here is a snapshot of the nodejs process from top:
PID   USER     PR NI VIRT  RES  SHR  S %CPU %MEM TIME+     COMMAND
29356 mimeogra 20 0  1034m 156m 5416 R 99.7 4.1  251:03.85 resque-0.1.4: S
I need to determine what tools I can use to inspect node at runtime and/or add additional logging output.
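One low-tech option I may try (a sketch, not code that exists in mimeo yet): a heartbeat timer plus named checkpoints. If the event loop is truly blocked, the heartbeat stops firing; if it keeps firing, the loop is still turning and the process is just busy.

var lastCheckpoint = 'startup';

// Call checkpoint('before-redis-write'), etc. at interesting spots.
function checkpoint(name) {
  lastCheckpoint = name;
}

// If these lines stop appearing in the log, the event loop is blocked;
// if they keep appearing, something is merely hogging CPU between ticks.
setInterval(function () {
  console.error('[heartbeat] alive; last checkpoint: ' + lastCheckpoint);
}, 5000);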
So is it that you can't get the complete process title, or is the node process hanging? The distinction is important. Don't rely on the process title from coffee-resque for monitoring purposes.
I am not certain what is causing node to hang. The process hangs sometime after coffee-resque sets the process title to "resque-0.1.4: S" and before something else resets it. However, that covers a lot of possible code.
I am going to try to get the node 0.4.8 debugger working and see if I can get a list/trace of what code nodejs is stuck on, then reevaluate. Something to consider is that we are still on an ancient version of nodejs. This problem may be caused by something that has long since been resolved in V8 or node ... although my money is on something silly I did.
Carlos
Adding this here because until 20 minutes ago I didn't know anything about debugging a nodejs app. I had some difficulty connecting with & getting good info from the stock nodejs debugger. However, node-inspector was pretty easy to use. I will attach it to mimeo the next time it hangs ... see if I can get a better sense of what the hell it is locked up on.
OK. So this just happened again.
The call stack that node-inspector shows goes back 5 frames. It looks like nodejs is waiting on IO.
The call stack is as follows:
Socket._onBufferChange   net.js:505
Socket._writeOut         net.js:444
Socket._flush            net.js:529
Socket._onWriteable      net.js:609
(anonymous function)     net.js:188
I suspect it is attempting to write to redis ... however, when I went to inspect the socket object to confirm this, mimeo crashed. I suspect that redis didn't like waiting around while I had the process paused. I also have no idea what is being written. Is it resque writing to redis, mimeo writing a giant doc, etc.?
I will take another look at this the next time it happens.
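To figure out what is being written, one idea for next time (a sketch under the assumption that the hot socket goes through net.Socket.prototype.write; nothing here is existing mimeo code):

var net = require('net');
var originalWrite = net.Socket.prototype.write;

// Log any unusually large outgoing chunk so resque's small control
// traffic can be told apart from a giant document body.
net.Socket.prototype.write = function (data) {
  var len = data ? data.length : 0;
  if (len > 1024 * 1024) {
    console.error('large socket write: ' + len + ' bytes to ' + this.remoteAddress);
  }
  return originalWrite.apply(this, arguments);
};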
Steps to connect debugger:
1. get the pid for mimeo
2. kill -s USR1 <pid>
3. connect a webkit browser to localhost:8080
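For the record, in node 0.4.x SIGUSR1 tells the process to start its built-in debug agent (port 5858), and node-inspector bridges that to the browser UI on localhost:8080. The signal can also be sent from node itself, equivalent to the kill above:

// Equivalent of `kill -s USR1 <pid>` from node; pass the mimeo pid as
// the first argument: node send-usr1.js <pid>
var pid = parseInt(process.argv[2], 10);
process.kill(pid, 'SIGUSR1'); // despite the name, this only sends a signal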
Carlos
Some observations:
==where node is locking up== The app is spending its time inside of onWritable() in net.js. This is the callback attached to the IOWatcher assigned to the socket's _writeWatcher property. The socket associated with this is connected to redis. So ... it seems to be taking an unbelievable amount of time and CPU to transfer data to redis.
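From user code, the same condition can be observed without digging into net.js internals: socket.write() returns false when data had to be buffered in user space, and 'drain' fires once the kernel accepts it. A sketch (redis-host is made up):

var net = require('net');
var socket = net.createConnection(6379, 'redis-host'); // hypothetical host

function writeAndWatch(buf) {
  var start = Date.now();
  // write() returning false means the data was queued rather than flushed;
  // a slow peer keeps node in exactly the writeWatcher/onWritable cycle
  // described above while the buffer drains.
  if (!socket.write(buf)) {
    socket.once('drain', function () {
      console.error('drained ' + buf.length + ' bytes in ' + (Date.now() - start) + 'ms');
    });
  }
}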
==slow data transfer with redis== Using iptraf between various hosts running mimeo and the redis host, it appears that when this condition occurs mimeo is able to transfer very little data to/from the redis host. During this same time, other mimeo hosts are able to transfer data with redis very quickly.
example:
- node that is spending a lot of time in IOWatcher.onWriteable: incoming/outgoing rates in the tens of kbits per sec
- node that is flying through jobs: incoming/outgoing rates consistently in the hundreds of kbits per sec, with bursts into the tens of thousands of kbits per sec
During this time the host with the hanging mimeo is able to rapidly transfer files to/from other hosts using scp & wget. Restarting mimeo (which appears to require a kill -9) returns performance and data transfer rates to normal.
I have a bunch of data - just no answer to this issue. I probably need to get a better understanding of the nodejs event loop and how IOWatcher plays into it.
Anecdotal evidence indicates that this occurs when setting a giant string into redis. This can occur when writing the final stitched pdf back into redis. In some cases the source PDFs are hundreds of pages long and approach 25MB in size.
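If so, it should be reproducible in isolation. A repro sketch, assuming the node_redis client (host and key are made up):

var redis = require('redis');
var client = redis.createClient(6379, 'redis-host');

// Build a ~25MB string, roughly the size of a large stitched pdf.
var big = new Array(25 * 1024 * 1024 + 1).join('x');

var start = Date.now();
client.set('repro:big-doc', big, function (err) {
  if (err) throw err;
  console.log('SET of ' + big.length + ' bytes took ' + (Date.now() - start) + 'ms');
  client.quit();
});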