crawlera-tools
Possibly unclean thread termination
Alain Quenneville saw the following exception (running Python 2.7.3 on Ubuntu 12.04.3 LTS):
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 504, in run
File "./crawlera-bench", line 30, in worker_request
File "/usr/lib/python2.7/Queue.py", line 168, in get
File "/usr/lib/python2.7/threading.py", line 236, in wait
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
This probably comes from the fact that the worker threads are daemonized.
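For context, the failing pattern looks roughly like the sketch below. worker_request is the function named in the traceback; task_queue and the request body are illustrative assumptions, not the actual crawlera-bench source:

import threading
import Queue  # Python 2 module, as in the traceback

task_queue = Queue.Queue()

def worker_request():
    while True:
        url = task_queue.get()  # blocks inside threading.Condition.wait()
        # ... perform the request for url ...
        task_queue.task_done()

for _ in range(2):
    t = threading.Thread(target=worker_request)
    t.daemon = True  # daemon threads are torn down abruptly at interpreter exit
    t.start()

If the main thread exits while a daemon worker is still blocked in Queue.get(), interpreter shutdown sets module globals to None, and the worker's next call into threading raises the "'NoneType' object is not callable" TypeError seen above.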
From Alain's other email:
Yes, Crawlera sends 2 requests and then hits an exception. I do not get a chance to stop it.
Here is the log trace of the execution:
Host : proxy.crawlera.com
Concurrency : 2
Timeout : 120 sec
Report interval : 1 sec
Unit : requests per 1 sec
time netloc all 2xx 3xx 4xx 5xx 503 t/o err | minw maxw
2014-09-01 10:09:00 www.supersoccer.co.uk 0 0 0 0 0 0 0 0 | 0.000 0.000
2014-09-01 10:09:01 www.supersoccer.co.uk 0 0 0 0 0 0 0 0 | 0.000 0.000
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 504, in run
File "./crawlera-bench", line 30, in worker_request
File "/usr/lib/python2.7/Queue.py", line 168, in get
File "/usr/lib/python2.7/threading.py", line 236, in wait
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Here is the script I use to install and run:
wget https://raw.githubusercontent.com/scrapinghub/crawlera-tools/master/crawlera-bench
chmod a+x crawlera-bench
echo "http://www.supersoccer.co.uk/events/horses1.asp" >> urls.txt
apt-get -y install python python-pip
pip install requests
./crawlera-bench urls.txt -u <USR> -p <PWD> -c 10
This looks like http://bugs.python.org/issue14623,
and we probably need to add a time.sleep(1)
as a workaround.
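As a sketch, the workaround would be a short pause at the very end of the run (the exact placement in crawlera-bench is an assumption):

import time

# ... benchmark is done; daemon workers may still be blocked in Queue.get() ...
time.sleep(1)  # give them a moment before interpreter shutdown begins
               # (workaround for http://bugs.python.org/issue14623)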
One thing I noticed is that we exit using os._exit(1).
From the Python docs: os._exit(n) exits the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.
I stumbled on some issues with bench when running Crawlera benchmarks that take a bit longer: the fab task using crawlera-bench seems to get stuck. I think there are two problems here. First, exiting with status 1 means we are signalling an error on Unix, so fab or any other script that relies on exit codes gets confused. Second, we are not doing any kind of cleanup of the threads that are running; we just interrupt the whole thing in the middle.
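A hedged sketch of what a cleaner shutdown could look like (the sentinel approach and the names here are illustrative, not the current crawlera-bench code): push one sentinel per worker, join them, and exit with status 0 on success so callers like fab see a normal exit.

import sys
import Queue
import threading

SENTINEL = None

def worker_request(task_queue):
    while True:
        url = task_queue.get()
        if url is SENTINEL:  # clean stop signal instead of killing the process
            break
        # ... perform the request for url ...

task_queue = Queue.Queue()
workers = [threading.Thread(target=worker_request, args=(task_queue,))
           for _ in range(2)]
for t in workers:
    t.start()
# ... enqueue the benchmark URLs here ...
for _ in workers:
    task_queue.put(SENTINEL)  # one sentinel per worker
for t in workers:
    t.join()   # wait for workers to drain instead of os._exit(1)
sys.exit(0)    # 0 = success on Unix, so exit-code checks pass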
As the code is currently written, it seems difficult to refactor all this so that it stops when it should. Perhaps we could just migrate the script to an async library, e.g. gevent? It would give us the same performance without the thread-related problems.
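For illustration, a gevent-based version could look something like this (assuming gevent is installed; fetch and the stats handling are simplified placeholders):

from gevent.pool import Pool

def fetch(url):
    # ... issue the request through the Crawlera proxy, record stats ...
    pass

pool = Pool(10)  # plays the role of the -c concurrency option
for url in open('urls.txt'):
    pool.spawn(fetch, url.strip())
pool.join()  # returns only when every greenlet is done: no dangling threads

Because all greenlets run in a single OS thread and pool.join() blocks until everything has finished, the script can simply fall off the end of main and exit cleanly, with no daemon threads to interrupt.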