crawlera-tools
Possibly unclean thread termination
Alain Quenneville saw the following exception (running Python 2.7.3 on Ubuntu 12.04.3 LTS):
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 504, in run
File "./crawlera-bench", line 30, in worker_request
File "/usr/lib/python2.7/Queue.py", line 168, in get
File "/usr/lib/python2.7/threading.py", line 236, in wait
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
This probably comes from the fact that the worker threads are daemonized.
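For context, the failing pattern looks roughly like the sketch below. worker_request is the function named in the traceback; task_queue and the request body are illustrative assumptions, not the actual crawlera-bench source:

import threading
import Queue  # Python 2 module, as in the traceback

task_queue = Queue.Queue()

def worker_request():
    while True:
        url = task_queue.get()  # blocks inside threading.Condition.wait()
        # ... perform the request for url ...
        task_queue.task_done()

for _ in range(2):
    t = threading.Thread(target=worker_request)
    t.daemon = True  # daemon threads are torn down abruptly at interpreter exit
    t.start()

If the main thread exits while a daemon worker is still blocked in Queue.get(), interpreter shutdown sets module globals to None, and the worker's next call into threading raises the "'NoneType' object is not callable" TypeError seen above.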
From Alain's other email:
Yes, Crawlera sends 2 requests and then hits an exception. I do not get a chance to stop it.
Here is the log trace of the execution:
Host : proxy.crawlera.com
Concurrency : 2
Timeout : 120 sec
Report interval : 1 sec
Unit : requests per 1 sec
time netloc all 2xx 3xx 4xx 5xx 503 t/o err | minw maxw
2014-09-01 10:09:00 www.supersoccer.co.uk 0 0 0 0 0 0 0 0 | 0.000 0.000
2014-09-01 10:09:01 www.supersoccer.co.uk 0 0 0 0 0 0 0 0 | 0.000 0.000
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 504, in run
File "./crawlera-bench", line 30, in worker_request
File "/usr/lib/python2.7/Queue.py", line 168, in get
File "/usr/lib/python2.7/threading.py", line 236, in wait
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Here is the script I use to install and run:
wget https://raw.githubusercontent.com/scrapinghub/crawlera-tools/master/crawlera-bench
chmod a+x crawlera-bench
echo "http://www.supersoccer.co.uk/events/horses1.asp" >> urls.txt
apt-get -y install python python-pip
pip install requests
./crawlera-bench urls.txt -u <USR> -p <PWD> -c 10
This looks like http://bugs.python.org/issue14623,
and we probably need to add a time.sleep(1)
as a workaround.
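As a sketch, the workaround would be a short pause at the very end of the run (the exact placement in crawlera-bench is an assumption):

import time

# ... benchmark is done; daemon workers may still be blocked in Queue.get() ...
time.sleep(1)  # give them a moment before interpreter shutdown begins
               # (workaround for http://bugs.python.org/issue14623)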
One thing I noticed is that we exit using os._exit(1).
From the Python docs: os._exit(n) exits the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.
I stumbled on some issues with bench when running Crawlera benchmarks that take a bit longer: the fab task using crawlera-bench seems to get stuck. I think there are two problems here. First, exiting with status 1 means we are signalling an error on Unix, so fab or any other script that relies on exit codes gets confused. Second, we are not doing any kind of cleanup of the threads that are running; we just interrupt the whole thing in the middle.
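A hedged sketch of what a cleaner shutdown could look like (the sentinel approach and the names here are illustrative, not the current crawlera-bench code): push one sentinel per worker, join them, and exit with status 0 on success so callers like fab see a normal exit.

import sys
import Queue
import threading

SENTINEL = None

def worker_request(task_queue):
    while True:
        url = task_queue.get()
        if url is SENTINEL:  # clean stop signal instead of killing the process
            break
        # ... perform the request for url ...

task_queue = Queue.Queue()
workers = [threading.Thread(target=worker_request, args=(task_queue,))
           for _ in range(2)]
for t in workers:
    t.start()
# ... enqueue the benchmark URLs here ...
for _ in workers:
    task_queue.put(SENTINEL)  # one sentinel per worker
for t in workers:
    t.join()   # wait for workers to drain instead of os._exit(1)
sys.exit(0)    # 0 = success on Unix, so exit-code checks pass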
As the code is currently written, it seems difficult to refactor all this so that it stops when it should. Perhaps we could just migrate the script to an async library, e.g. gevent? It would give us the same performance without the thread-related problems.
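For illustration, a gevent-based version could look something like this (assuming gevent is installed; fetch and the stats handling are simplified placeholders):

from gevent.pool import Pool

def fetch(url):
    # ... issue the request through the Crawlera proxy, record stats ...
    pass

pool = Pool(10)  # plays the role of the -c concurrency option
for url in open('urls.txt'):
    pool.spawn(fetch, url.strip())
pool.join()  # returns only when every greenlet is done: no dangling threads

Because all greenlets run in a single OS thread and pool.join() blocks until everything has finished, the script can simply fall off the end of main and exit cleanly, with no daemon threads to interrupt.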