RENDLER
Python Documentation
The Python example needs a few things installed on your system before it will run.
Python Modules:
If you run the example without the required modules installed, you will see an error like this:
File "crawl_executor.py", line 25, in <module>
from bs4 import BeautifulSoup
ImportError: No module named bs4
Install the following (the yum packages provide the headers needed to build lxml):
sudo pip install wget
sudo pip install beautifulsoup4
sudo pip install html5lib
sudo yum install -y libxml2-devel
sudo yum install -y libxslt-devel
sudo yum install -y python-devel
sudo pip install lxml
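After installing, a quick check like the one below can confirm that the modules are importable before you start the executors. This is only a sketch: the helper name `missing_modules` and the `REQUIRED_MODULES` list are made up for illustration and are not part of RENDLER. Note that the `beautifulsoup4` package is imported as `bs4`.

```python
import importlib

# Import names for the packages installed above (hypothetical list for this check).
# The beautifulsoup4 package is imported under the name "bs4".
REQUIRED_MODULES = ["wget", "bs4", "html5lib", "lxml"]


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If the script reports `bs4` as missing, that is the same condition that produces the `ImportError` shown above.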
PhantomJS:
If PhantomJS is not installed, the render executor fails with an error like the following:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "render_executor.py", line 62, in run_task
if call(["phantomjs", "render.js", url, destination]) != 0:
File "/usr/lib64/python2.7/subprocess.py", line 524, in call
return Popen(*popenargs, **kwargs).wait()
File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
To resolve this, install PhantomJS. If you can find a binary for your Linux distro, go with that; I used a binary I found for CentOS 7 here. Note that there are some issues with bundling binaries for PhantomJS; see the thread here. If you must build from source, follow the steps below. The build can take an hour or so.
# packages needed to build phantomjs from source
sudo yum -y install gcc gcc-c++ make flex bison gperf ruby \
openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel \
libpng-devel libjpeg-devel
git clone --recurse-submodules https://github.com/ariya/phantomjs.git
cd phantomjs
./build.py
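The `OSError: [Errno 2]` in the traceback above is raised because `phantomjs` cannot be found on the `PATH`. Whichever way you install it, a small check along these lines can confirm the binary is reachable before running the render executor. The helper `find_executable` is hypothetical and written by hand so it also runs on Python 2.7; on Python 3.3 or later, `shutil.which` does the same job.

```python
import os


def find_executable(name):
    """Return the full path of `name` if an executable by that name is on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_executable("phantomjs")
    if path is None:
        print("phantomjs not found on PATH")
    else:
        print("phantomjs found at " + path)
```

If the check reports that the binary is missing after a successful install or build, the directory containing `phantomjs` probably still needs to be added to `PATH`.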
Parser Warning on BS4:
The executor also emits a warning about not explicitly specifying a parser for BS4, which appears to halt the script.
Executor registered on slave 586d51bc-408a-4191-bce7-8527a6c0f2f4-S0
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this (see PR #41):
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")