Benjamin Estes
Updated robots.txt in the CactusBlog example to reflect Cactus update: https://github.com/koenbok/Cactus/commit/8aef21732bfc9aaa338306c540422e7141540cdc
If for some reason a site blocks its own sitemap with a robots.txt file, the crawler should respect that and not request the sitemaps in sitemap mode.
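A minimal sketch of that check, using the standard library's urllib.robotparser; the user agent string and URLs below are placeholders, not pyscape's actual values:

```python
import urllib.robotparser

def sitemap_allowed(site_root, sitemap_url, user_agent="*"):
    # Hypothetical helper: consult the site's robots.txt before fetching a sitemap.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(site_root.rstrip("/") + "/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, sitemap_url)

# Skip sitemap mode entirely when the site disallows its own sitemap.
if not sitemap_allowed("https://example.com", "https://example.com/sitemap.xml"):
    print("sitemap blocked by robots.txt; not requesting it")
```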
(see subject)
The Config file is the most error-prone part of the process from the user's perspective. However, we can't really get around this — there are just a lot of choices...
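One way to soften this is to validate the user's config against the known options and fail early with a pointed message. A sketch of that idea follows; the option names are invented for illustration and are not pyscape's actual config schema:

```python
# Hypothetical config validation; key names are placeholders, not pyscape's real options.
KNOWN_OPTIONS = {"access_id", "secret_key", "columns", "batch_size"}

def validate_config(config):
    """Reject unknown keys up front instead of failing later in the run."""
    unknown = set(config) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(
            "Unknown config option(s): {}. Valid options are: {}".format(
                ", ".join(sorted(unknown)), ", ".join(sorted(KNOWN_OPTIONS))
            )
        )

validate_config({"access_id": "member-xxxx", "secrt_key": "oops"})  # raises ValueError
```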
From GK's input file:
```
Traceback (most recent call last):
  File "/Users/benjamin/.virtualenvs/test5/bin/pyscape", line 5, in <module>
    pkg_resources.run_script('pyscape-client==2015.02b2', 'pyscape')
  File "/Users/benjamin/.virtualenvs/test5/lib/python3.4/site-packages/pkg_resources.py", line 534, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/benjamin/.virtualenvs/test5/lib/python3.4/site-packages/pkg_resources.py", line 1441, in...
```
While getting numerous URLs using the CLI:
```
Traceback (most recent call last):
  File "/Users/benjamin/.virtualenvs/test5/lib/python3.4/site-packages/requests-2.5.3-py3.4.egg/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
...
```
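For context, this TypeError usually originates in urllib3's Python 2/3 compatibility shim rather than in the request itself: Python 3's http.client.getresponse() does not accept the buffering keyword, and urllib3 catches the TypeError and retries without it, so the real failure is likely further down in the truncated traceback. A rough, self-contained illustration of that fallback pattern (not the vendored requests/urllib3 source):

```python
import http.client

# Illustrative only: the same call-with-fallback pattern that produces the
# TypeError seen above when running under Python 3.
conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/")
try:
    # Python 2's httplib accepted buffering=True for faster socket reads.
    response = conn.getresponse(buffering=True)
except TypeError:
    # Python 3's http.client dropped that keyword, so retry without it.
    response = conn.getresponse()
print(response.status)
```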