OpenScraper
Scraper halts upon meeting a link with unicode characters
Hi,
I've successfully set up an OpenScraper instance. Unfortunately, the spider always stops scraping after 15 results.
After a bit of investigation, here is the error that seems to bring the spider to a halt:
::: ERROR scrapy.core.scraper 181122 13:46:58 ::: scraper:158 -in- handle_spider_error() ::: Spider error processing <GET https://www.ademe.fr/actualites/appels-a-projets> (referer: None)
Traceback (most recent call last):
File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/openscraper/.virtualenvs/openscraper/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/openscraper/OpenScraper/openscraper/scraper/masterspider.py", line 609, in parse
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link),follow_link) )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 45: ordinal not in range(128)
The problem is generated at lines 600 and 609.
After analysing the trace, I found that the error arises when the spider tries to follow this link:
https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats
So it seems OpenScraper has a problem handling links that are not pure ASCII.
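To make the failure concrete, here is a minimal sketch using the offending URL from the trace: the ascii codec rejects the accented character (u'\xe9', the "é"), exactly as in the traceback, while an explicit utf-8 encoding handles it fine.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

url = u"https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats"

# The ascii codec cannot represent "é" (u'\xe9'); this is the same
# UnicodeEncodeError raised inside parse() when the URL is interpolated
# into the log line.
try:
    url.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc)

# An explicit utf-8 encoding succeeds ("é" becomes two bytes).
encoded = url.encode("utf-8")
print("utf-8 ok, %d bytes" % len(encoded))
```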
I don't think it's the scraper per se; I would say it's the logging that causes the spider to stop with this error: the .format call can go berserk with accented characters when it's used in a log message (here in log_scrap)...
I remember having the same issue before... so for a start I would replace:
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link),follow_link) )
by
log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)
Let us know if that works on those lines, and then you could fix it up with a PR.
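For reference, here is that variant as a self-contained snippet (the real log_scrap replaced by a stdlib logger for the demo). The arguments are passed separately and the logging module applies the %-substitution itself; under Python 2, %-style interpolation also promotes the result to unicode instead of attempting an ascii encode the way str.format() does.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import logging

logging.basicConfig(level=logging.INFO)
log_scrap = logging.getLogger("openscraper.demo")  # stand-in for the real log_scrap

follow_link = u"https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats"

# Arguments are passed separately instead of pre-formatted with .format();
# logging interpolates them lazily, only when the record is actually handled.
log_scrap.info(" --> follow_link CLEAN (%s) : %s ", type(follow_link), follow_link)
```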
Well, the problem is that you are passing unicode variables into a byte string without an explicit encoding. Since Python 2 tries to silently convert between the two types on the fly, it is OK most of the time, but as soon as the string is not pure ASCII, an error will be raised.
There are several ways to fix this.
- You could encode data every time you want to log it:
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link.encode('utf-8')))
- You could import unicode_literals in every file to make sure all string literals are unicode and not binary.
from __future__ import unicode_literals
log_scrap.info(" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))
- You could prefix every string literal with u to make sure it is unicode and not binary.
log_scrap.info(u" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link))
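As a quick sanity check of that third option: with a u"" template, .format() stays in unicode throughout and never falls back to the ascii codec, so the accented URL formats cleanly (under Python 3 this is the default, since every str is already unicode).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

follow_link = u"https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats"

# Unicode template: .format() produces a unicode result directly,
# with no implicit ascii conversion along the way.
message = u" --> follow_link CLEAN ({}) : {} ".format(type(follow_link), follow_link)
print(message)
```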
I will publish a PR with the third solution, which allowed me to scrape the entire ademe site without errors, but you might want to check the codebase for other places where unicode and binary strings are mixed. Porting the project to Python 3 could also help, since Python 3 does not silently cast between unicode and binary.
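To illustrate that last point: where Python 2 silently attempts an ascii conversion whenever bytes and unicode meet (and so only fails on non-ASCII data), Python 3 refuses to mix the two types at all, which surfaces these bugs immediately. A minimal sketch of the Python 3 behaviour:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

prefix = b" --> follow_link : "  # bytes
url = u"https://appelsaprojets.ademe.fr/aap/H2mobilité2018-82#resultats"  # unicode

# Python 2 would try an implicit ascii conversion here and fail only once
# the data contains an accent; Python 3 raises TypeError for any bytes/str mix.
try:
    prefix + url
except TypeError as exc:
    print("refused:", exc)
```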