OpenScraper
How to debug a spider?
Hi @JulienParis,
I'm testing my own instance of OpenScraper.
So far, despite reading the documentation, I've been unable to get any real data out of OpenScraper.
I've defined a simple data model (one field), added a simple contributor, but when I "Crawl" the spider, the dataset stays empty.
Now, I'm not too sure where to go from here. I've tested and re-tested my XPath expressions, and although I might be wrong, everything seems fine there. How do I get feedback about the scraping results? How do I know what happened during the crawl and what went wrong exactly?
For now, the only way to get feedback while scraping is to run it with the terminal open (for instance, running your local instance from the terminal and checking the output, or checking the log files)...
Could you share your scraper config (a screenshot) so I can get an idea of how your first try was set up?
Hi @thibault, good to see you here :-) (I don't have answers to your questions, just saying hi :-) )
@thibault I'm also trying with my own instance but get no results from "http://www.ademe.fr/actualites/appels-a-projets"... same as you :( ... Trying to figure out what the bug is...
I tried with this:
- start_urls :
http://www.ademe.fr/actualites/appels-a-projets
- item_xpath :
//section/ul/li
- name (or whatever custom field) :
.//div[@class="content"]//h2/a/text()
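To sanity-check the item XPath and the field XPath against each other before involving the full crawler, one option is to run them over a small HTML sample with lxml (the sample markup below is hypothetical; it just mimics the structure those XPaths expect):

```python
from lxml import html

# Hypothetical sample mimicking the page structure the XPaths above expect
sample = """
<section><ul>
  <li><div class="content"><h2><a href="#">Appel 1</a></h2></div></li>
  <li><div class="content"><h2><a href="#">Appel 2</a></h2></div></li>
</ul></section>
"""

doc = html.fromstring(sample)
items = doc.xpath('//section/ul/li')  # item_xpath: one node per result
# the field XPath is evaluated relative to each item node
names = [li.xpath('.//div[@class="content"]//h2/a/text()')[0] for li in items]
print(names)  # -> ['Appel 1', 'Appel 2']
```

If this prints what you expect but the crawl still returns nothing, the XPaths are fine and the problem is in the fetching, not the parsing.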
I got nothing weird in my log, no error message, but the page is not loaded...
::: INFO log_pipeline 181121 18:58:15 ::: pipelines:80 -in- __init__() ::: >>> MongodbPipeline / __init__ ...
::: INFO log_pipeline 181121 18:58:15 ::: pipelines:87 -in- __init__() ::: --- MongodbPipeline / os.getcwd() : /Users/jpy/Dropbox/_FLASK/_CIS/_POC_EIG/CIS_scrapnado/openscraper
::: INFO scrapy.middleware 181121 18:58:15 ::: middleware:53 -in- from_settings() ::: Enabled item pipelines:
['scraper.pipelines.MongodbPipeline']
::: INFO scrapy.core.engine 181121 18:58:15 ::: engine:256 -in- open_spider() ::: Spider opened
::: DEBUG log_pipeline 181121 18:58:15 ::: pipelines:116 -in- open_spider() ::: >>> MongodbPipeline / open_spider ...
::: INFO scrapy.extensions.logstats 181121 18:58:15 ::: logstats:48 -in- log() ::: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
::: INFO log_scraper 181121 18:58:15 ::: masterspider:354 -in- start_requests() ::: --- GenericSpider.start_requests ...
::: INFO log_scraper 181121 18:58:15 ::: masterspider:358 -in- start_requests() ::: --- GenericSpider.start_requests / url : http://www.ademe.fr/actualites/appels-a-projets
::: INFO log_scraper 181121 18:58:15 ::: masterspider:363 -in- start_requests() ::: --- GenericSpider.start_requests / starting first Scrapy request...
::: INFO scrapy.core.engine 181121 18:58:16 ::: engine:295 -in- close_spider() ::: Closing spider (finished)
::: DEBUG log_pipeline 181121 18:58:16 ::: pipelines:137 -in- close_spider() ::: >>> MongodbPipeline / close_spider ...
Very weird indeed
Meanwhile, you can try with this website to check whether it's the code or the website causing trouble:
... I added the quotes.toscrape scraper and it's working fine... It must be something related to the ademe website (or the Scrapy settings, because plain requests are doing fine)...
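For reference, a config along the same lines for that test site (the XPaths here are my own guesses at quotes.toscrape.com's markup, not taken from the repo):

```
- start_urls :
http://quotes.toscrape.com
- item_xpath :
//div[@class="quote"]
- name :
.//span[@class="text"]/text()
```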
I tried a plain request from a Python shell:
>>> import requests
>>> r = requests.get('http://www.ademe.fr/actualites/appels-a-projets')
>>> print(r.content)
and no problem... So it's either Scrapy or the website.
@thibault
I think I got it!! Something is going wrong with the Scrapy settings...
I commented out line 139 in the masterspider.py file, this one:
settings.set( "RANDOMIZE_DOWNLOAD_DELAY" , RANDOMIZE_DOWNLOAD_DELAY )
And then I could scrape the ademe website.
So you could either comment out this same line on your instance, or set the RANDOMIZE_DOWNLOAD_DELAY var to False ( RANDOMIZE_DOWNLOAD_DELAY = False in your settings_scrapy.py file)... Or even better, I could add this option to the "advanced settings" as a new feature...
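In settings terms, the two workarounds above look like this (file names and line number as given in this thread):

```python
# Option 1 - in settings_scrapy.py, disable the randomized delay:
RANDOMIZE_DOWNLOAD_DELAY = False

# Option 2 - in masterspider.py (line 139), don't apply it at all:
# settings.set("RANDOMIZE_DOWNLOAD_DELAY", RANDOMIZE_DOWNLOAD_DELAY)
```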
@thibault so I added some new features to the "advanced settings" with this commit: https://github.com/entrepreneur-interet-general/OpenScraper/commit/92d99089b7c01b903b3a5e005447ad6bfbc7d47f
This lets you override the default Scrapy settings with your own advanced settings. For instance, in your case with Ademe, these settings seem to work:
@JulienParis Wow, it seems I gave you work for the entire afternoon :)
Thank you for taking the time to help. I will try your solution, and will get back to you with the results.
@DavidBruant Hi ! :)