scrapy-deltafetch
Fingerprint for initial request is not saved on redirects
Hi,
I have a spider that makes use of FormRequest, item loaders, and Request.
Here's an example for a FormRequest:
yield FormRequest(url, callback=callback, formdata=formdata)
Here's one for an item loader:
il = ItemLoader(item=MyResult())
il.add_value('date', response.meta['date'])
yield il.load_item()
And here's one for a Request:
page_request = Request(url, callback=self.parse_run_page)
yield page_request
Deltafetch is enabled and creates a .db file, but on every spider run Scrapy makes all page requests again, so no delta processing is achieved.
Any ideas? Thanks.
The reason for this issue was that the URL for which I yielded a FormRequest started with http://, while the server redirected me to the https:// version of the website (same URL, just with HTTPS). Deltafetch considered these to be two different pages and therefore decided to process the page again in the next run.
Maybe this should be documented in the wiki, and/or an option could be added to treat the http/https versions of the same page as equivalent.
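To illustrate, here is a minimal sketch (example.com is just a placeholder, not from my spider) showing that the URL scheme is part of the request fingerprint, so the http:// and https:// variants of the same page get different keys:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# The scheme is part of the canonical URL that gets hashed, so these differ:
http_fp = request_fingerprint(Request('http://example.com/page'))
https_fp = request_fingerprint(Request('https://example.com/page'))
print(http_fp == https_fp)  # False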
I don't understand the issue/the behavior you want to be documented. Can you explain with a timeline what's happening?
I think this could be added to a FAQ or the wiki to help users avoid tedious debugging sessions. When the final URL differs from the requested one just because the server redirects to the HTTPS version of the page, deltafetch will process the page again on every run, which is not obvious.
Maybe the reason why a page is not cached could also be logged in debug mode. What do you think?
Hello @mrueegg, sorry it took so long, but I had a look at this again this morning and I think I understand the issue now. I'm a bit slow sometimes ;-)
You are right that when requests are redirected, the deltafetch middleware stores the fingerprint of the redirected/final request made, and not the starting request. Here's an example spider showing that:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.utils.request import request_fingerprint


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    start_urls = ['http://httpbin.org/']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        r = scrapy.Request('http://docs.scrapy.org',
                           callback=self.parse_page)
        self.logger.info("requesting %r (fingerprint: %r)" % (r, request_fingerprint(r)))
        yield r

    def parse_page(self, response):
        self.logger.info("parse_page(%r); request %r (fingerprint: %r)" % (
            response, response.request, request_fingerprint(response.request)))
        yield {'url': response.url}
And the logs showing that the saved fingerprint is the one for the last hop of redirects:
$ scrapy crawl httpbin
2016-12-09 14:49:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: deltafetchredirect)
(...)
2016-12-09 14:49:49 [scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2016-12-09 14:49:49 [httpbin] INFO: requesting <GET http://docs.scrapy.org> (fingerprint: 'c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051')
2016-12-09 14:49:49 [scrapy] DEBUG: Redirecting (301) to <GET https://docs.scrapy.org/> from <GET http://docs.scrapy.org>
2016-12-09 14:49:50 [scrapy] DEBUG: Redirecting (302) to <GET https://docs.scrapy.org/en/latest/> from <GET https://docs.scrapy.org/>
2016-12-09 14:49:50 [scrapy] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: http://httpbin.org/)
2016-12-09 14:49:50 [httpbin] INFO: parse_page(<200 https://docs.scrapy.org/en/latest/>); request <GET https://docs.scrapy.org/en/latest/> (fingerprint: '04eee400963f6f786a539be3e465ad0f8054e4e7')
2016-12-09 14:49:50 [scrapy] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/>
{'url': 'https://docs.scrapy.org/en/latest/'}
2016-12-09 14:49:50 [scrapy] INFO: Closing spider (finished)
2016-12-09 14:49:50 [scrapy] INFO: Dumping Scrapy stats:
{'deltafetch/stored': 1,
'downloader/request_bytes': 948,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 37817,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 9, 13, 49, 50, 781514),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 9,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 12, 9, 13, 49, 49, 215001)}
2016-12-09 14:49:50 [scrapy] INFO: Spider closed (finished)
$ cd .scrapy/deltafetch/
$ ls
httpbin.db
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb3
>>> db = bsddb3.db.DB()
>>> db.open('httpbin.db')
>>> db
<DB object at 0x7f18f068d880>
>>> db.keys()
['04eee400963f6f786a539be3e465ad0f8054e4e7']
>>>
The original fingerprint for http://docs.scrapy.org, c96b7ce72fabf56ccbee0cc80e8eaba2f38e5051, does not get saved. Instead, the one for https://docs.scrapy.org/en/latest/, 04eee400963f6f786a539be3e465ad0f8054e4e7, is saved. On a subsequent crawl, the spider will still not issue a request to https://docs.scrapy.org/en/latest/ directly, so deltafetch will not see it as a duplicate.
So the issue is confirmed. The thing is, I don't know how to (easily) solve it at the moment.
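To make the mechanics concrete, here is a small runnable sketch (an illustration of the behaviour described above, not the middleware's actual source):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

yielded = Request('http://docs.scrapy.org')              # what the spider yields on every run
crawled = Request('https://docs.scrapy.org/en/latest/')  # the final hop after the redirects

stored_key = request_fingerprint(crawled)   # deltafetch stores this key when items are scraped
checked_key = request_fingerprint(yielded)  # deltafetch looks this key up on the next run

print(stored_key == checked_key)  # False, so the request is never skipped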
This case can be handled with a custom deltafetch_key in the request meta:

import hashlib

# hashlib.sha1() needs bytes, hence the .encode() (required on Python 3)
request = scrapy.Request(original_url, callback=self.parse_item,
                         meta={'deltafetch_key': hashlib.sha1(original_url.encode('utf-8')).hexdigest()})
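For example, applied to the spider above (a sketch; any key that is stable across the redirects would work):

import hashlib

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        url = 'http://docs.scrapy.org'
        # Key the deltafetch entry on the URL we yield, not on the
        # post-redirect request, so the next run can skip it.
        yield scrapy.Request(url, callback=self.parse_page,
                             meta={'deltafetch_key': hashlib.sha1(url.encode('utf-8')).hexdigest()})

    def parse_page(self, response):
        yield {'url': response.url}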