SQLAlchemyBackend does not work with scrapy_splash
The arguments for Splash are serialized to JSON by SplashMiddleware and become the request body:
body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)
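For illustration, here is a minimal spider (hypothetical, using scrapy_splash's documented SplashRequest API) showing how the target URL ends up in the JSON body while request.url points at the Splash endpoint:

import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'example'

    def start_requests(self):
        # SplashMiddleware rewrites this into a POST to SPLASH_URL;
        # the target URL and args travel in the JSON request body.
        yield SplashRequest('http://example.com', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        self.logger.info('rendered %s', response.url)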
However, the QueueModel used by the SQLAlchemy backend has no column for request.body.
One possible solution is to add a new column, along with the logic for storing it and restoring it onto the request. After that change, frontera on my machine seems to work well with scrapy_splash, except that the url/domain fingerprints need replacing.
from sqlalchemy import (BigInteger, Column, Float, Integer, PickleType,
                        SmallInteger, String)

class QueueModelMixin(object):
    __table_args__ = (
        {
            'mysql_charset': 'utf8',
            'mysql_engine': 'InnoDB',
            'mysql_row_format': 'DYNAMIC',
        },
    )
    id = Column(Integer, primary_key=True)
    partition_id = Column(Integer, index=True)
    score = Column(Float, index=True)
    url = Column(String(1024), nullable=False)
    fingerprint = Column(String(40), nullable=False)
    host_crc32 = Column(Integer, nullable=False)
    meta = Column(PickleType())
    # proposed addition: body = Column(String(1024))
    headers = Column(PickleType())
    cookies = Column(PickleType())
    method = Column(String(6))
    created_at = Column(BigInteger, index=True)
    depth = Column(SmallInteger)
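For completeness, a hedged sketch of where the storing and restoring would happen. The helper names _store and _restore are hypothetical, and the attribute flow is simplified rather than copied from the backend's queue component:

from frontera.core.models import Request

def _store(queue_model, fprint, score, request, partition_id, host_crc32):
    # Persist request.body alongside the other request attributes.
    return queue_model(fingerprint=fprint, score=score, url=request.url,
                       meta=request.meta, headers=request.headers,
                       cookies=request.cookies, method=request.method,
                       body=request.body,  # the proposed new column
                       partition_id=partition_id, host_crc32=host_crc32)

def _restore(item):
    # Rebuild the frontier request, including the body, when dequeuing.
    return Request(item.url, method=item.method, meta=item.meta,
                   headers=item.headers, cookies=item.cookies,
                   body=item.body)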
Yeah, I absolutely agree with adding this field.
I'm not sure String(1024) is enough. Body is required to handle POST or PUT requests properly; this is not specific to scrapy-splash. Also, request bodies are binary, not strings, so something like LargeBinary (a blob) looks like a better fit.
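A minimal sketch of that suggestion (the mixin name BodyColumnMixin is hypothetical; it just shows the column type):

from sqlalchemy import Column, LargeBinary

class BodyColumnMixin(object):
    # LargeBinary maps to a BLOB, so arbitrary binary POST/PUT
    # payloads (including Splash's JSON args) round-trip intact.
    body = Column(LargeBinary())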
As for the scrapy_splash issue, I found a relatively simple solution. Just as frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler deals with redirected requests, requests to SPLASH_URL can be kept in the pending queue rather than persisted to the backend:
def enqueue_request(self, request):
    if not self._request_is_redirected(request):
        self.frontier.add_seeds([request])
        self.stats_manager.add_seeds()
        return True
    elif self.redirect_enabled:
        self._add_pending_request(request)
        self.stats_manager.add_redirected_requests()
        return True
    return False
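For context, the pending queue is an in-memory structure the scheduler consults before asking the backend for more requests. A hedged sketch of the mechanism (the deque-based detail is an assumption based on the scheduler's helper names, not quoted source):

from collections import deque

class PendingQueueSketch(object):
    # Sketch only: requests parked here are handed back to Scrapy
    # directly, bypassing the frontier backend entirely.
    def __init__(self):
        self._pending_requests = deque()

    def _add_pending_request(self, request):
        self._pending_requests.append(request)

    def _get_pending_request(self):
        return self._pending_requests.popleft() if self._pending_requests else None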
A possible solution would look like this:
from frontera.contrib.scrapy.schedulers.frontier import FronteraScheduler

class SplashAwareFronteraScheduler(FronteraScheduler):
    # Class name chosen for illustration; any subclass of
    # FronteraScheduler wired in via the SCHEDULER setting works.

    def __init__(self, crawler, manager=None):
        super(SplashAwareFronteraScheduler, self).__init__(crawler, manager)
        self.settings = crawler.settings

    def enqueue_request(self, request):
        # Keep Splash requests in the pending queue instead of
        # sending them to the backend.
        splash_url = self.settings.get('SPLASH_URL')
        if splash_url and splash_url in request.url:
            self._add_pending_request(request)
            self.logger.info('Recycle SplashRequest to pending queue')
            return True
        elif not self._request_is_redirected(request):
            self.frontier.add_seeds([request])
            self.stats_manager.add_seeds()
            return True
        elif self.redirect_enabled:
            self._add_pending_request(request)
            self.stats_manager.add_redirected_requests()
            return True
        return False
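To try it, the overridden scheduler can be wired in through Scrapy's settings (the module path myproject.scheduler is hypothetical):

# settings.py (illustrative)
SPLASH_URL = 'http://localhost:8050'
SCHEDULER = 'myproject.scheduler.SplashAwareFronteraScheduler'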
This saves the work of customizing the SQLAlchemy model and the fingerprint module. It seems to work fine on my machine (frontera 0.7.0, scrapy 1.2.2).
@dingld the only con is that it will not survive a process restart, but for some applications this isn't necessary. For a general-purpose solution I would extend the SQLAlchemy backend with the fields needed. Would anyone like to make a PR?
Hi @sibiryakov,
I have overridden FronteraScheduler with the changes suggested by @dingld to make my Splash requests work. However, I didn't understand your comment.
Could you take a moment to explain it, please?
Thanks.