scrapy-selenium is yielding normal scrapy.Request instead of SeleniumRequest
@clemfromspace I just decided to use your package in my Scrapy project, but it is yielding normal scrapy.Request instead of SeleniumRequest.
from shutil import which

from scrapy.contracts import Contract
from scrapy_selenium import SeleniumRequest


class WithSelenium(Contract):
    """Contract to set the request class to SeleniumRequest for the callback method under test:
    @with_selenium
    """
    name = 'with_selenium'
    request_cls = SeleniumRequest
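For completeness, a custom contract only takes effect once it is registered through the SPIDER_CONTRACTS setting; a minimal sketch, assuming the class above lives in a hypothetical myproject/contracts.py (the priority value is arbitrary):

# settings.py -- sketch; module path and priority are placeholders
SPIDER_CONTRACTS = {
    'myproject.contracts.WithSelenium': 500,
}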
class WebsiteSpider(BaseSpider):
    name = 'Website'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_selenium.SeleniumMiddleware': 800
        },
        'SELENIUM_DRIVER_NAME': 'firefox',
        'SELENIUM_DRIVER_EXECUTABLE_PATH': which('geckodriver'),
        'SELENIUM_DRIVER_ARGUMENTS': ['-headless']
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url,
                                  callback=self.parse_result)

    def parse_result(self, response):
        """
        @with_selenium
        """
        print(response.request.meta['driver'].title)  # --> gives KeyError
I have seen this issue, but it is not helpful at all.
I plus-oned this and then solved it for myself a little later.
For me this is not in the context of testing, so I have no need for contracts (at least as far as I understand it).
My solve was the following:
- Override start_requests() (as you have done)
- yield SeleniumRequest() in parse_result. I notice that you use parse_result() instead of parse().
Once I did this it started working. My solution snippet:
def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    for url in self.start_urls:
        yield SeleniumRequest(url=url, dont_filter=True)

def parse(self, response):
    # Follow every extracted link with a SeleniumRequest so the middleware
    # keeps handling the downstream responses as well.
    le = LinkExtractor()
    for link in le.extract_links(response):
        yield SeleniumRequest(
            url=link.url,
            callback=self.parse_result
        )

def parse_result(self, response):
    page = PageItem()
    page['url'] = response.url
    yield page
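With every request in the chain issued as a SeleniumRequest, the SeleniumMiddleware attaches the webdriver to the request meta, so the lookup from the original post should work too. A minimal sketch of the final callback (assuming the same PageItem as above; the meta.get() guard is just defensive):

def parse_result(self, response):
    # 'driver' is only present when this response came from a SeleniumRequest
    # handled by scrapy_selenium.SeleniumMiddleware.
    driver = response.request.meta.get('driver')
    if driver is not None:
        print(driver.title)
    page = PageItem()
    page['url'] = response.url
    yield page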
Hey @undernewmanagement
I tried your snippet but the links in LinkExtractor are not processed correctly (response body is not text).
rules = (
    Rule(LinkExtractor(restrict_xpaths=['//*[@id="breadcrumbs"]']), follow=True),
)

def start_requests(self):
    for url in self.start_urls:
        yield SeleniumRequest(url=url, dont_filter=True)

def parse_start_url(self, response):
    return self.parse_result(response)

def parse(self, response):
    le = LinkExtractor()
    for link in le.extract_links(response):
        yield SeleniumRequest(url=link.url, callback=self.parse_result)

def parse_result(self, response):
    page = PageItem()
    page['url'] = response.url
    yield page
I had to use parse_start_url to assign the parse_result callback to start urls.
Do you know what the problem could be? I'm new in Scrapy and Python.
Thanks!
Hey @educatron thanks for the question - let's not hijack the thread here. I think you should take that question directly to the scrapy community. https://scrapy.org/community/
Ok. Thanks!