                        Provide an MVP implementation of a session middleware
To do:

- [ ] Confirm the user-facing API is as agreed with @VMRuiz and @proway2.
- [x] Make existing tests pass.
- [x] Restore compatibility with Scrapy 2.0.1+.
- [x] Perform a minimal test with a real spider.
- [ ] Provide complete test coverage.
- [x] Add more session links to the docs once client-managed session documentation has been published.
- [ ] Get Zyte API to fail with a proper status code for expired sessions.
- [ ] Document the entire configuration needed (disabling cookies, using a custom retry policy that does not retry download errors, enabling a body-providing field for providers); see the sketch after this list.
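A minimal sketch of the settings that last item refers to, mirroring the example spiders shown later in this thread (SESSION_RETRY_POLICY is the custom tenacity policy defined in those examples):

```python
# Sketch only: the exact required set is what the to-do item above aims to
# document.
COOKIES_ENABLED = False  # sessions handle cookies
ZYTE_API_RETRY_POLICY = SESSION_RETRY_POLICY  # do not retry download errors
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}  # body for session checks
```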
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 97.56%. Comparing base (a2284c8) to head (c3ed86f). Report is 6 commits behind head on main.
:exclamation: Current head c3ed86f differs from pull request most recent head c01d368
Please upload reports for the commit c01d368 to get more accurate results.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #193      +/-   ##
==========================================
- Coverage   98.45%   97.56%   -0.90%     
==========================================
  Files          13       14       +1     
  Lines        1102     1476     +374     
  Branches        0      309     +309     
==========================================
+ Hits         1085     1440     +355     
+ Misses         17       15       -2     
- Partials        0       21      +21     
| Files | Coverage Δ | |
|---|---|---|
| scrapy_zyte_api/__init__.py | 100.00% <100.00%> (ø) | |
| scrapy_zyte_api/_middlewares.py | 97.65% <100.00%> (ø) | |
| scrapy_zyte_api/_session.py | 100.00% <100.00%> (ø) | |
| scrapy_zyte_api/addon.py | 98.07% <100.00%> (-1.93%) | :arrow_down: |
| scrapy_zyte_api/utils.py | 100.00% <100.00%> (ø) | |
I have created a project based on https://github.com/zytedata/zyte-spider-templates-project and added the following spider to it:
from logging import getLogger
from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
logger = getLogger(__name__)
class custom_throttling_stop(stop_base):
    def __call__(self, retry_state: RetryCallState) -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )
class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)
    # Work around a temporary Zyte API bug: expired sessions are reported as
    # 429 responses, which should not be retried as regular throttling.
    throttling_stop = custom_throttling_stop()
SESSION_RETRY_POLICY = CustomRetryFactory().build()
class _SessionChecker:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)
    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]
    def check_session(self, request: Request, response: Response) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False
class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}
        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0
        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses
Then I executed it as follows:
scrapy crawl session_ecommerce -a "url=https://ecommerce.example/product-list" -a crawl_strategy=pagination_only
It seems to work well enough, although retries can be exhausted, so recommending a higher RETRY_TIMES value in the docs might be worthwhile.
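For example, the retry budget could be raised in the project settings (a sketch; 10 is an illustrative value, not a tested recommendation):

```python
# Scrapy's default RETRY_TIMES is 2; bans consume retries, and each ban also
# means retrying with a different session, so a larger budget gives fresh
# sessions more chances.
RETRY_TIMES = 10
```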
@Gallaecio Do you have a real job or stats? How many times did it get an invalid response? For what reasons were responses invalidated?
For a short crawl I performed just now:
'scrapy-zyte-api/processed': 21
'scrapy-zyte-api/status_codes/200': 18
'scrapy-zyte-api/status_codes/521': 3
'zyte-api-session/checks/passed': 18
'zyte-api-session/sessions': 11
i.e. 11 sessions were created: the default is to create 8, and the 3 that got a 521 response had to be replaced. Once a valid session was created, all (7) usages succeeded. Of course, it is a rather small sample.
In any case, I will now try to build some of @VMRuiz's ideas for a better location API into this PR (including per-domain, web-poet-like configurations).
https://github.com/scrapy-plugins/scrapy-zyte-api/pull/193/commits/0638df7942b62ca13b937d9f9a275d59a2ac6dff builds on the ideas @VMRuiz shared elsewhere.
On the one hand, it enables a location-specific approach with poet-based, per-domain overrides:
from logging import getLogger
from typing import Any
from pydantic import BaseModel, Field
from pydantic.types import Json
from scrapy import Request, Spider
from scrapy.crawler import Crawler
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from scrapy_spider_metadata import Args
from scrapy_zyte_api import SessionConfig, session_config
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.base import ARG_SETTING_PRIORITY
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams
logger = getLogger(__name__)
class custom_throttling_stop(stop_base):
    def __call__(self, retry_state: RetryCallState) -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )
class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)
    # Work around a temporary Zyte API bug: expired sessions are reported as
    # 429 responses, which should not be retried as regular throttling.
    throttling_stop = custom_throttling_stop()
SESSION_RETRY_POLICY = CustomRetryFactory().build()
@session_config("ecommerce.example")
class EcommerceExampleLocationSessionConfig(SessionConfig):
    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        expected_zip_code = self.location(request)["postalCode"]
        if zip_code == expected_zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{expected_zip_code!r})."
        )
        return False
class LocationParam(BaseModel):
    location: Json[Any] = Field(default_factory=dict)
class LocationSpiderParams(LocationParam, EcommerceSpiderParams):
    pass
class LocationEcommerceSpider(EcommerceSpider, Args[LocationSpiderParams]):
    name = "location_ecommerce"
    @classmethod
    def from_crawler(cls, crawler: Crawler, *args, **kwargs) -> Spider:
        spider = super().from_crawler(crawler, *args, **kwargs)
        if spider.args.location:
            spider.settings.set(
                "ZYTE_API_SESSION_ENABLED",
                True,
                priority=ARG_SETTING_PRIORITY,
            )
            spider.settings.set(
                "ZYTE_API_SESSION_LOCATION",
                spider.args.location,
                priority=ARG_SETTING_PRIORITY,
            )
        return spider
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}
        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0
        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses
On the other hand, it also supports a non-location-specific approach, as well as defining session initialization parameters and a check function through settings rather than poet-based overrides:
from logging import getLogger
from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
logger = getLogger(__name__)
class custom_throttling_stop(stop_base):
    def __call__(self, retry_state: RetryCallState) -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )
class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)
    # Work around a temporary Zyte API bug: expired sessions are reported as
    # 429 responses, which should not be retried as regular throttling.
    throttling_stop = custom_throttling_stop()
SESSION_RETRY_POLICY = CustomRetryFactory().build()
class _SessionChecker:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)
    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]
    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False
class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"
    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}
        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0
        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses
Also, multiple session pools are now supported, and by default each domain has its own pool. The stats now show check passes and failures separately depending on whether they happened during session initialization or regular response checks, and are split per session pool:
 'zyte-api-session/checks/ecommerce.example/passed': 7,
 'zyte-api-session/init/ecommerce.example/failed': 2,
 'zyte-api-session/init/ecommerce.example/passed': 11,
Now I need to figure out why the number of session initialization passes (11) is higher than the number of regular check passes (7). It feels like more sessions are being initialized than necessary.
I also want to implement a basic default check for setLocation based on the action outcome, i.e. if the action failed, consider the initialization failed by default. And maybe try to detect when setLocation is not supported for the target website, and stop the spider in that scenario.
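A rough sketch of what that default check could look like, assuming the session response exposes the raw Zyte API response via raw_api_response (as scrapy-zyte-api responses do) and that failed actions report an error field; the class name and details are illustrative, not the planned implementation:

```python
from logging import getLogger

from scrapy import Request
from scrapy.http.response import Response
from scrapy_zyte_api import SessionConfig

logger = getLogger(__name__)


class ActionOutcomeSessionConfig(SessionConfig):
    # Sketch: consider session initialization failed when the setLocation
    # action itself reported an error in the Zyte API response.
    def check(self, response: Response, request: Request) -> bool:
        actions = getattr(response, "raw_api_response", {}).get("actions", [])
        for action in actions:
            if action.get("action") != "setLocation":
                continue
            if action.get("error"):
                logger.debug(f"setLocation failed in {response}: {action}")
                return False
        return True
```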