
Provide an MVP implementation of a session middleware

Open Gallaecio opened this issue 1 year ago • 3 comments

To do:

  • [ ] Confirm the user-facing API is as agreed with @VMRuiz and @proway2.
  • [x] Make existing tests pass.
    • [x] Restore compatibility with Scrapy 2.0.1+.
  • [x] Perform a minimal test with a real spider.
  • [ ] Provide complete test coverage.
  • [x] Add more session links to the docs once client-managed session documentation has been published.
  • [ ] Get Zyte API to fail with a proper status code for expired sessions.
  • [ ] Document the entire configuration needed (disabling cookies, using a custom retry policy that does not retry download errors, enabling a body-providing field for providers).
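For reference, a minimal sketch of what that configuration could look like as a `settings.py` fragment. It mirrors the settings used in the spider examples later in this thread; `SESSION_RETRY_POLICY` is the custom policy built in those examples, so it is left commented out here:

```python
# Hypothetical settings.py fragment mirroring the spider examples in this
# thread; not the final documented configuration.

COOKIES_ENABLED = False  # Disable cookies: sessions handle them.

# A custom retry policy that does not retry download errors.
# SESSION_RETRY_POLICY is built in the spider examples in this thread:
# ZYTE_API_RETRY_POLICY = SESSION_RETRY_POLICY

# Enable a body-providing field so session checks can inspect responses.
ZYTE_API_PROVIDER_PARAMS = {"browserHtml": True}
```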

Gallaecio avatar Apr 17 '24 17:04 Gallaecio

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.56%. Comparing base (a2284c8) to head (c3ed86f). Report is 6 commits behind head on main.

:exclamation: Current head c3ed86f differs from pull request most recent head c01d368

Please upload reports for the commit c01d368 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #193      +/-   ##
==========================================
- Coverage   98.45%   97.56%   -0.90%     
==========================================
  Files          13       14       +1     
  Lines        1102     1476     +374     
  Branches        0      309     +309     
==========================================
+ Hits         1085     1440     +355     
+ Misses         17       15       -2     
- Partials        0       21      +21     
Files Coverage Δ
scrapy_zyte_api/__init__.py 100.00% <100.00%> (ø)
scrapy_zyte_api/_middlewares.py 97.65% <100.00%> (ø)
scrapy_zyte_api/_session.py 100.00% <100.00%> (ø)
scrapy_zyte_api/addon.py 98.07% <100.00%> (-1.93%) :arrow_down:
scrapy_zyte_api/utils.py 100.00% <100.00%> (ø)

... and 6 files with indirect coverage changes

codecov[bot] avatar Apr 18 '24 07:04 codecov[bot]

I have created a project based on https://github.com/zytedata/zyte-spider-templates-project and added the following spider to it:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520s; let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Handle temporary bug
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check_session(self, request: Request, response: Response) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses

And executed it as follows:

scrapy crawl session_ecommerce -a "url=https://ecommerce.example/product-list" -a crawl_strategy=pagination_only

It seems to work well enough, although retries can be exceeded, so recommending a higher RETRY_TIMES value in the docs might be worthwhile.
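A hypothetical project-level tweak along those lines (the exact value is arbitrary):

```python
# Hypothetical settings.py tweak: raise Scrapy's retry budget so that
# requests retried with a fresh session are not dropped early.
RETRY_TIMES = 10  # Scrapy's default is 2.
```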

Gallaecio avatar May 15 '24 15:05 Gallaecio

@Gallaecio Do you have a real job or stats? How many times did it get an invalid response? For what reasons were responses invalidated?

proway2 avatar May 16 '24 12:05 proway2

@Gallaecio Do you have a real job or stats? How many times did it get an invalid response? For what reasons were responses invalidated?

For a short crawl I performed just now:

'scrapy-zyte-api/processed': 21
'scrapy-zyte-api/status_codes/200': 18
'scrapy-zyte-api/status_codes/521': 3
'zyte-api-session/checks/passed': 18
'zyte-api-session/sessions': 11

i.e. 11 sessions were created: the default is to create 8, and the 3 that got a 521 were replaced. Once a valid session was created, all (7) of its usages succeeded. Of course, it is a rather small sample.
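The arithmetic behind those stats can be sanity-checked:

```python
# Sanity check of the session arithmetic from the stats above.
default_pool_size = 8   # Sessions created up front by default.
replaced_after_521 = 3  # Session initializations that got a 521.
total_sessions = default_pool_size + replaced_after_521
print(total_sessions)  # 11, matching 'zyte-api-session/sessions'.

# The 21 processed requests split into the observed status codes.
assert 18 + 3 == 21
```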

In any case, I will now try to build some of @VMRuiz's ideas for a better location API into this PR (including per-domain, web-poet-like configurations).

Gallaecio avatar May 16 '24 22:05 Gallaecio

https://github.com/scrapy-plugins/scrapy-zyte-api/pull/193/commits/0638df7942b62ca13b937d9f9a275d59a2ac6dff is based on the ideas shared by @VMRuiz elsewhere.

It enables a location-specific approach with poet-based overrides:

from logging import getLogger
from typing import Any

from pydantic import BaseModel, Field
from pydantic.types import Json
from scrapy import Request, Spider
from scrapy.crawler import Crawler
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from scrapy_spider_metadata import Args
from scrapy_zyte_api import SessionConfig, session_config
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.base import ARG_SETTING_PRIORITY
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520s; let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Handle temporary bug
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


@session_config("ecommerce.example")
class EcommerceExampleLocationSessionConfig(SessionConfig):

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        expected_zip_code = self.location(request)["postalCode"]
        if zip_code == expected_zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{expected_zip_code!r})."
        )
        return False


class LocationParam(BaseModel):
    location: Json[Any] = Field(default_factory=dict)


class LocationSpiderParams(LocationParam, EcommerceSpiderParams):
    pass


class LocationEcommerceSpider(EcommerceSpider, Args[LocationSpiderParams]):
    name = "location_ecommerce"

    @classmethod
    def from_crawler(cls, crawler: Crawler, *args, **kwargs) -> Spider:
        spider = super().from_crawler(crawler, *args, **kwargs)
        if spider.args.location:
            spider.settings.set(
                "ZYTE_API_SESSION_ENABLED",
                True,
                priority=ARG_SETTING_PRIORITY,
            )
            spider.settings.set(
                "ZYTE_API_SESSION_LOCATION",
                spider.args.location,
                priority=ARG_SETTING_PRIORITY,
            )
        return spider

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses

But it also supports a non-location-specific approach, as well as a non-poet-like definition of session initialization parameters and a check function:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520s; let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Handle temporary bug
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}  # Cannot validate extraction-only responses

Also, multiple session pools are now supported, and by default each domain gets its own pool. The stats now report check passes and failures separately for session initialization and regular response checks, split per session pool:

 'zyte-api-session/checks/ecommerce.example/passed': 7,
 'zyte-api-session/init/ecommerce.example/failed': 2,
 'zyte-api-session/init/ecommerce.example/passed': 11}

Now I need to figure out why the number of session initialization passes is higher than the number of regular check passes. It feels like more sessions are being initialized than necessary.

I also want to implement a basic default check for setLocation based on the action outcome, i.e. if the action failed, consider the initialization failed by default. And maybe try to find out whether setLocation is unsupported on the target website, and stop the spider in that scenario.
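Such a default check could inspect the action results in the raw API response. A minimal sketch, assuming (hypothetically) that each action result is a dict with an "action" name and an "error" key on failure; the actual Zyte API response schema may differ, and the function name is made up for illustration:

```python
def default_setlocation_check(raw_api_response: dict) -> bool:
    """Hypothetical default check: treat session initialization as failed
    if any setLocation action reports an error.

    Assumes action results are dicts with an "action" name and an "error"
    key on failure; the real Zyte API schema may differ.
    """
    for result in raw_api_response.get("actions", []):
        if result.get("action") == "setLocation" and result.get("error"):
            return False
    return True


# Example with fabricated response payloads:
ok = {"actions": [{"action": "setLocation"}]}
failed = {"actions": [{"action": "setLocation", "error": "Action failed"}]}
print(default_setlocation_check(ok), default_setlocation_check(failed))
# → True False
```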

Gallaecio avatar May 17 '24 10:05 Gallaecio