
Introduce alternative constructors to handle nested dependencies [migrate logic from scrapy-poet]

Open BurnzZ opened this issue 2 years ago • 1 comment

Background

Given the following PO (Page Object) structure:

import attr

from web_poet.pages import Injectable
from web_poet.page_inputs import ResponseData


@attr.define
class HTMLFromResponse(Injectable):
    response: ResponseData


@attr.define
class WebPage(Injectable):
    response: ResponseData


@attr.define
class HTMLWebPage(WebPage):
    html: HTMLFromResponse

The following does not work: since HTMLWebPage is a subclass of WebPage, its constructor effectively requires both response: ResponseData and html: HTMLFromResponse:

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> page = HTMLWebPage(response)
TypeError: __init__() missing 1 required positional argument: 'html'

We'll need to provide both of the required constructor arguments:

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> html = HTMLFromResponse(response)
>>> page = HTMLWebPage(response, html)

This is a bit tedious, since the only real root dependency in the tree is ResponseData. If the PO we're instantiating has a deeply nested dependency structure, it becomes hard to keep track of all the necessary constructor arguments.
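To illustrate how the manual wiring grows with depth, here is a self-contained sketch (using stdlib dataclasses in place of attrs to keep it runnable; LinksFromHTML and LinksWebPage are hypothetical classes, not part of web-poet):

```python
from dataclasses import dataclass


@dataclass
class ResponseData:
    url: str
    html: str


@dataclass
class HTMLFromResponse:
    response: ResponseData


@dataclass
class LinksFromHTML:  # hypothetical: one more level of nesting
    html: HTMLFromResponse


@dataclass
class LinksWebPage:  # hypothetical PO with three constructor arguments
    response: ResponseData
    html: HTMLFromResponse
    links: LinksFromHTML


# Even though ResponseData is the only real input, every intermediate
# object must be constructed and threaded through by hand:
response = ResponseData(url='https://example.com/', html='Example Content')
html = HTMLFromResponse(response)
links = LinksFromHTML(html)
page = LinksWebPage(response, html, links)
```

Each extra level of nesting adds another line of boilerplate at every call site that instantiates the PO.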

However, when POs are used in a Scrapy project with the InjectionMiddleware provided by https://github.com/scrapinghub/scrapy-poet, this is not a problem, since the middleware takes care of resolving all the necessary dependencies for the PO (it uses https://github.com/scrapinghub/andi underneath):

import scrapy
from poet_injection_in_scrapy.page_objects import HTMLWebPage


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
        }
    }

    # scrapy-poet provides all the necessary dependencies needed by HTMLWebPage
    def parse(self, response, page: HTMLWebPage):
        pass

Problem

@gatufo raised a good point about using POs outside the context of a Scrapy project, which currently forgoes the dependency resolution conveniently provided by https://github.com/scrapinghub/scrapy-poet.

Supporting this would also expand the use cases of POs beyond spiders: using them in a standalone script, deploying them behind an API, etc.

Proposal

This issue aims to discuss and explore the possibilities of moving the necessary injection logic already implemented in scrapy-poet (reference module) into web-poet itself.

The migrated logic could then be accessed via an alternative constructor named from_response() (see the example below):

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> page = HTMLWebPage.from_response(response)

from_response() could be renamed to something else, but this name closely follows Scrapy's convention for alternative constructors such as from_crawler(), from_settings(), etc.
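For discussion purposes, a minimal sketch of what such an alternative constructor could look like, assuming POs whose only leaf dependency is ResponseData (stdlib dataclasses are used in place of attrs to keep the example self-contained, and the FromResponseMixin name is hypothetical):

```python
from dataclasses import dataclass, fields


@dataclass
class ResponseData:
    url: str
    html: str


class FromResponseMixin:
    """Hypothetical mixin: builds a PO and its nested dependencies
    from a single ResponseData instance."""

    @classmethod
    def from_response(cls, response):
        kwargs = {}
        for f in fields(cls):
            if f.type is ResponseData:
                # Leaf dependency: pass the response through directly.
                kwargs[f.name] = response
            else:
                # Nested PO: build it recursively from the same response.
                kwargs[f.name] = f.type.from_response(response)
        return cls(**kwargs)


@dataclass
class HTMLFromResponse(FromResponseMixin):
    response: ResponseData


@dataclass
class HTMLWebPage(FromResponseMixin):
    response: ResponseData
    html: HTMLFromResponse


response = ResponseData(url='https://example.com/', html='Example Content')
page = HTMLWebPage.from_response(response)  # no manual wiring needed
```

This is only a sketch: a real implementation would need to handle non-attrs POs, dependencies other than the response, and cycles.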

BurnzZ avatar Dec 15 '21 05:12 BurnzZ

I'm not sure how it could work. The referenced code in scrapy-poet is Scrapy-specific; that's why it lives in the scrapy-poet package, not in web-poet.

Also, page objects may define all kinds of dependencies which are not related to Scrapy's response, so a from_response constructor looks quite limited. There could be page objects which don't need a Scrapy response, and page objects which need something that doesn't come from a Scrapy response (i.e. the response alone is not enough). There could also be many different ways to obtain these dependencies: extracting them from already-present data (like from_response), making async requests (using Twisted Deferreds? using asyncio? etc.). It could even be the case that dependencies are inspected at import time but actually provided on a different machine.

So, -1 to adding from_response to Page Objects: it ties web-poet to Scrapy, and it's also not generic enough. Using something like from_response is an anti-pattern, because it would mean that only a certain kind of Page Object is supported. That's not an issue when a framework like scrapy-poet, which relies on andi, is used.

Overall, if we can extract some code to create nested dependencies, that would be good, but I'm not sure how to do it beyond what andi already provides.

What's probably missing is something simple, likely a non-asynchronous framework to run page objects, which could be used for quick tests, in IPython notebooks, etc. But that's a separate discussion.
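As a rough illustration of what such a minimal synchronous runner could look like (this is not part of web-poet, scrapy-poet, or andi; the `build` function and class names are hypothetical), one could resolve a PO's constructor annotations against a mapping of already-available input types:

```python
import typing
from dataclasses import dataclass


@dataclass
class ResponseData:
    url: str
    html: str


@dataclass
class HTMLFromResponse:
    response: ResponseData


@dataclass
class HTMLWebPage:
    response: ResponseData
    html: HTMLFromResponse


def build(cls, provided):
    """Hypothetical synchronous builder: construct ``cls`` by taking
    leaf inputs from ``provided`` (a {type: instance} mapping) and
    recursively building everything else from constructor annotations."""
    hints = typing.get_type_hints(cls.__init__)
    hints.pop('return', None)  # drop the return annotation
    kwargs = {
        name: provided[dep] if dep in provided else build(dep, provided)
        for name, dep in hints.items()
    }
    return cls(**kwargs)


response = ResponseData(url='https://example.com/', html='Example Content')
page = build(HTMLWebPage, {ResponseData: response})
```

A real version would also need cycle detection, `Optional`/union handling, and async providers, which is essentially what andi and scrapy-poet already do.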

kmike avatar May 06 '22 18:05 kmike