Allowing subfields to be extracted

Open BurnzZ opened this issue 1 year ago • 5 comments

This stems from https://github.com/scrapinghub/scrapy-poet/pull/111, where we'd want the API for extracting data from a subset of fields to be implemented in web-poet itself.

API

The main directives that we want to support are:

include_fields
exclude_fields

Using both directives together should not be allowed.
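A minimal sketch of how the two directives could be validated and resolved into a list of field names. The helper name `resolve_field_names` and its signature are assumptions for illustration, not an existing web-poet API:

```python
from typing import Iterable, List, Optional


def resolve_field_names(
    available: Iterable[str],
    include_fields: Optional[Iterable[str]] = None,
    exclude_fields: Optional[Iterable[str]] = None,
) -> List[str]:
    """Hypothetical helper: pick which of the available field names to extract."""
    # Reject the combination outright instead of guessing a precedence.
    if include_fields is not None and exclude_fields is not None:
        raise ValueError("Use either include_fields or exclude_fields, not both")
    available = list(available)
    if include_fields is not None:
        wanted = set(include_fields)
        return [name for name in available if name in wanted]
    if exclude_fields is not None:
        unwanted = set(exclude_fields)
        return [name for name in available if name not in unwanted]
    # Neither directive given: extract everything.
    return available
```

With this shape, an empty `include_fields` yields no names and an empty `exclude_fields` yields all of them, while unknown names are silently ignored.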

via page object instance

item = partial_item_from_page_obj(product_page, include_fields=["x", "y"])
print(item)  # ProductItem(x=1, y=2, z=None)
print(type(product_page))  # ProductPage

This API assumes we already have an instance of the page object with the appropriate response data in it. Moreover, the item class can be inferred from the page object definition:

class ProductPage(WebPage[ProductItem]):
    ...

Arguably, we could also support page object classes as long as the URL is provided for the response data to be downloaded by the configured downloader.

via item class

Conversely, we could also support directly asking for the item class instead of the page object as long as we have access to the ApplyRule to infer their relationships. Unlike the page object, a single item class could have relationships to multiple page objects, depending on the URL.

However, this still requires the response to be downloaded, which means a downloader must be configured.

item = partial_item_from_item_cls(
    ProductItem, include_fields=["x", "y"], url="https://example.com"
)
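To illustrate how an item class could be mapped back to a page object depending on the URL, here is a rough sketch. The `Rule` dataclass is a heavily simplified stand-in for `ApplyRule` (it matches on a bare domain only), and `page_cls_for_item` is a hypothetical helper, not something web-poet provides:

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlparse


class ProductItem:
    """Stand-in item class."""


class ExampleProductPage:
    """Stand-in page object for example.com."""


class OtherProductPage:
    """Stand-in page object for other.example.org."""


@dataclass
class Rule:
    # Simplified stand-in for ApplyRule: one domain, one page object, one item.
    domain: str
    use: type
    to_return: type


def page_cls_for_item(rules: Iterable[Rule], item_cls: type, url: str) -> Optional[type]:
    """Hypothetical lookup: first rule returning item_cls whose domain matches url."""
    host = urlparse(url).hostname or ""
    for rule in rules:
        domain_matches = host == rule.domain or host.endswith("." + rule.domain)
        if rule.to_return is item_cls and domain_matches:
            return rule.use
    return None


rules = [
    Rule("example.com", ExampleProductPage, ProductItem),
    Rule("other.example.org", OtherProductPage, ProductItem),
]
print(page_cls_for_item(rules, ProductItem, "https://example.com/p/1"))
```

The same item class resolves to different page objects for different URLs, which is why the URL has to be part of this API.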

There are several combinations of scenarios for this type of API.

Page object setup

A. The page object has all fields using the `@field` decorator

This is quite straightforward to support since we can easily do:

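from web_poet.fields import get_fields_dict, item_from_fields_sync
from web_poet.utils import ensure_awaitable


# Sketch only; the function name is illustrative.
async def partial_item_from_fields(page_obj, include_fields):
    # Keep only the requested names that are declared via @field.
    field_names = [name for name in get_fields_dict(page_obj) if name in include_fields]
    # Values of async fields come back as awaitables at this point.
    item_dict = item_from_fields_sync(page_obj, item_cls=dict, skip_nonitem_fields=False)
    return page_obj.item_cls(
        **{name: await ensure_awaitable(item_dict[name]) for name in field_names}
    )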

We basically derive all the fields from the page object and call them one-by-one.

B. The page object doesn't use the `@field` decorator but solely utilizes the `.to_item()` method

Alternatively, the page object can be defined as:

class ProductPage(WebPage[ProductItem]):
    def to_item(self) -> ProductItem:
        return ProductItem(x=1, y=2, z=3)

The methodology mentioned in scenario A above won't work here since calling get_fields_dict(page_obj) would result in an empty dict.

Instead, we can simply call the page object's .to_item() method and just include/exclude the given fields from there.
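A minimal sketch of that fallback. To keep it self-contained, `ProductItem` is a standard-library dataclass standing in for an attrs item and `ProductPage` is a plain class standing in for `WebPage[ProductItem]`; `partial_item_via_to_item` is a hypothetical name:

```python
from dataclasses import dataclass, fields
from typing import Iterable, Optional


@dataclass
class ProductItem:  # stand-in for an attrs item with optional fields
    x: Optional[int] = None
    y: Optional[int] = None
    z: Optional[int] = None


class ProductPage:  # stand-in for WebPage[ProductItem]
    item_cls = ProductItem

    def to_item(self) -> ProductItem:
        return ProductItem(x=1, y=2, z=3)


def partial_item_via_to_item(page, include_fields: Iterable[str]):
    """Hypothetical: build the full item, then keep only the requested fields."""
    full_item = page.to_item()
    wanted = set(include_fields)
    kwargs = {
        f.name: getattr(full_item, f.name)
        for f in fields(full_item)
        if f.name in wanted
    }
    # Fields that were not requested fall back to the item's defaults.
    return page.item_cls(**kwargs)


print(partial_item_via_to_item(ProductPage(), ["x", "y"]))
# ProductItem(x=1, y=2, z=None)
```

Every field is still computed inside `.to_item()`; the filtering only affects what ends up in the returned item.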

C. The page object has some fields using the `@field` decorator while some fields are populated inside the `.to_item()` method

class ProductPage(WebPage[ProductItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> ProductItem:
        return ProductItem(x=self.x, y=2, z=3)

This scenario is much harder since calling get_fields_dict(page_obj) would result in a non-empty dict: {'x': FieldInfo(name='x', meta=None, out=None)}.

We could try to check whether the page object has overridden the .to_item() method by comparing page_obj.__class__.to_item against web_poet.ItemPage.to_item (they differ when it's overridden). However, we still can't tell whether the override adds any new fields or merely adds some post-processing or validation operations. Either way, the resulting field value from the .to_item() method (if it's overridden) could be totally different from what calling the field directly returns.
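The override check could look like the sketch below. `ItemPage` here is a local stand-in for `web_poet.ItemPage` so the snippet runs on its own, and `overrides_to_item` is a hypothetical helper name:

```python
class ItemPage:  # local stand-in for web_poet.ItemPage
    def to_item(self):
        return {}


class DefaultPage(ItemPage):
    """Relies on the inherited to_item()."""


class CustomPage(ItemPage):
    def to_item(self):
        return {"x": 1}


def overrides_to_item(page) -> bool:
    """Hypothetical check: True if the page's class (or a parent other than
    ItemPage) defines its own to_item()."""
    return type(page).to_item is not ItemPage.to_item


print(overrides_to_item(DefaultPage()))  # False
print(overrides_to_item(CustomPage()))   # True
```

This only reveals that an override exists, not what it does, so it cannot distinguish an override that adds fields from one that merely validates.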

We could also detect this scenario whenever some fields specified in include_fields=[...] or exclude_fields=[...] are not present in get_fields_dict(page_obj). If so, we can simply call the .to_item() method and include/exclude fields from there.

However, this is wasteful: some fields could be expensive (e.g. requiring additional requests), which is often why the user wants to exclude them in the first place. Yet they would still be computed unintentionally via the .to_item() method.

In this case, we'd rely on the page object developer to design their page objects well and ensure that our docs highlight this caveat.

Still, there's the question of how to handle fields specified in include_fields=[...] or exclude_fields=[...] that don't exist at all. Let's tackle this in the sections below (Spoiler: it'd be great to not support scenario C).

Handling field presence

I. Some fields specified in `include_fields=[...]` do not exist

An example would be:

@attrs.define
class SomeItem:
    x: Optional[int] = None
    y: Optional[int] = None
    
class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    @field
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_fields=["y", "z"])

For this case, we can simply ignore producing the z field value since the page object does not support it.

Moreover, if none of the given fields exist, e.g. partial_item_from_page_obj(some_page, include_fields=["z"]), an empty item would be returned.
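The behaviour described for scenario I can be sketched as follows. This is a self-contained approximation: `SomeItem` is a standard-library dataclass instead of an attrs class, `SomePage.declared_fields` stands in for the names that `get_fields_dict(page_obj)` would report, and `partial_item` is a hypothetical name:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SomeItem:
    x: Optional[int] = None
    y: Optional[int] = None


class SomePage:
    item_cls = SomeItem
    # Stand-in for the names reported by get_fields_dict(page_obj).
    declared_fields = ("x", "y")

    @property
    def x(self) -> int:
        return 1

    @property
    def y(self) -> int:
        return 2


def partial_item(page, include_fields: Iterable[str]):
    """Hypothetical: compute only requested fields that the page declares."""
    wanted = set(include_fields)
    names = [name for name in page.declared_fields if name in wanted]
    return page.item_cls(**{name: getattr(page, name) for name in names})


print(partial_item(SomePage(), ["y", "z"]))  # SomeItem(x=None, y=2)
print(partial_item(SomePage(), ["z"]))       # SomeItem(x=None, y=None)
```

The unknown name `z` is dropped before any field is accessed, so it neither raises nor triggers extra work.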

Note that this could be related to scenario C above, so we have to be careful: a given field might be produced without using the @field decorator.

class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> SomeItem:
        return SomeItem(x=1, y=2)
        
partial_item_from_page_obj(some_page, include_fields=["y"])

Because of these types of scenarios, it'd be hard to fully trust deriving the fields from a page object via fields = get_fields_dict(page_obj).

SOLUTION 1: we can make it clear to our users via our docs that we will only call .to_item() if the page object explicitly doesn't use any @field decorators. This means that we won't be supporting scenario C at all.

II. Some fields specified in `exclude_fields=[...]` do not exist

This is the same case as scenario I: we can simply ignore non-existing fields.

However, it has the same problem with supporting .to_item() for scenario C, since some fields might use the @field decorator while the rest are produced via the .to_item() method.

To err on the side of caution, we could simply call .to_item() and then remove the fields declared in exclude_fields=[...]. Better yet, go with SOLUTION 1 above as well.

III. No fields were given in `include_fields=[...]`

For this, we could simply return an item with empty fields.

If any required fields end up missing, we simply let Python raise the error: TypeError: __init__() missing 1 required positional argument: ....

IV. No fields were given in `exclude_fields=[...]`

We could return the item with all of its fields, which basically amounts to calling .to_item().

Item setup

1. The item has all fields marked as `Optional`

There's no issue with this since including or excluding fields won't result in errors like TypeError: __init__() missing 1 required positional argument: ....

All of the examples above have this item setup.

2. The item has fields marked as required

For example:

@attrs.define
class SomeItem:
    x: int
    y: int
    
class SomePage(ItemPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1

    @field
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_fields=["x"])

Unlike in scenario 1, this results in TypeError: __init__() missing 1 required positional argument: ... since the y field is required.

One solution is to allow overriding the item class that the page object returns with one that omits the unwanted required fields. Here's an example:

@attrs.define
class SomeSmallerItem:
    x: int
    
partial_item_from_page_obj(some_page, include_fields=["x"], item_cls=SomeSmallerItem)
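A self-contained sketch of both the failure and the `item_cls` override. The items are standard-library dataclasses rather than attrs classes, `SomePage.declared_fields` stands in for the names from `get_fields_dict(page_obj)`, and `partial_item` is a hypothetical stand-in for the proposed `partial_item_from_page_obj`:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SomeItem:
    x: int
    y: int


@dataclass
class SomeSmallerItem:
    x: int


class SomePage:
    item_cls = SomeItem
    # Stand-in for the names reported by get_fields_dict(page_obj).
    declared_fields = ("x", "y")

    @property
    def x(self) -> int:
        return 1

    @property
    def y(self) -> int:
        return 2


def partial_item(page, include_fields: Iterable[str], item_cls: Optional[type] = None):
    """Hypothetical: build item_cls (or the page's own item class) from the
    requested fields only."""
    target_cls = item_cls if item_cls is not None else page.item_cls
    wanted = set(include_fields)
    values = {n: getattr(page, n) for n in page.declared_fields if n in wanted}
    return target_cls(**values)


try:
    partial_item(SomePage(), ["x"])
except TypeError as exc:
    print("failed:", exc)  # y is required by SomeItem

print(partial_item(SomePage(), ["x"], item_cls=SomeSmallerItem))
# SomeSmallerItem(x=1)
```

The override keeps the page object untouched; only the class used to assemble the result changes.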

The other API could be:

partial_item_from_item_cls(
    SomeItem, include_fields=["x"], url="https://example.com", replace_item_cls=SomeSmallerItem,
)

Summary

We only support page object setups of scenarios A and B while support for C is dropped.

This makes the field-presence handling in scenarios I and II easier. Scenarios III and IV should work whatever the case may be.

The item setup for scenario 1 is straightforward while scenario 2 needs a way to replace/override the item that a page object returns with a smaller version.

Jan 20 '23 12:01 BurnzZ
