web-poet icon indicating copy to clipboard operation
web-poet copied to clipboard

Guidelines for using item_cls_fields: True or False by default?

Open kmike opened this issue 1 year ago • 2 comments

@gatufo has some concerns about the current guidelines:

https://github.com/scrapinghub/web-poet/pull/53#issuecomment-1203902479

For me when combining both Item and Page Objects, Item defines the fields that what you want to extract, and Page Objects become just providers of those fields. That's why I think the default value should be True here to allow (or encourage) this use by default.

kmike avatar Aug 03 '22 12:08 kmike

I think that the current guidelines, which encourage using item_cls_fields=False by default in some scenarios, have a practical benefit: they allow to detect more errors. Unless there are some practical issues with this approach I'm -1 to change the current guidelines, but I might be missing something (as often happens :).

What are the scenarios where the current guidelines would lead to worse code, or more undetected errors?

kmike avatar Aug 03 '22 12:08 kmike

I'm inclined with having item_cls_fields=False as the default because it throws off an explicit error for undefined fields.

However, let's explore this in the perspective of POs as simply "filling up" items with values.

Suppose we have the following items with the corresponding fields:

  • Item A having these fields: foo, bar, zen
  • Item B having these fields: foo, bar,

Moreover, let's suppose that we have a PO with the following methods defined with the @field decorator:

  • foo
  • bar
  • extra

Having item_cls_fields=False would lead to these errors:

  1. __init__() got an unexpected keyword argument 'extra'.
  2. Item A having __init__() missing 1 required positional argument: 'zen'.

Setting item_cls_fields=True wouldn't raise an exception at all for (1). This is convenient because:

  • Both Item A and Item B might NOT want to extract the field named extra. This could be useful if such a field requires an additional request. If the error is raised by the item_cls_fields=False default, then it might encourage developers to define the extra field in the item to make the error go away. They have the option to override item_cls_fields=True but that's another extra step. It might occur to them that they could just define the extra field just in case they'd need it sometime in the future (without a proper idea how much it costs to extract that field).

But it would still raise an exception for (2), unless it's defined as zen: Optional[str] = None.


If we set item_cls_fields=False as the default, it'd error out on the first run if the items have missing fields like in point (1). For me, this would be okay since it forces users to contemplate if this is indeed the expected behavior or should users switch to item_cls_fields=True for the specific use case. This is consistent with the points mentioned in https://web-poet.readthedocs.io/en/latest/advanced/fields.html#error-prevention.

Conversely, if we set item_cls_fields=True, an exception like in point (1) wouldn't be raised at all. Users could miss out on the fact that some field from the PO hasn't been utilized. Most probably, item_cls_fields=False wouldn't come up to the user's mind since no warning or exceptions were raised at all.

If we're using the POs to fill-up item classes with values, then we need to ensure that it's indeed the expected behavior by explicitly overriding item_cls_fields=True.

TL;DR My vote is to keep item_cls_fields=False as the default.

BurnzZ avatar Aug 04 '22 07:08 BurnzZ

It's released with skip_nonitem_fields as False by default.

kmike avatar Sep 21 '22 18:09 kmike