
Web scraping Page Objects core library

Results 38 web-poet issues

Hey! I’m exploring how to make the following work:

```py
from zyte_common_items import Product
from web_poet import ItemPage, handle_urls


@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    # ...
```

In other words, how to...

discuss

As explained in https://github.com/scrapy/w3lib/issues/189 and https://github.com/scrapy/scrapy/issues/5601, a BOM should take precedence over the Content-Type header when detecting an encoding. Currently `web_poet.HttpResponse` prefers the Content-Type header: ```py import codecs import web_poet body =...
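To make the requested precedence concrete, here is a minimal standard-library sketch of "BOM first, Content-Type second". It is not web-poet's or w3lib's implementation; the names `charset_from_content_type`, `detect_encoding` and `decode_body`, and the UTF-8 fallback, are assumptions made for illustration.

```python
import codecs
from email.message import Message
from typing import Optional

# BOM signatures, longest first, so that a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def detect_encoding(body: bytes, content_type: Optional[str] = None) -> str:
    """Prefer the BOM; fall back to the header, then to UTF-8 (assumed default)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return charset_from_content_type(content_type) or "utf-8"


def decode_body(body: bytes, content_type: Optional[str] = None) -> str:
    """Decode the body with the detected encoding, dropping the BOM if present."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return body[len(bom):].decode(name)
    # An unknown charset name in the header would raise LookupError here.
    return body.decode(detect_encoding(body, content_type), errors="replace")
```

With this ordering, a body that starts with a UTF-8 BOM is decoded as UTF-8 even when the header claims `charset=latin-1`; the header only matters when no BOM is present.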

This feature allows declaring a field as `@field(extra=True)`. Such fields are ignored if the item type doesn't have them. It allows defining fields in the base class, which...
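The "ignored if the item type doesn't have them" behaviour can be illustrated without web-poet at all. The sketch below uses plain dataclasses; `Product`, `ProductWithBrand` and `build_item` are hypothetical names, and this is not how the proposed `@field(extra=True)` is implemented.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Set, Type


@dataclass
class Product:
    name: str
    price: str


@dataclass
class ProductWithBrand(Product):
    brand: str = ""


def build_item(item_cls: Type, extracted: Dict[str, Any], extra: Set[str]):
    """Build an item, skipping 'extra' fields that the item type lacks.

    A field the item type lacks and that is NOT marked as extra is
    treated as an error (an assumption of this sketch).
    """
    known = {f.name for f in fields(item_cls)}
    kwargs = {}
    for name, value in extracted.items():
        if name in known:
            kwargs[name] = value
        elif name not in extra:
            raise TypeError(f"{item_cls.__name__} has no field {name!r}")
        # Otherwise: an extra field missing from the item type is dropped.
    return item_cls(**kwargs)
```

Here the same extracted data can feed both item types: `brand` is kept for `ProductWithBrand` and silently dropped for `Product`, as long as it is listed in `extra`.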

@gatufo has some concerns about the current guidelines: https://github.com/scrapinghub/web-poet/pull/53#issuecomment-1203902479 > For me when combining both Item and Page Objects, Item defines the fields that what you want to extract, and...

At the moment we kind of mix documentation for page object writers and for framework writers. I think it would be best to split the two at the root of...

documentation
enhancement

One neat feature inside Scrapy is its [LinkExtractors](https://github.com/scrapy/scrapy/blob/64905e3397a5b837312169a0b418857ef1cf40c7/scrapy/linkextractors/lxmlhtml.py) functionality. We usually try to use this whenever we want links to be extracted inside a given page. Inside **web-poet**, we can...

This PR attempts to scrutinize the idea as noted down in https://github.com/scrapinghub/web-poet/pull/42#issuecomment-1141831361. It's built on top of this PR's branch: https://github.com/scrapinghub/web-poet/pull/42

An experiment: what a w3lib-based URL class could look like. Again, a minimal version.

This is an experiment to see what "functional" Page Objects could look like, whether it's possible to implement them, and to get feedback about the idea.

### Background

Given the following PO structure below:

```python
import attr
from web_poet.pages import Injectable
from web_poet.page_inputs import ResponseData


@attr.define
class HTMLFromResponse(Injectable):
    response: ResponseData


@attr.define
class WebPage(Injectable):
    response: ResponseData


@attr.define...
```

enhancement