scrapy-zyte-api ZyteApiProvider could make an unneeded API request

In the example below, ZyteApiProvider makes 2 API requests instead of 1:

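import attrs
import scrapy
from scrapy_poet import DummyResponse
from web_poet import BrowserHtml, ItemPage, handle_urls
from zyte_common_items import Product

# MyItem is a custom item class defined elsewhere.

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...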

Jun 12 '23 10:06 kmike

Findings so far:

https://github.com/scrapinghub/scrapy-poet/pull/151 won’t fix this.
This issue seems to be caused by Zyte API-provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, Product is injected first, then MyItem resolves to MyPage, and only then is BrowserHtml injected. I am not sure yet how best to solve that.

Jun 15 '23 13:06 Gallaecio

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. https://github.com/scrapinghub/scrapy-poet/pull/151 is actually about a third request made in this or a similar use case.

Jun 15 '23 14:06 wRAR

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.

Jun 15 '23 15:06 wRAR

Indeed.

Jun 15 '23 15:06 Gallaecio

New finding: switching MyItem to MyPage works, even though there is still some level of indirection. That could explain why https://github.com/scrapinghub/scrapy-poet/pull/153 works.

Jun 20 '23 09:06 Gallaecio

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.
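
For anyone reproducing this, a minimal sketch of enabling that logging, assuming it goes into the project's settings.py (it could equally go into a spider's custom_settings):

```python
# settings.py (sketch): log the parameters of every Zyte API request
# that scrapy-zyte-api sends, so that duplicate requests become visible.
ZYTE_API_LOG_REQUESTS = True
```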

Given the following spider:

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

✅ The following callback setup is correct, since it results in only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following results in 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.
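
To illustrate what "present in the same request" means here, below is a rough sketch in plain Python. It is not how scrapy-zyte-api or scrapy-poet work internally; merge_zyte_api_params is a hypothetical helper that only shows the two logged parameter sets collapsing into one per URL:

```python
from typing import Any, Dict, Iterable, List


def merge_zyte_api_params(
    requests: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Combine parameter dicts that target the same URL (hypothetical helper)."""
    merged: Dict[str, Dict[str, Any]] = {}
    for params in requests:
        url = params["url"]
        # Start a new entry for an unseen URL, then fold the parameters in.
        merged.setdefault(url, {"url": url}).update(params)
    return list(merged.values())


# The two requests logged in Case 2.
observed = [
    {"browserHtml": True, "url": "https://books.toscrape.com"},
    {"productNavigation": True, "url": "https://books.toscrape.com"},
]
combined = merge_zyte_api_params(observed)
```

With the Case 2 input, combined holds a single dict that carries both browserHtml and productNavigation for the same URL.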

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then, the following callback setup results in 3 separate Zyte API requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs with:

def parse_nav(self, response, navigation: ProductNavigation):
    ...

Oct 03 '23 06:10 BurnzZ

I wonder if some of the unexpected requests are related to https://github.com/scrapy-plugins/scrapy-zyte-api/issues/135.

Oct 03 '23 07:10 Gallaecio

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

Jan 09 '24 12:01 BurnzZ

@BurnzZ so do you think after your latest analysis that case 2 still happens or not?

Jan 11 '24 17:01 wRAR

@wRAR I can still reproduce Case 2. 👍

Jan 12 '24 05:01 BurnzZ

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike
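
A rough sketch of what such a check could look like, in plain Python. This is only an assumption about the general idea, not the provider's actual code; combine_with_meta and its arguments are hypothetical names:

```python
from typing import Any, Dict


def combine_with_meta(
    meta: Dict[str, Any], provider_params: Dict[str, Any]
) -> Dict[str, Any]:
    """Fold parameters found in meta["zyte_api"] into the provider's own
    parameters (hypothetical helper, not part of scrapy-zyte-api)."""
    from_meta = meta.get("zyte_api")
    # meta["zyte_api"] may be missing or not a dict; only a dict is merged.
    combined = dict(from_meta) if isinstance(from_meta, dict) else {}
    combined.update(provider_params)
    return combined


meta = {"zyte_api": {"browserHtml": True}}
params = combine_with_meta(
    meta,
    {"productNavigation": True, "url": "https://books.toscrape.com"},
)
```

Here params ends up with browserHtml, productNavigation and url together, and meta itself is left unchanged.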

Jan 12 '24 12:01 wRAR

OTOH, even if we handle this in the provider, I'm not sure that would stop the request itself from being sent.

Jan 12 '24 13:01 wRAR

@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.

Jan 12 '24 15:01 kmike