scrapy-zyte-api
scrapy-zyte-api copied to clipboard
ZyteApiProvider could make an unneeded API request
In the example below ZyteApiProvide makes 2 API requests instead of 1:
@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
html: BrowserHtml
# ...
class MySpider(scrapy.Spider):
# ...
def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
# ...
Findings so far:
- https://github.com/scrapinghub/scrapy-poet/pull/151 won’t fix this.
- This issue seems to be caused by Zyte API provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, first Product is injected, then MyItem resolves to MyPage, then BrowserHtml is injected. I am not sure yet how to best solve that.
Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. https://github.com/scrapinghub/scrapy-poet/pull/151 is actually about a third request done in this or similar use case.
We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.
New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why https://github.com/scrapinghub/scrapy-poet/pull/153 works.
I looked into this further and it still occurs without any Page Objects involved.
The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.
Given the following spider:
class BooksSpider(scrapy.Spider):
name = "books"
def start_requests(self):
yield scrapy.Request(
url="https://books.toscrape.com",
callback=self.parse_nav,
meta={"zyte_api": {"browserHtml": True}},
)
Case 1
✅ The following callback set up is correct since it has only 1 request:
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
...
Case 2
❌ However, the following has 2 separate requests:
# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
...
This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.
Case 3
However, if we introduce a Page Object to the same spider:
@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
response: BrowserResponse
nav_item: ProductNavigation
@field
def url(self):
return self.nav_item.url
@field
def categoryName(self) -> str:
return f"(modified) {self.nav_item.categoryName}"
❌ Then, the following callback set up would have 3 separate Zyte API Requests:
# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
...
Note that the same series of 3 separate requests still occurs on:
def parse_nav(self, response, navigation: ProductNavigation):
...
I wonder if some of the unexpected requests are related to https://github.com/scrapy-plugins/scrapy-zyte-api/issues/135.
Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.
@BurnzZ so do you think after your latest analysis that case 2 still happens or not?
@wRAR I can still reproduce Case 2. 👍
OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike
OTOH I'm not sure if even we handle this in the provider the request itself won't be sent?
@wRAR Let's try to focus on how Case 2 (or any of these cases) affect https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.