
Pagination iterator doesn't work for APIs with token-based pagination

Open ilyazub opened this issue 3 years ago • 5 comments

For several APIs, parsing serpapi_pagination.next is the only way to update params_dict with correct values. Incrementing params.start won't work for Google Scholar Profiles, Google Maps, or YouTube.

https://github.com/serpapi/google-search-results-python/blob/ed7797c132d80613080b11b99f5b137bbeb5c3f5/serpapi/pagination.py#L26-L27

Google Scholar Profiles

The Google Scholar Profiles API has pagination.next_page_token instead of serpapi_pagination.next.

pagination.next is the next-page URI like https://serpapi.com/search.json?after_author=0QICAGE___8J&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity where after_author is set to next_page_token.
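For illustration, here's a minimal sketch of pulling after_author out of that next-page URI with only the standard library (the URL is just the example above):

    from urllib.parse import parse_qsl, urlsplit

    # Example pagination.next URI from a Google Scholar Profiles response
    next_url = ("https://serpapi.com/search.json?after_author=0QICAGE___8J"
                "&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity")

    # Parse the query string into a dict; after_author carries the next_page_token
    next_params = dict(parse_qsl(urlsplit(next_url).query))
    print(next_params["after_author"])  # -> 0QICAGE___8J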

Google Maps

In the Google Maps Local Results API there's only serpapi_pagination.next, with a URI like https://serpapi.com/search.json?engine=google_maps&ll=%4040.7455096%2C-74.0083012%2C14z&q=Coffee&start=20&type=search

YouTube

In the YouTube Search Engine Results API there's serpapi_pagination.next_page_token, similar to Google Scholar Profiles. serpapi_pagination.next is a URI with the sp parameter set to next_page_token.

@jvmvik What do you think about parsing serpapi_pagination.next in Pagination#__next__?

- self.start += self.page_size
+ self.client.params_dict.update(dict(parse.parse_qsl(parse.urlsplit(result['serpapi_pagination']['next']).query)))
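For context, a minimal sketch of what Pagination.__next__ could look like with that change (assuming, as in this package, that the paginator keeps a client with a get_dict() method and a params_dict; urllib.parse comes from the standard library):

    from urllib import parse

    def __next__(self):
        # Fetch the current page with the client's current parameters
        result = self.client.get_dict()

        # Stop when the API doesn't return a link to the next page
        next_link = result.get('serpapi_pagination', {}).get('next')
        if not next_link:
            raise StopIteration

        # Instead of incrementing `start`, copy the query parameters of the
        # next-page URI into the client, so token-based engines work too
        self.client.params_dict.update(
            dict(parse.parse_qsl(parse.urlsplit(next_link).query)))

        return result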

Here's an example of endless pagination of Google Scholar Authors (scraped 190 pages and manually stopped).

ilyazub commented Jun 15 '21 15:06

Good point, I was actually struggling to understand why pagination didn't work for scholar. A consistent API would be great! Thanks for sharing the snippet.

kikohs commented Jun 21 '21 11:06

I'd just roll out a new version of the package, taking your code above into account. My concern is that Google Search returns duplicated results when paginating. Should we build a filtering mechanism on the backend or on the client side? So far we're letting the user deal with this issue. My experiments show it's not SerpApi's fault: the search engine itself returns similar results from one page to the next. What do you think?

jvmvik commented Jul 26 '21 17:07

@jvmvik I wouldn't filter duplicated results on the SerpApi side.

ilyazub commented Jul 27 '21 06:07

Sorry for the long wait on this. I was working with a couple of wrong assumptions: 1- the start and num pagination parameters are always supported (or translated if needed); 2- start = f(x * num), i.e. the offset is always a multiple of the page size.

1- On top of the suggestion above, I can implement a mapping table per search engine.

    # set default
    self.start_key = "start"
    self.num_key = "num"
    self.end_key = "end"

    # override per search engine
    if engine == BAIDU_ENGINE:
      self.start_key = "pn"
      self.num_key = "rn"

So, response.next can be parsed for most search engines, except Google Scholar.

2- The way the offset is returned by SerpApi is not consistent. Google takes start=0, num=20

  • page 1 : start=0
  • page 2 : start=20
  • page 3 : start=40

see: tests/test_google_search.py#test_paginate in branch: 2.5.0

Ebay takes pn=0, rn=20

  • page 1 : start=2
  • page 2 : start=3
  • page 3 : start=4

see: tests/test_ebay_search.py#test_paginate in branch: 2.5.0

https://github.com/serpapi/google-search-results-python/pull/new/2.5.0
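To make the inconsistency above concrete, here is a small sketch of how a client would have to advance the pagination value differently per engine; the step rules are taken from the examples above, and the function name is just illustrative:

    # Two different stepping rules from the examples above:
    # - Google: offset-based, `start` grows by `num` each page (0, 20, 40, ...)
    # - Ebay:   page-number-based, the value grows by 1 each page
    def next_offset(engine, current, page_size):
        """Compute the next value of the pagination parameter (sketch)."""
        if engine == "ebay":
            return current + 1          # 2 -> 3 -> 4
        return current + page_size      # 0 -> 20 -> 40 (Google-style default)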

Could we improve the consistency between search engines on the backend or the client? Or do we even care? The user might not switch back and forth between search engines.

jvmvik commented Sep 13 '21 03:09

Could we improve the consistency between search engines on the backend or the client?

It depends on the target website, since we mirror its query parameters. But consistency on the SerpApi backend should be improved too. For example, the response for the google_scholar_profiles engine contains pagination but no serpapi_pagination.

Currently, a reliable way to consume pagination across all search engines on the client is to use result['serpapi_pagination']['next'].
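For example, a client-side loop along these lines (a sketch using this package's GoogleSearch client and its params_dict; the search parameters are the Scholar Profiles example from above, and the fallback to pagination covers engines that don't return serpapi_pagination):

    from urllib.parse import parse_qsl, urlsplit
    from serpapi import GoogleSearch

    # Illustrative parameters from the Google Scholar Profiles example above
    params = {
        "engine": "google_scholar_profiles",
        "mauthors": "label:security",
        "hl": "en",
        "api_key": "YOUR_API_KEY",  # placeholder
    }

    search = GoogleSearch(params)

    while True:
        result = search.get_dict()
        # ... process result ...

        # Some engines (e.g. google_scholar_profiles) return `pagination`
        # instead of `serpapi_pagination`
        pagination = result.get("serpapi_pagination") or result.get("pagination") or {}
        next_link = pagination.get("next")
        if not next_link:
            break

        # Feed the next page's query parameters back into the client
        search.params_dict.update(dict(parse_qsl(urlsplit(next_link).query)))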

ilyazub commented Sep 14 '21 09:06