google-search-results-python
Pagination iterator doesn't work for APIs with token-based pagination
For several APIs, parsing serpapi_pagination.next is the only way to update params_dict with the correct values. Incrementing params.start won't work for Google Scholar Profiles, Google Maps, or YouTube.
https://github.com/serpapi/google-search-results-python/blob/ed7797c132d80613080b11b99f5b137bbeb5c3f5/serpapi/pagination.py#L26-L27
Google Scholar Profiles
The Google Scholar Profiles API has pagination.next_page_token instead of serpapi_pagination.next. pagination.next is a next-page URI like https://serpapi.com/search.json?after_author=0QICAGE___8J&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity where after_author is set to next_page_token.
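The token can be recovered from the next-page URI with the standard library. A minimal sketch, using the example URI above:

```python
from urllib.parse import parse_qsl, urlsplit

# Next-page URI as returned in `pagination.next` for Google Scholar Profiles
# (taken from the example above).
next_url = (
    "https://serpapi.com/search.json?after_author=0QICAGE___8J"
    "&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity"
)

# Parse the query string into a dict of request parameters.
params = dict(parse_qsl(urlsplit(next_url).query))

print(params["after_author"])  # 0QICAGE___8J
print(params["engine"])        # google_scholar_profiles
```

Note that parse_qsl also URL-decodes values, so mauthors comes back as label:security.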
Google Maps
In the Google Maps Local Results API there's only serpapi_pagination.next, with a URI like https://serpapi.com/search.json?engine=google_maps&ll=%4040.7455096%2C-74.0083012%2C14z&q=Coffee&start=20&type=search
YouTube
In the YouTube Search Engine Results API there's serpapi_pagination.next_page_token, similar to Google Scholar Profiles. serpapi_pagination.next is a URI with the sp parameter set to next_page_token.
@jvmvik What do you think about parsing serpapi_pagination.next in Pagination#__next__?
- self.start += self.page_size
+ self.client.params_dict.update(dict(parse.parse_qsl(parse.urlsplit(result['serpapi_pagination']['next']).query)))
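In context, the one-line change above could look roughly like this. This is a sketch only: the client and fetch-method names are hypothetical, and it falls back to pagination for engines (like Google Scholar Profiles) that don't return serpapi_pagination:

```python
from urllib.parse import parse_qsl, urlsplit


class Pagination:
    """Sketch: advance by parsing the `next` URI instead of incrementing start."""

    def __init__(self, client):
        self.client = client

    def __iter__(self):
        return self

    def __next__(self):
        result = self.client.get_dict()  # hypothetical fetch method
        pagination = result.get("serpapi_pagination") or result.get("pagination")
        if not pagination or not pagination.get("next"):
            raise StopIteration
        # Copy every query parameter of the `next` URI into the params for
        # the following request; this works for token-based engines too.
        self.client.params_dict.update(
            dict(parse_qsl(urlsplit(pagination["next"]).query))
        )
        return result


# Stub client simulating two responses from a token-paginated engine.
class StubClient:
    def __init__(self):
        self.params_dict = {"engine": "google_scholar_profiles"}
        self._pages = [
            {"pagination": {"next": "https://serpapi.com/search.json"
                                    "?after_author=0QICAGE___8J"
                                    "&engine=google_scholar_profiles"}},
            {"pagination": {}},  # no `next` -> iteration stops
        ]

    def get_dict(self):
        return self._pages.pop(0)


pages = list(Pagination(StubClient()))
print(len(pages))  # 1
```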
Here's an example of endless pagination of Google Scholar Authors (scraped 190 pages and manually stopped).
Good point, I was actually struggling to understand why pagination didn't work for scholar. A consistent API would be great! Thanks for sharing the snippet.
I'd like to roll out a new version of the package taking into account your code above. My concern is that Google Search returns duplicated results when paginating. Should we try to build a filtering mechanism on the backend or the client side? So far, we are letting the user deal with this issue. My experiments show that it's not SerpApi's fault; the search engine itself returns similar results from one page to another. What do you think?
@jvmvik I'd not filter duplicated results on the SerpApi side.
Sorry for the long wait on this. I was working with a couple of wrong assumptions: 1- page start and num are always supported / translated if needed; 2- start = f(x * num).
1- On top of the suggestion above, I can implement a mapping table per search engine.
# set default
self.start_key = "start"
self.num_key = "num"
self.end_key = "end"
# override per search engine
if engine == BAIDU_ENGINE:
self.start_key = "pn"
self.num_key = "rn"
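The same per-engine override could also be expressed as a lookup table rather than an if-chain; a sketch under the assumption that engines are identified by plain strings (only the Baidu keys come from the snippet above, the structure is hypothetical):

```python
# Per-engine pagination parameter names; "default" mirrors Google's
# start/num/end, other engines override what differs.
PAGINATION_KEYS = {
    "default": {"start_key": "start", "num_key": "num", "end_key": "end"},
    "baidu":   {"start_key": "pn",    "num_key": "rn",  "end_key": "end"},
}


def pagination_keys(engine):
    """Return the pagination parameter names for a given engine."""
    return PAGINATION_KEYS.get(engine, PAGINATION_KEYS["default"])


print(pagination_keys("baidu")["start_key"])   # pn
print(pagination_keys("google")["start_key"])  # start
```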
So, response.next can be parsed for most of the search engines, except Google Scholar.
2- The offset returned by SerpApi is not consistent. Google takes start=0, num=20:
- page 1 : start=0
- page 2 : start=20
- page 3 : start=40
see: tests/test_google_search.py#test_paginate in branch: 2.5.0
eBay takes pn=0, rn=20:
- page 1 : start=2
- page 2 : start=3
- page 3 : start=4
see: tests/test_ebay_search.py#test_paginate in branch: 2.5.0
https://github.com/serpapi/google-search-results-python/pull/new/2.5.0
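The two schemes from the lists above can be contrasted in a few lines: Google uses a zero-based result offset, while eBay's value behaves like a page index. A sketch using only the numbers quoted above:

```python
def google_start(page, num=20):
    # Google: zero-based result offset, start = (page - 1) * num
    return (page - 1) * num


def ebay_offset(page):
    # eBay (per the values listed above): a page index, not a result offset
    return page + 1


print([google_start(p) for p in (1, 2, 3)])  # [0, 20, 40]
print([ebay_offset(p) for p in (1, 2, 3)])   # [2, 3, 4]
```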
Could we improve the consistency between search engines on the backend or the client? Or do we even care? The user might not switch back and forth between search engines.
Could we improve the consistency between Search engines on the backend or the client ?
It depends on the target website, since we mirror their query parameters. But consistency on the SerpApi backend should be improved too. For example, the response for the google_scholar_profiles engine contains pagination but no serpapi_pagination.
Currently, a reliable way to consume pagination across all search engines on the client is to use result['serpapi_pagination']['next'].
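That approach can be wrapped in a small engine-agnostic loop. A sketch, where fetch is a hypothetical callable that takes a params dict and returns the parsed JSON response:

```python
from urllib.parse import parse_qsl, urlsplit


def paginate(fetch, params, max_pages=10):
    """Sketch: follow serpapi_pagination.next regardless of the engine."""
    for _ in range(max_pages):
        result = fetch(params)
        yield result
        next_url = (result.get("serpapi_pagination") or {}).get("next")
        if not next_url:
            break
        # Overwrite the request parameters with those of the `next` URI,
        # so token-based engines (after_author, sp, ...) work just like
        # offset-based ones (start, pn, ...).
        params.update(dict(parse_qsl(urlsplit(next_url).query)))


# Usage with a stub fetch simulating two pages:
responses = [
    {"serpapi_pagination": {"next": "https://serpapi.com/search.json?start=20&q=Coffee"}},
    {"serpapi_pagination": {}},
]
params = {"q": "Coffee"}
pages = list(paginate(lambda p: responses.pop(0), params))
print(len(pages))           # 2
print(params["start"])      # 20
```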