google-search-results-python
Google Scholar pagination skips result 20
When retrieving results from Google Scholar using the pagination() method, the first article on the second page of Google Scholar is always missing.
I think this is caused by the following snippet in the update() method of google-search-results-python/serpapi/pagination.py:
def update(self):
    self.client.params_dict["start"] = self.start
    self.client.params_dict["num"] = self.num
    if self.start > 0:
        self.client.params_dict["start"] += 1
This seems to mean that for every page except the first, the paginator increases start by 1. With page_size=20, the first page requests results 0 through 19, but the second page requests results 21 through 40, skipping result 20.
If I delete the if statement, the code seems to work as intended and I get result 19 back.
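The arithmetic can be checked without calling the API. The sketch below is a simplified stand-in for the quoted update() logic, not the library's code: `requested_offsets` and `missing_offsets` are hypothetical helpers, and the sketch assumes that the paginator advances start by page_size for each page and that the API returns num results beginning at start.

```python
def requested_offsets(page_size, pages, bump_after_first):
    """Return the result offsets requested across `pages` pages.

    Hypothetical helper: mimics the quoted update() logic, assuming
    `start` advances by `page_size` for each page.
    """
    offsets = []
    start = 0
    for _ in range(pages):
        # Mirror the quoted snippet: add 1 to start on every page but the first
        request_start = start + 1 if (bump_after_first and start > 0) else start
        offsets.extend(range(request_start, request_start + page_size))
        start += page_size
    return offsets


def missing_offsets(offsets):
    """Offsets skipped between the first and last requested result."""
    return sorted(set(range(offsets[0], offsets[-1] + 1)) - set(offsets))


print(missing_offsets(requested_offsets(20, 2, bump_after_first=True)))   # [20]
print(missing_offsets(requested_offsets(20, 2, bump_after_first=False)))  # []
```

Under those assumptions, the extra increment leaves a one-result gap at offset 20, and dropping it makes the two pages contiguous.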
@jvmvik Can you take a look?
@samuelhaysom Currently, the best approach is to use serpapi_pagination instead, as you also mentioned in issue #25. Once #30 is merged, the pagination() method will be the preferred one. Sorry for taking so long to reply.
if "next" in results.get("serpapi_pagination", {}):
    search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
else:
    break
For example:
# Google Scholar Search API
from serpapi import GoogleSearch
from urllib.parse import parse_qsl, urlsplit

params = {
    "api_key": "...",             # serpapi api key
    "engine": "google_scholar",   # search engine
    "q": "minecraft redstone",    # search query
    "hl": "en"                    # language
}

search = GoogleSearch(params)     # where data extraction happens

# to show page number
page_num = 0

# iterate over all pages
results_is_present = True
while results_is_present:
    results = search.get_dict()   # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    page_num += 1
    print(f"Current page: {page_num}")

    # iterate over organic results and extract the data
    for result in results.get("organic_results", []):
        print(result.get("position"), result.get("title"), sep="\n")

    # check if the next page key is present in the JSON
    # if present -> split URL in parts and update to the next page
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        break
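To see what the last step does on its own, here is a minimal sketch that uses only the standard library. `next_page_params` is a hypothetical helper name, and the URL is made up for illustration; real values come from the "next" key under "serpapi_pagination" in the response.

```python
from urllib.parse import parse_qsl, urlsplit


def next_page_params(next_url):
    """Turn the query string of a `next` URL into a params dict (hypothetical helper)."""
    return dict(parse_qsl(urlsplit(next_url).query))


# Made-up URL for illustration only; not taken from a real API response
example_next = (
    "https://serpapi.com/search.json"
    "?engine=google_scholar&q=minecraft+redstone&hl=en&start=10"
)

params = next_page_params(example_next)
print(params)
# {'engine': 'google_scholar', 'q': 'minecraft redstone', 'hl': 'en', 'start': '10'}
```

Note that parse_qsl returns every value as a string, so "start" comes back as '10' rather than 10. Because the example above passes the whole dict to params_dict.update(), the next request reuses whatever offset the API itself put in the URL, so no client-side offset arithmetic is involved.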