Facing a "Too Many Requests" issue with googlesearch.
With concurrent requests to googlesearch, we receive the following:
642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: HTTP Error 429: Too Many Requests
Any idea how to add a proxy to the Google search?
Can you share the code, please?
@VinciGit00
You can reproduce the issue with the sample code below; it is the same issue we face when using ScrapegraphAI with multiple requests:
import concurrent.futures
from googlesearch import search

def fetch_url(query):
    return list(search(query, stop=10))

def main():
    query = "Weather in Pakistan"
    batch_size = 50
    res = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        future_to_url = {executor.submit(fetch_url, query): i for i in range(batch_size)}
        for future in concurrent.futures.as_completed(future_to_url):
            try:
                urls = future.result()
                res.append(urls)
            except Exception as e:
                print(f"Error fetching data: {e}")
    return res

if __name__ == "__main__":
    result = main()
    print(len(result))
We need a proxy to avoid the too-many-requests issue.
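As a stopgap before proxy support lands, the 429s can also be softened by retrying with exponential backoff instead of failing the whole batch. A minimal sketch (the `flaky_search` stub below is only a stand-in for the real `googlesearch.search` call, so the example runs without hitting Google):

```python
import random
import time
from urllib.error import HTTPError


def search_with_backoff(do_search, retries=4, base_delay=2.0):
    """Call do_search(), retrying on HTTP 429 with exponential backoff."""
    for attempt in range(retries):
        try:
            return do_search()
        except HTTPError as e:
            if e.code != 429 or attempt == retries - 1:
                raise  # not rate-limited, or out of retries
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Stub standing in for the real search: fails twice with 429, then succeeds.
calls = {"n": 0}

def flaky_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPError("https://google.com", 429, "Too Many Requests", None, None)
    return ["https://example.com/result"]
```

In the repro above, each worker would wrap its call as `search_with_backoff(lambda: list(search(query, stop=10)))`; this does not remove the rate limit, it just spaces the workers out until Google stops returning 429.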
OK, but how do you integrate it with ScrapegraphAI?
@VinciGit00 Basically, in ScrapegraphAI we are using Google search, but we need to replace it with the following so that the proxy can be passed as an input parameter:
Package: googlesearch-python
from googlesearch import search
search(query, num_results=max_result, proxy=proxy)
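If a single proxy still gets rate-limited, a pool of proxies can be rotated round-robin across calls. A minimal sketch, assuming the `proxy` keyword accepted by googlesearch-python; the proxy URLs and the `proxied_search` helper are illustrative placeholders, not part of ScrapegraphAI:

```python
from itertools import cycle

# Placeholder proxy URLs -- replace with real proxies.
PROXIES = cycle([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])


def proxied_search(query, num_results=10, search_fn=None):
    """Run one search through the next proxy in the pool.

    search_fn defaults to googlesearch.search; it is injectable for testing.
    """
    if search_fn is None:
        from googlesearch import search as search_fn
    return search_fn(query, num_results=num_results, proxy=next(PROXIES))
```

Each call pulls the next proxy from the cycle, so consecutive requests leave from different IPs; with `itertools.cycle` the pool wraps around indefinitely.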
OK, I will update it.
OK, please update it in the new beta.