Duck Duck Go Search

Open prairie-guy opened this issue 5 years ago • 0 comments

It would be great to have DuckDuckGo (DDG) implemented within the icrawler framework. I created my own script, based on other code (attribution provided below). My code does not conform to the icrawler framework style. It does nothing more than search for images on DDG and print their URLs. I’ve looked through the icrawler framework, and I’m not proficient enough to implement it in that style. If you like, I could put something together as a pull request that would provide a minimally viable DDG engine within the framework. Alternatively, I am posting the code here in case someone else wants to implement it themselves:


### image_search_ddg.py
### C. Bryan Daniels
### 9/1/2020
### Adapted from https://github.com/deepanprabhu/duckduckgo-images-api
###

import requests, re, json, time, sys

headers = {'authority':'duckduckgo.com','accept':'application/json,text/javascript,*/*; q=0.01','sec-fetch-dest':'empty',
        'x-requested-with':'XMLHttpRequest',
        'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site':'same-origin','sec-fetch-mode':'cors','referer':'https://duckduckgo.com/','accept-language':'en-US,en;q=0.9'}

def image_search_ddg(keywords,max_n=100):
    """Search for 'keywords' with DuckDuckGo and print the unique URLs of up to 'max_n' images"""
    url = 'https://duckduckgo.com/'
    params = {'q':keywords}
    res = requests.post(url,data=params)
    # The image endpoint needs the 'vqd' token embedded in the search page
    searchObj = re.search(r'vqd=([\d-]+)&',res.text)
    if not searchObj: print('Token Parsing Failed !'); return
    params = (('l','us-en'),('o','json'),('q',keywords),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
    requestUrl = url + 'i.js'
    urls = []
    retries = 5  # total failure budget, so a persistent error cannot loop forever
    while retries > 0:
        try:
            res = requests.get(requestUrl,headers=headers,params=params)
            data = json.loads(res.text)
        except (requests.RequestException, ValueError):
            # Network error or non-JSON body: wait briefly, then retry the same page
            retries -= 1
            time.sleep(1)
            continue
        for obj in data.get('results',[]):
            urls.append(obj['image'])
            max_n = max_n - 1
            if max_n < 1: return print_uniq(urls)
        if 'next' not in data: return print_uniq(urls)
        requestUrl = url + data['next']
    print_uniq(urls)  # retries exhausted: print whatever was collected

def print_uniq(urls):
    for url in set(urls):
        print(url)

if __name__ == "__main__":
    if len(sys.argv) == 2: image_search_ddg(sys.argv[1])
    elif len(sys.argv) == 3: image_search_ddg(sys.argv[1],int(sys.argv[2]))
    else: print("usage: python image_search_ddg.py keywords [max_n]")
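
For anyone adapting this to a framework, it may help to separate the parsing from the network calls, so that the token extraction and the URL extraction can be tested offline and the URLs are returned instead of printed. Below is a minimal sketch of that idea. It assumes the same response shape the script above relies on (a `results` list whose items have an `image` key, plus an optional `next` key). The names `extract_vqd` and `extract_image_urls` are hypothetical helpers, not part of icrawler or of any DuckDuckGo API, and the sample data is hand-written rather than a real response.

```python
import json
import re

# Same token pattern as the script above: "vqd=" followed by digits and dashes, then "&"
VQD_PATTERN = re.compile(r'vqd=([\d-]+)&')

def extract_vqd(html):
    """Return the vqd token found in the search page text, or None if absent."""
    match = VQD_PATTERN.search(html)
    return match.group(1) if match else None

def extract_image_urls(payload, seen=None):
    """Parse one JSON page and return (new_urls, next_path).

    'seen' is an optional set of URLs already collected; it is updated in
    place so duplicates are skipped across pages while first-seen order is kept.
    Assumes a 'results' list of items with an 'image' key and an optional 'next' key.
    """
    seen = set() if seen is None else seen
    data = json.loads(payload)
    new_urls = []
    for item in data.get('results', []):
        image_url = item.get('image')
        if image_url and image_url not in seen:
            seen.add(image_url)
            new_urls.append(image_url)
    return new_urls, data.get('next')

if __name__ == "__main__":
    # Hand-written sample data, not a real DuckDuckGo response
    sample_html = "nrj('/d.js?q=cats&vqd=3-1234-5678&p=1');"
    sample_page = json.dumps({
        "results": [{"image": "https://example.com/a.jpg"},
                    {"image": "https://example.com/b.jpg"},
                    {"image": "https://example.com/a.jpg"}],
        "next": "i.js?q=cats&s=100",
    })
    print(extract_vqd(sample_html))
    print(extract_image_urls(sample_page))
```

Returning the URLs together with the `next` path leaves the paging loop and the download step to the caller, which is roughly the split a crawler framework would want between its parsing and downloading stages.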

prairie-guy, Sep 02 '20 00:09