Search-Engines-Scraper

Feature request Yandex and Baidu

Open LeoJavaAI opened this issue 4 years ago • 2 comments

Thanks for your work. Please consider adding Yandex and Baidu if possible.

LeoJavaAI, Apr 17 '21 12:04

Sounds interesting, I'll see what I can do. I think Yandex is simple enough, but I don't know if we can scrape Baidu without Selenium and I'd like to avoid that.

tasos-py, Apr 21 '21 22:04

After some research, I don't think I can add Yandex or Baidu. Yandex keeps giving me a captcha after a couple of requests. Maybe Selenium could help with that, but I want to keep this repo as simple as possible, so I'd rather not add browser automation or OCR dependencies.

Baidu doesn't require Selenium. The problem is that it doesn't expose direct links; the result links look like this: www.baidu.com/link?url=kh39xCQVnS7frJSxGrpfLAXdudtflGhAhAK8YjhSgpwyf0Sl8L41EGODywKx6Vvqy8UbcOnNGkuEntr1m9KLmq. The url= parameter looks like a Base64 string, but it doesn't decode to text, and I don't think the decoding/decryption happens on the client side; the server redirects to the final link. We could request each link from the server to get the actual URLs, but that would be very inefficient and would probably result in bans.
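
The Base64 observation can be checked with a short sketch. The helper names `extract_token` and `decodes_to_text` are hypothetical and not part of this repo, and treating the token as URL-safe Base64 that would hold UTF-8 text is an assumption made only for the check:

```python
import base64
import binascii
from urllib.parse import parse_qs, urlsplit


def extract_token(link: str) -> str:
    """Return the value of the url= query parameter (hypothetical helper)."""
    query = urlsplit(link).query
    return parse_qs(query)["url"][0]


def decodes_to_text(token: str) -> bool:
    """Return True if the token is URL-safe Base64 that decodes to UTF-8 text."""
    # Base64 input must be a multiple of 4 characters, so restore stripped padding.
    padded = token + "=" * (-len(token) % 4)
    try:
        base64.urlsafe_b64decode(padded).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    return True


# The example link quoted in the comment above.
link = (
    "www.baidu.com/link?url="
    "kh39xCQVnS7frJSxGrpfLAXdudtflGhAhAK8YjhSgpwyf0Sl8L41EGODywKx6Vvqy8UbcOnNGkuEntr1m9KLmq"
)
token = extract_token(link)
print(decodes_to_text(token))  # prints False
```

For this token the decoded bytes are not valid UTF-8, which matches the observation that the parameter does not decode to a readable URL. That alone does not show how the token is produced; it only rules out plain Base64-encoded text.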

So I don't know how to proceed further. If you have any ideas, I'd love to hear them.

tasos-py, Apr 28 '21 07:04