webcrawler
Web crawler to download pictures from zhihu.com
Traceback (most recent call last):
  File "F:/PY/20171006webdriber.py", line 88, in <module>
    main()
  File "F:/PY/20171006webdriber.py", line 52, in main
    girls.write(result_bf)
UnicodeEncodeError: 'gbk' codec can't encode character '\u2207' in position 86907: illegal multibyte...
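The traceback shows `girls.write(result_bf)` failing because the file was opened with the platform default codec (`gbk` here). A minimal sketch of one common workaround, assuming Python 3 and a file opened in text mode with `open()`: pass the encoding explicitly. The helper name `save_page` and the demo file name are hypothetical and not taken from the original script.

```python
import os
import tempfile


def save_page(path, text):
    """Write text as UTF-8 regardless of the platform's default codec."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which is what produced the 'gbk' error in the traceback.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    sample = "gradient symbol: \u2207"
    target = os.path.join(tempfile.gettempdir(), "zhihu_page_demo.html")
    save_page(target, sample)
    # Read the file back with the same encoding to confirm the round trip.
    with open(target, encoding="utf-8") as handle:
        print(handle.read() == sample)
```

The same `encoding="utf-8"` argument has to be used when the file is read back later, otherwise the default codec is applied again on the reading side.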
I am not familiar with bs4. What is this error about? Also, what is 'chromedrive'?

```bash
Traceback (most recent call last):
  File "girls_crawler_py27.py", line 87, in <module>
    main()...
```
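1. Zhihu's page has been redesigned: there is no longer a "view more" button. Instead, more content loads dynamically as you scroll down, so the step that clicks "more" in the execute_times() function should be removed.
2. When writing the file, omitting encoding='utf-8' raises an error.
3. On Python 3.7, when reading a node's inner content, noscript_inner = noscript.get_text() returns an empty string. Use noscript_inner = str(noscript) instead to convert the node directly to its string form.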
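Point 3 above leaves you with the `<noscript>` node as a plain string. How the original script parses that string afterwards is not shown here, so the following is only a sketch of one way to pull image URLs out of it. It swaps in the standard library's `html.parser` instead of bs4 to stay dependency-free; `ImgSrcCollector`, `extract_img_urls`, and the example URL are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are also routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def extract_img_urls(noscript_html):
    """Return all <img> src values found in the given HTML string."""
    collector = ImgSrcCollector()
    collector.feed(noscript_html)
    collector.close()
    return collector.urls


if __name__ == "__main__":
    # Stand-in for the result of str(noscript); the URL is made up.
    noscript_inner = '<noscript><img src="https://example.com/a.jpg"/></noscript>'
    print(extract_img_urls(noscript_inner))
```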
Code update
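I noticed that Zhihu's HTML seems to have been updated. The former "View more answers" (查看更多回答) has become "View all answers" (查看全部回答), and the option now appears at both the top and the bottom of the page. Does your code need to be updated accordingly? (PS: I am on Windows 7.) The code does run and the pictures do download, but there is a small issue I would like to ask about.

    def wait_time(times):
        for i in range(times):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            try:
                driver.find_element_by_css_selector('button.QuestionMainAction').click()
                print("page" + str(i))
                time.sleep(1)
            except:
                break

    wait_time(5)

I changed this to:

    time.sleep(2)
    try:
        driver.find_element_by_css_selector('.QuestionMainAction').click()
        time.sleep(1)
        print('success')
    except:
        print('failure')

because it only needs to be clicked once...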
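The two variants above can be combined: keep scrolling so that dynamically loaded answers appear, but click the expand button at most once. The sketch below assumes a Selenium WebDriver object is passed in. The function name `scroll_and_expand` and its `pause` parameter are hypothetical, and the `.QuestionMainAction` selector is taken from the snippet above and may no longer match Zhihu's current markup. It uses `find_element("css selector", ...)` rather than `find_element_by_css_selector`, since the latter was removed in recent Selenium releases.

```python
import time


def scroll_and_expand(driver, times, selector=".QuestionMainAction", pause=2.0):
    """Scroll to the bottom `times` times, clicking the expand button at most once.

    Returns True if the button was found and clicked, False otherwise.
    """
    clicked = False
    for _ in range(times):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more answers
        if not clicked:
            try:
                # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
                driver.find_element("css selector", selector).click()
                clicked = True
            except Exception:
                # Button missing or not clickable yet: keep scrolling and
                # try again on the next pass.
                pass
    return clicked
```

With a real browser session this would be called as `scroll_and_expand(driver, 5)`. Catching the broad `Exception` keeps the sketch free of Selenium imports; in real code, catching `NoSuchElementException` specifically would be tighter.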