ZhihuCrawler

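[No longer maintained] A Zhihu crawler that scrapes user profiles and answers. It is built on Selenium and Scrapy (mainly Scrapy) and uses random user agents and IPs (configuration required).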

Results 7 ZhihuCrawler issues

Bumps [cryptography](https://github.com/pyca/cryptography) from 3.2 to 3.3.2. Changelog Sourced from cryptography's changelog. 3.3.2 - 2021-02-07 * **SECURITY ISSUE:** Fixed a bug where certain sequences of ``update()`` calls when symmetrically encrypting very...

dependencies

Bumps [lxml](https://github.com/lxml/lxml) from 4.3.0 to 4.9.1. Changelog Sourced from lxml's changelog. 4.9.1 (2022-07-01) Bugs fixed A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note...

dependencies

Bumps [scrapy](https://github.com/scrapy/scrapy) from 1.5.1 to 2.6.2. Release notes Sourced from scrapy's releases. 2.6.2 Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0....

dependencies

Bumps [twisted](https://github.com/twisted/twisted) from 20.3.0 to 22.10.0. Release notes Sourced from twisted's releases. Twisted 22.10.0 (2022-10-30) This release contains a security fix for CVE-2022-39348. This is a low-severity security bug. Twisted...

dependencies

Bumps [certifi](https://github.com/certifi/python-certifi) from 2018.11.29 to 2022.12.7. Commits 9e9e840 2022.12.07 b81bdb2 2022.09.24 939a28f 2022.09.14 aca828a 2022.06.15.2 de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ... b8eb5e9 2022.06.15.1...

dependencies

For example, the URL of Liao Xuefeng's followee list: https://www.zhihu.com/api/v4/members/liaoxuefeng/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following\%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20 returns error 10002 with the message "请求参数异常,请升级客户端后重试" ("Abnormal request parameters; please upgrade the client and try again").
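To make the request in this report easier to read, here is a minimal sketch that assembles the same query string with `urllib.parse`. The helper name `followees_url` is hypothetical and not part of ZhihuCrawler. The sketch also assumes that the backslash before `%2Cbadge` in the reported URL is an escaping artifact and leaves it out. It only shows how the `include`, `offset` and `limit` parameters fit together; it does not address error 10002.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Field selector taken from the URL in the report, shown here in decoded form.
# Assumption: the backslash in the reported URL is an artifact and is omitted.
INCLUDE = (
    "data[*].answer_count,articles_count,gender,follower_count,"
    "is_followed,is_following,badge[?(type=best_answerer)].topics"
)


def followees_url(url_token, offset=0, limit=20):
    """Build a followee-list URL for one user (hypothetical helper)."""
    query = urlencode(
        {"include": INCLUDE, "offset": offset, "limit": limit},
        safe="*()",  # keep these characters unescaped, as in the reported URL
    )
    return f"https://www.zhihu.com/api/v4/members/{url_token}/followees?{query}"


url = followees_url("liaoxuefeng")
# Decoding the query again gives back the readable field selector.
decoded = parse_qs(urlsplit(url).query)
```

Changing `offset` in steps of `limit` would page through the list, assuming the endpoint accepts the request at all.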

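Ideas to consider: before the crawler is interrupted, save the pending links to a local file, and on the next start read the pending links back from that file and crawl them; use breadth-first search to analyze the follow graph; remove all '/' characters; ... to be continued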
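The first two ideas above (saving pending links before an interruption, and a breadth-first walk of the follow graph) could be sketched roughly as follows. This is an illustration under stated assumptions, not code from ZhihuCrawler: the names `Frontier`, `crawl` and `fetch_followees` are hypothetical, and `fetch_followees` stands in for whatever function returns the followees of a user.

```python
import json
from collections import deque
from pathlib import Path


class Frontier:
    """FIFO queue of pending links that can be saved to and restored from a file."""

    def __init__(self, path):
        self.path = Path(path)
        self.pending = deque()  # links waiting to be crawled, in BFS order
        self.seen = set()       # every link ever queued, to avoid repeats

    def add(self, link):
        if link not in self.seen:
            self.seen.add(link)
            self.pending.append(link)

    def pop(self):
        return self.pending.popleft()

    def save(self):
        state = {"pending": list(self.pending), "seen": sorted(self.seen)}
        self.path.write_text(json.dumps(state), encoding="utf-8")

    def load(self):
        if self.path.exists():
            state = json.loads(self.path.read_text(encoding="utf-8"))
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])


def crawl(frontier, fetch_followees, max_pages):
    """Breadth-first crawl; the frontier is written to disk even on interruption."""
    visited = []
    try:
        while frontier.pending and len(visited) < max_pages:
            user = frontier.pop()
            visited.append(user)
            for followee in fetch_followees(user):
                frontier.add(followee)
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl-C (KeyboardInterrupt).
        frontier.save()
    return visited
```

On startup the crawler would call `load()` before seeding the queue, so a previous run's pending links are picked up first. One limitation of this sketch: a link that was popped but not fully processed when the interruption hit is not re-queued. Scrapy also ships a built-in mechanism for pausing and resuming crawls through its `JOBDIR` setting, which may be a simpler route for the Scrapy-based parts.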