wks
Baidu Wenku Spider (百度文库爬虫): a downloader for Baidu Wenku documents.
Even with a member account's cookies, the page data still cannot be downloaded; a member account also has to click "Read full text" (阅读全文). How can this be worked around?
Would the author consider putting up a donation QR code?
The HTML download "succeeds", but what comes back is Baidu's security-check page ("正在进行安全检测", i.e. "performing a security check") with a redirect to seccaptcha.baidu.com instead of the document, so parsing then fails:

```
C:\wk\venv\Scripts\python.exe C:/wk/main.py https://wenku.baidu.com/view/2d0ce3490875f46527d3240c844769eae109a36e.html?fr=income2-doc-search&_wkts_=1705377009639&wkQuery=%E5%8D%8F%E8%AE%AE%E4%B9%A6 -C Cookies.txt
Download from https://wenku.baidu.com/view/2d0ce3490875f46527d3240c844769eae109a36e.html
https://wenku.baidu.com/view/2d0ce3490875f46527d3240c844769eae109a36e.html?edtMode=2
Download HTML...正在进行安全检测...window.location.href="https://seccaptcha.baidu.com/v1/webapi/verint/svcp.html?ak=M7bcdh2k6uqtYV5miaRiI8m8x6LIaONq&backurl=https%3A%2F%2Fwenku.baidu.com%2Fview%2F2d0ce3490875f46527d3240c844769eae109a36e.html%3FedtMode%3D2&ext=DA8V%2F7TOwHUYOU0BBeY%2FEuWe1GkkyFJcjaiTZnM7zg02LeoKw3Phw7u0k1rzxHAKJR0vWSFC7blfTgFhudZ8HZpIu%2ByYocN%2Bdlb3Sv7Lx4U%3D&subid=pcview_html_bfe&ts=1705408323&sign=124819d91f7a5a613b3b0c6fc17f0335"; Success.
Parse HTML...Error! It is not a Baidu Wenku document.
```
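On the code side, one possible mitigation (not part of wks; the helper below is a hypothetical sketch based only on the redirect visible in the log above) would be to check the downloaded HTML for the security-check page before parsing, and report that the cookies need refreshing rather than "It is not a Baidu Wenku document":

```python
# Hypothetical helper, not part of wks: detect Baidu's security-check page
# (the "正在进行安全检测" redirect to seccaptcha.baidu.com seen in the log above).
def is_security_check(html: str) -> bool:
    return "seccaptcha.baidu.com" in html or "正在进行安全检测" in html
```

When this returns True, a manual workaround worth trying is to open the document in a real browser with the same account, pass the security check there, re-export Cookies.txt, and retry.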
About the usage tutorial
Has any kind soul seen a usage tutorial for this program? I have only just started and don't know where many of the parameters should go. _Originally posted by @ahjdfeifei in https://github.com/BoyInTheSun/wks/issues/9#issuecomment-1179583660_
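For what it's worth, the invocations in the run logs elsewhere on this page follow the pattern below; the URL and output filename are placeholders, and only the flags that actually appear in those logs (`-C`, `-o`) are assumed to exist:

```shell
# -C  path to a cookies file exported from a logged-in browser session (Cookies.txt in the logs)
# -o  output filename, e.g. a .pdf
python main.py "https://wenku.baidu.com/view/<document-id>.html" -C cookies.txt -o output.pdf
```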
While downloading [案例分析--林肯电气公司的激励制度](https://wenku.baidu.com/view/967dcf3181d049649b6648d7c1c708a1294a0a71.html), I found that the font file for page 2, [font_csss](https://wkretype.bdimg.com/retype/pipe/967dcf3181d049649b6648d7c1c708a1294a0a71?pn=2&t=ttf&rn=1&v=6&md5sum=9441fec5acb7cb05f0ffaabb2103b9dc&range=54694-&sign=c1bb0c9bba), could not be fetched. I made two changes that solved the problem.

First, by capturing the traffic I found that the font file URL was wrong. Get the ID out of coverUrl here:

`cover = re.search(r'https://wkimg.bdimg.com/img/(.*?)\?', html).group(1)`

https://github.com/BoyInTheSun/wks/blob/b2ece163e1f0bee505d81f6f751ef7afef85f324/main.py#L105

Then replace `temp_dir` with `cover` here:

https://github.com/BoyInTheSun/wks/blob/b2ece163e1f0bee505d81f6f751ef7afef85f324/main.py#L166

This yields a working URL: [font_csss](https://wkretype.bdimg.com/retype/pipe/f1c1c7c10740be1e640e9a81?pn=2&t=ttf&rn=1&v=6&md5sum=9441fec5acb7cb05f0ffaabb2103b9dc&range=54694-&sign=c1bb0c9bba)

Second, I found that urllib could not fetch the data at this URL, but switching to requests worked:

```python
page = requests.get(url=fonts_csss[pagenums[i]], headers=headers)
raw = page.text
```

After these changes, the whole document downloads perfectly. requests feels much more pleasant to use than urllib.
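Putting the two changes together, a minimal sketch might look like the following. The function names are hypothetical, `html` is assumed to be the downloaded document page, and the URL-building step around main.py line 166 (where `temp_dir` is swapped for the cover ID) is left to the existing wks code:

```python
import re
import requests


def extract_cover_id(html: str) -> str:
    # Change 1: take the document ID from the cover image URL on the page,
    # to use in place of temp_dir when building the retype/pipe font URL.
    return re.search(r'https://wkimg.bdimg.com/img/(.*?)\?', html).group(1)


def fetch_font_css(font_url: str, headers: dict) -> str:
    # Change 2: fetch the font resource with requests instead of urllib,
    # which the issue reports fails for these URLs.
    resp = requests.get(url=font_url, headers=headers)
    return resp.text
```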
```
C:\Users\123\Desktop\wks-main>python main.py -C cookie.txt -o 1.pdf "https://wenku.baidu.com/view/8dc157a94b35eefdc8d33398.html"
Download from https://wenku.baidu.com/view/8dc157a94b35eefdc8d33398.html
Download HTML...Success.
Parse HTML...Success.
title: 霍尼韦尔DCS操作手册(通用)
Found pdf file, prepare for download...Success.
page: 1-45
Start downloading font(s)...
|=================================================>| 45 /...
```
An error occurred while downloading [this document](https://wenku.baidu.com/view/800fbae05b8102d276a20029bd64783e08127d6d.html); the output is as follows:

```
D:\wks>python main.py https://wenku.baidu.com/view/800fbae05b8102d276a20029bd64783e08127d6d.html -C cookies.txt
Download from https://wenku.baidu.com/view/800fbae05b8102d276a20029bd64783e08127d6d.html
Download HTML...Success.
Parse HTML...Success.
title: 气井产能确定方法归类总结
Found word file, prepare for download...Success.
Start downloading font(s)...
|=================================================>| 22 / 22 (100.00%)...
```