newspaper
I couldn't get any URLs from http://www.dw.com.
https://www.dw.com/zh/%E5%9C%A8%E7%BA%BF%E6%8A%A5%E5%AF%BC/s-9058?&zhongwen=simp
Please post your code.
import newspaper
from newspaper.configuration import Configuration

config = Configuration()
# Send a browser-like user agent with every request.
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
config.headers = HEADERS
url = 'https://www.dw.com/zh/%E5%9C%A8%E7%BA%BF%E6%8A%A5%E5%AF%BC/s-9058?&zhongwen=simp'
sina_paper = newspaper.build(url, config=config)
# Save the downloaded page so it can be inspected.
with open('index.html', 'w', encoding='utf-8') as f:
    f.write(sina_paper.html)
print(sina_paper.size())
The result is 2.
Thanks for the code.
I think that your problem is linked to this button, which has to be clicked.
I wrote a Newspaper3k overview document that talks about these types of issues.
Look at this section of the document for guidance.
Reading the newspaper3k source code, I found that the build() method only downloads categories and feeds; it does not download the links found on the page at the given URL. I have overridden this method and now get the result I want.
I would be highly interested to see your new code with the overridden method.
https://github.com/huangsiyuan924/newspaper/blob/631d97bbeb8f18a9069b9972f74811af2f20c05e/newspaper/source.py#L324-L360
It works for me.
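For anyone who would rather not patch the library, the underlying idea (collecting the links that appear directly in the page at the given URL) can be roughly sketched with the standard library alone. This is not the code at the link above, and it is not part of newspaper3k; LinkCollector and extract_links are hypothetical names, and the same-host filter is just an assumption about which links are interesting.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper, not newspaper3k)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return unique links on the same host as base_url, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = set()
    result = []
    for link in collector.links:
        # Assumption: only same-host links are candidate article URLs.
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

With the index.html saved by the snippet earlier in this thread, something like extract_links(open('index.html', encoding='utf-8').read(), url) should give a list of candidate URLs, each of which could then be handed to newspaper.Article. It will not help if the links only appear after a button is clicked in the browser.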
Thanks. It's interesting that build() works for sites such as cnn.com and bbc.com, but not for dw.com/zh.