newspaper: http://www.dw.com, I couldn't get URLs.


Open huangsiyuan924 opened this issue 1 year ago • 8 comments

Aug 10 '22 10:08 huangsiyuan924

https://www.dw.com/zh/%E5%9C%A8%E7%BA%BF%E6%8A%A5%E5%AF%BC/s-9058?&zhongwen=simp

Aug 10 '22 10:08 huangsiyuan924

Please post your code.

Aug 10 '22 13:08 johnbumgarner

Please post your code.

    import newspaper
    from newspaper.configuration import Configuration

    config = Configuration()
    HEADERS = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    config.headers = HEADERS
    url = 'https://www.dw.com/zh/%E5%9C%A8%E7%BA%BF%E6%8A%A5%E5%AF%BC/s-9058?&zhongwen=simp'
    sina_paper = newspaper.build(url, config=config)
    # Save the downloaded page so it can be inspected
    open('index.html', 'w', encoding='utf-8').write(sina_paper.html)
    print(sina_paper.size())

The result is 2.

Aug 10 '22 13:08 huangsiyuan924

Thanks for the code.

I think that your problem is linked to this button, which has to be clicked.

[Screenshot of the button, 2022-08-10]

I wrote a Newspaper3k overview document that talks about these types of issues.

Look at this section of the document for guidance.

Aug 10 '22 13:08 johnbumgarner

Thanks. I think that your problem is linked to this button, which has to be clicked.

I wrote a Newspaper3k overview document that talks about these issues.

Look at this section of the document for guidance.

I found in the newspaper3k source code that the build() method only downloads categories and feeds; it does not download the links on the page at the given URL. I have overridden this method and successfully got the result I wanted.

Aug 11 '22 03:08 huangsiyuan924
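For readers following along: one way to check which links are actually present on the page, independently of what build() collects, is to parse the downloaded HTML directly. The following is only a minimal sketch using the Python standard library. It is not huangsiyuan924's override and not part of newspaper3k. The names LinkCollector and extract_links, the sample HTML, and the article paths in it are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect absolute, same-host links from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        # Keep only http(s) links on the same host, without duplicates
        if (parsed.scheme in ("http", "https")
                and parsed.netloc == self.host
                and absolute not in self.links):
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the same-host links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample markup; in practice the HTML could come from
# sina_paper.html in the snippet posted earlier in this thread.
sample_html = '''
<a href="/zh/example-article/a-1">One</a>
<a href="https://www.dw.com/zh/example-article/a-2">Two</a>
<a href="https://example.org/elsewhere">Off-site</a>
<a href="/zh/example-article/a-1">Duplicate</a>
<a href="javascript:void(0)">Script</a>
'''
links = extract_links(sample_html, "https://www.dw.com/zh/")
print(links)
```

Each collected URL could then be passed to newspaper.Article(url, config=config), followed by download() and parse(). Note that this only finds links present in the static HTML; anything loaded after clicking a button would not show up.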

I would be highly interested to see your new code with the overridden method.

Aug 11 '22 04:08 johnbumgarner

I would be highly interested to see your new code with the overridden method.

https://github.com/huangsiyuan924/newspaper/blob/631d97bbeb8f18a9069b9972f74811af2f20c05e/newspaper/source.py#L324-L360

It works for me.

Aug 11 '22 05:08 huangsiyuan924

Thanks. It's interesting that build() works for sites such as cnn.com and bbc.com, but not for dw.com/zh.

Aug 11 '22 12:08 johnbumgarner
