newspaper
`parse` hangs on some files
Hi, I reported the issue on goose3. I was hoping Newspaper would not have this problem, but the same issue also occurs with Newspaper. You can replicate the issue with the following lines:
from newspaper import Article
article = Article('https://gist.githubusercontent.com/ma-ji/2dd9689a01c48bf7323b89d4e6b927d5/raw/21f680df041c9816d9d80faa4af599aa90df90be/raw_html.html')
article.download()
article.parse()
System info:
OS Ubuntu 18.04.5 LTS
Python 3.8.11
IPython 7.26.0
newspaper 0.2.8
The way you are passing the URL is incorrect. The correct way to pass this URL is:
from newspaper import Article
article = Article('https://gist.githubusercontent.com/ma-ji/2dd9689a01c48bf7323b89d4e6b927d5/raw/21f680df041c9816d9d80faa4af599aa90df90be/raw_html.html')
article.download()
article.parse()
print(article.title)
\n\t\n\tRSS Testing\n \n
This way causes no issues.
Also I looked at the source code and it's unclear what this function does:
def set_html(self, html):
    """Encode HTML before setting it
    """
    if html:
        if isinstance(html, bytes):
            html = self.config.get_parser().get_unicode_html(html)
        self.html = html
        self.download_state = ArticleDownloadState.SUCCESS
It seems that it should be a private method rather than one exposed by the class.
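To make the behaviour of that method easier to see, here is a minimal standalone sketch of the same logic. `ArticleSketch` and `DownloadState` are hypothetical stand-ins, not Newspaper's classes, and plain UTF-8 decoding is an assumption standing in for the parser's `get_unicode_html()` call.

```python
from enum import Enum


class DownloadState(Enum):
    # Hypothetical stand-in for newspaper's ArticleDownloadState.
    NOT_STARTED = 0
    SUCCESS = 2


class ArticleSketch:
    """Hypothetical stand-in for Article; not newspaper's actual class."""

    def __init__(self):
        self.html = ''
        self.download_state = DownloadState.NOT_STARTED

    def set_html(self, html):
        # Empty or None input is ignored, so the state stays NOT_STARTED.
        if html:
            if isinstance(html, bytes):
                # Assumption: plain UTF-8 decoding stands in for the
                # parser's get_unicode_html() call.
                html = html.decode('utf-8', errors='replace')
            self.html = html
            # The article is flagged as downloaded, presumably so that a
            # later parse() accepts HTML that never went through download().
            self.download_state = DownloadState.SUCCESS
```

In this sketch, calling `ArticleSketch().set_html(b'<p>hi</p>')` stores the decoded string and flips the state to `SUCCESS`, which is likely why the method is useful for HTML that is already on disk.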
I used set_html(html) because the HTML file is local. I ran the exact code you gave, but the system still hangs and RAM usage keeps increasing. How long does it take you to get the results?
My system info:
- Python 3.8.11
- IPython 7.26.0
- newspaper 0.2.8
Thanks for helping!
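When a call hangs and memory keeps growing, one way to keep the machine usable while investigating is to run the snippet in a child process with a time limit. The sketch below uses only the standard library; `run_with_timeout` and `SNIPPET` are hypothetical names introduced here, and the 60-second limit mentioned afterwards is an arbitrary choice.

```python
import subprocess
import sys


def run_with_timeout(code, seconds):
    """Run a Python snippet in a child process; return None on timeout."""
    try:
        result = subprocess.run(
            [sys.executable, '-c', code],
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires,
        # which also releases whatever memory it had allocated.
        return None
    return result.stdout


# The snippet under investigation; the file name is the one used in this thread.
SNIPPET = """
from newspaper import Article
article = Article('', language='en')
with open('raw_html.html', 'r') as f:
    article.download(input_html=f.read())
article.parse()
print(repr(article.title))
"""
```

Calling `run_with_timeout(SNIPPET, 60)` should return the printed title if parsing finishes in time, or `None` if the child had to be killed.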
Since the HTML is in a local file, there is another way to process it. I show how to process such files in my Newspaper usage overview document. Just let me know if you are still having issues after reviewing my code example.
Thanks for the example. I tried it, but it is still not working. Here are the code and file:
from newspaper import Article
article = Article('', language='en')
article.download(input_html=open("test.txt", 'r').read())
article.parse()
I downloaded your file locally. I was able to access the file with the code below with no issues.
with open("raw_html.html", 'r') as f:
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)
\n\t\n\tRSS Testing\n \n
I changed the file extension in this example.
article = Article('', language='en')
article.download(input_html=open("raw_html.txt", 'r').read())
article.parse()
print(article.title)
\n\t\n\tRSS Testing\n \n
Okay, I tried your code on a Windows machine and it works!
On my Linux server, though, it is still not working ...
I'm not sure why it would hang on your Linux server. What server are you using? I also noted that you're using Python , which I haven't used with Newspaper.
I'm using OS Ubuntu 18.04.5 LTS. I've also updated the first post.
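Because the same code works on Windows but hangs on the Ubuntu server, comparing the two environments side by side may help narrow things down. The following is a small sketch that collects the relevant versions using only the standard library (Python 3.8+). The package names are assumptions: `newspaper3k` is assumed to be the distribution name for Newspaper 0.2.8, and `lxml` and `beautifulsoup4` are assumed to be the parsing dependencies worth comparing.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=('newspaper3k', 'lxml', 'beautifulsoup4')):
    """Collect OS, Python, and package versions for a bug report."""
    report = {
        'os': platform.platform(),
        'python': sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Assumed package names may not match what is installed.
            report[name] = 'not installed'
    return report


for key, value in environment_report().items():
    print(f'{key}: {value}')
```

Running this on both machines and diffing the output would show whether the parsing libraries differ between the working and the hanging setup.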