newspaper `parse` hangs on some files

Open ma-ji opened this issue 2 years ago • 8 comments

Hi, I reported the issue on goose3. I was hoping Newspaper would not have this problem, but the same issue occurs there too. You can reproduce it with the following lines:

from newspaper import Article
article = Article('https://gist.githubusercontent.com/ma-ji/2dd9689a01c48bf7323b89d4e6b927d5/raw/21f680df041c9816d9d80faa4af599aa90df90be/raw_html.html')
article.download()
article.parse()

System info:

OS Ubuntu 18.04.5 LTS
Python 3.8.11
IPython 7.26.0
newspaper 0.2.8

Sep 20 '21 05:09 ma-ji

The way you are passing the URL is incorrect. The correct way to pass it is:

from newspaper import Article

article = Article('https://gist.githubusercontent.com/ma-ji/2dd9689a01c48bf7323b89d4e6b927d5/raw/21f680df041c9816d9d80faa4af599aa90df90be/raw_html.html')
article.download()
article.parse()
print(article.title)
# Output: \n\t\n\tRSS Testing\n \n

This way causes no issues.

Also I looked at the source code and it's unclear what this function does:

def set_html(self, html):
    """Encode HTML before setting it
    """
    if html:
        if isinstance(html, bytes):
            html = self.config.get_parser().get_unicode_html(html)
        self.html = html
        self.download_state = ArticleDownloadState.SUCCESS

It seems that it should be a private method rather than one exposed by the class.

Sep 21 '21 00:09 johnbumgarner

I used `set_html(html)` because the HTML file is local. I ran the exact code you gave, but the system still hangs and RAM usage keeps increasing. How long does it take you to get the results?

My system info:

Python 3.8.11
IPython 7.26.0
newspaper 0.2.8

Thanks for helping!

Sep 22 '21 03:09 ma-ji
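
When `parse` hangs and RAM keeps growing, one way to tell a hang from a slow parse is to run it in a separate interpreter with a time limit, so the runaway process is killed instead of taking the session down. The following is a minimal sketch using only the standard library, not something proposed in the thread: `run_with_timeout`, `check_parse`, the 60-second limit, and the child script are hypothetical names and values, and the child script assumes `newspaper` is installed and mirrors the local-file usage shown elsewhere in this thread.

```python
import subprocess
import sys

# Hypothetical child script: parse a local HTML file with newspaper.
# Assumes newspaper is installed in the same interpreter.
PARSE_SCRIPT = """
import sys
from newspaper import Article

with open(sys.argv[1], 'r') as f:
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(repr(article.title))
"""


def run_with_timeout(script, args=(), timeout=30):
    """Run `script` in a fresh interpreter.

    Returns (finished, output). `finished` is False if the child was
    still running after `timeout` seconds and had to be killed.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False, ""
    return True, proc.stdout + proc.stderr


def check_parse(path, timeout=60):
    """Report whether parsing `path` finishes within `timeout` seconds."""
    finished, output = run_with_timeout(PARSE_SCRIPT, [path], timeout)
    print("finished" if finished else "timed out", output)
```

Calling `check_parse("test.txt")` on the affected machine would then either print the title or report a timeout, without leaving a process that keeps consuming memory.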

Since the HTML is in a local file, there is another way to process it. I show how to process such files in my Newspaper usage overview document. Just let me know if you are still having issues after reviewing my code example.

Sep 22 '21 03:09 johnbumgarner

test.txt

Thanks for the example. I tried it, but it is still not working. Here are the code and the file:

from newspaper import Article
article = Article('', language='en')
article.download(input_html=open("test.txt", 'r').read())
article.parse()

Sep 22 '21 03:09 ma-ji

I downloaded your file locally. I was able to access the file with the code below with no issues.

with open("raw_html.html", 'r') as f:
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)
    # Output: \n\t\n\tRSS Testing\n \n

I changed the file extension in this example.

article = Article('', language='en')
article.download(input_html=open("raw_html.txt", 'r').read())
article.parse()
print(article.title)
# Output: \n\t\n\tRSS Testing\n \n

Sep 22 '21 03:09 johnbumgarner

Okay, I tried your code on a Windows machine, and it works!

On the Linux server I'm using, it is still not working ...

Sep 22 '21 04:09 ma-ji

I'm not sure why it would hang on your Linux server. What server are you using? I also noted that you're using Python , which I haven't used with Newspaper.

Sep 22 '21 21:09 johnbumgarner

I'm using OS Ubuntu 18.04.5 LTS. I've also updated the first post.

Sep 23 '21 03:09 ma-ji
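
Since the same code works on Windows but hangs on the Linux server, comparing the versions installed on both machines may help narrow things down. The sketch below is only a suggestion: `environment_report` is a hypothetical helper, and the package names (`newspaper3k`, `lxml`, `beautifulsoup4`) are assumptions about what is installed and worth comparing. Nothing in the thread confirms which component causes the hang.

```python
import platform
from importlib import metadata  # available in Python 3.8+


def environment_report(packages=("newspaper3k", "lxml", "beautifulsoup4")):
    """Collect OS, Python, and package versions into a dict.

    The default package names are assumptions; adjust them to
    whatever is actually installed.
    """
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def print_report():
    """Print one 'name: version' line per entry, for pasting into the issue."""
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running `print_report()` on both the Windows machine and the Linux server gives two listings that can be compared line by line.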
