How to use an HTML file in newspaper3k the same way it works with a URL page

Open MeetH15 opened this issue 4 years ago • 13 comments

Please help me, @yprez.

Mar 02 '20 16:03 MeetH15

It would be nice if someone could point to an example that shows how to use an HTML file.

from newspaper import Article

your_html = """
index.html
<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <title></title>
  <meta name="author" content="">
  <meta name="description" content="">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link href="css/normalize.css" rel="stylesheet">
  <link href="css/style.css" rel="stylesheet">
</head>

<body>

  <p>Hello, world!</p>

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
  <script src="js/script.js"></script>
</body>

</html>
"""

article = Article("random_url")
article.download(input_html=your_html)
article = article.parse()

Or, if you only want to get the full text of an article:

from newspaper import fulltext

text = fulltext(your_html, language="your supported language")

Mar 04 '20 06:03 iwpnd

What is response.url, @iwpnd?

Mar 05 '20 03:03 MeetH15

Just some random URL. It will not be used, since you provide input_html anyway.

Mar 05 '20 06:03 iwpnd

When I run it, it shows 'None' as the output. @iwpnd, can you show the output you get on your screen?

Mar 06 '20 03:03 MeetH15

I was showing you how to pass in HTML, not providing you with a valid HTML newspaper article.

Mar 06 '20 05:03 iwpnd
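
A minimal sketch of the offline workflow discussed above, assuming newspaper3k is installed and the saved page is a real article rather than a bare template. In newspaper3k, Article.parse() returns None and fills in the article object in place, so its result should not be assigned back to article; that reassignment is a likely source of the 'None' output reported above. The helper names read_html_file and parse_offline_article, the placeholder URL, and the file name in the usage comment are illustrative and not part of the library.

```python
from pathlib import Path


def read_html_file(path, encoding="utf-8"):
    """Return the contents of a saved HTML page as a string."""
    return Path(path).read_text(encoding=encoding)


def parse_offline_article(html, url="http://localhost/offline-article"):
    """Parse already-downloaded HTML with newspaper3k (hypothetical helper).

    The URL is only a placeholder: the page itself is not fetched,
    because the HTML is supplied through input_html.
    """
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download(input_html=html)
    article.parse()  # returns None; results are stored on the article object
    return article


# Usage (assumed file name, for illustration only):
#   article = parse_offline_article(read_html_file("saved_article.html"))
#   print(article.title, article.publish_date)
#   print(article.text)
```

With a page that has no recognizable article body, such as the "Hello, world!" template, the title and text may simply come back empty.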

I am trying to get exactly the same result as the demo URL: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2014%2F01%2F12%2Fworld%2Fasia%2Fnorth-korea-charles-smith%2Findex.html

but I am not getting the same result. Can anyone help?

May 07 '20 17:05 ashkaushik

Do you want to output the extracted content from a news source to an HTML page, as the example shows?

Oct 12 '20 16:10 johnbumgarner

I want to extract the date, title and text from an article that I passed in as HTML. I have tried this:

article = Article("random_url")  # I have also tried an empty string ""
article.download(input_html=your_html)
article = article.parse()  # I have also tried just article.parse()

But I'm getting the error:

"TypeError: unhashable type: 'slice'"

What should I do?

Dec 04 '20 18:12 taga93

Look at this section https://github.com/johnbumgarner/newspaper3_usage_overview#extraction-from-offline-html-files of the overview document that I published on using Newspaper.

Dec 05 '20 00:12 johnbumgarner
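
The thread does not establish what caused the TypeError above. One possibility, stated here as an assumption rather than a diagnosis, is that input_html received something other than a plain string, such as a response object or an already-parsed document. A small guard like the following (the helper name to_html_text is hypothetical) makes that kind of mistake fail early with a clearer message:

```python
def to_html_text(raw, encoding="utf-8"):
    """Return HTML as str, decoding bytes and rejecting other types."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, (bytes, bytearray)):
        # Undecodable bytes are replaced rather than raising an error.
        return bytes(raw).decode(encoding, errors="replace")
    raise TypeError(
        "expected HTML as str or bytes, got " + type(raw).__name__
    )
```

The returned string can then be passed along as article.download(input_html=to_html_text(raw)).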

When I run your first code sample, the final value of article is None.

UPDATE:

The 2nd option (fulltext), when applied to your HTML sample, triggers an AttributeError.


AttributeError                            Traceback (most recent call last)
<ipython-input-5-fb793e263c15> in <module>
----> 1 text = fulltext(html, language="en")

/usr/local/lib/python3.8/dist-packages/newspaper/api.py in fulltext(html, language)
     89 
     90     top_node = extractor.calculate_best_node(doc)
---> 91     top_node = extractor.post_cleanup(top_node)
     92     text, article_html = output_formatter.get_formatted(top_node)
     93     return text

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in post_cleanup(self, top_node)
   1038         or paras with no gusto; add adjacent nodes which look contenty
   1039         """
-> 1040         node = self.add_siblings(top_node)
   1041         for e in self.parser.getChildren(node):
   1042             e_tag = self.parser.getTag(e)

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in add_siblings(self, top_node)
    867 
    868     def add_siblings(self, top_node):
--> 869         baseline_score_siblings_para = self.get_siblings_score(top_node)
    870         results = self.walk_siblings(top_node)
    871         for current_node in results:

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in get_siblings_score(self, top_node)
    924         paragraphs_number = 0
    925         paragraphs_score = 0
--> 926         nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
    927 
    928         for node in nodes_to_check:

/usr/local/lib/python3.8/dist-packages/newspaper/parsers.py in getElementsByTag(cls, node, tag, attr, value, childs, use_regex)
    121                 trans = 'translate(@%s, "%s", "%s")' % (attr, string.ascii_uppercase, string.ascii_lowercase)
    122                 selector = '%s[contains(%s, "%s")]' % (selector, trans, value.lower())
--> 123         elems = node.xpath(selector, namespaces=NS)
    124         # remove the root node
    125         # if we have a selection tag

AttributeError: 'NoneType' object has no attribute 'xpath'

May 08 '21 12:05 imrek
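
The traceback above shows top_node arriving as None in post_cleanup, meaning calculate_best_node found no article body in the sample page. One way to handle such pages, sketched here under assumptions, is to treat that AttributeError as "no text found". The wrapper name safe_fulltext and its extract parameter are hypothetical; by default it calls newspaper.fulltext, which assumes newspaper3k is installed.

```python
def safe_fulltext(html, language="en", extract=None):
    """Return extracted text, or "" when no article body is found."""
    if extract is None:
        from newspaper import fulltext as extract  # third-party
    try:
        return extract(html, language=language)
    except AttributeError:
        # Raised when the extractor finds no top node, as in the
        # traceback above; note this also hides other AttributeErrors.
        return ""
```

Catching AttributeError this broadly is a trade-off: it keeps a batch job running over many files, but it can mask unrelated bugs, so logging the exception may be worthwhile.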

@imrek

The first code example didn't follow the syntax of the code example that I posted in my overview document. Please review my code example for processing offline HTML content.

I have never used fulltext, so I would have to review the Newspaper code to see how this function works.

May 09 '21 16:05 johnbumgarner

@imrek I also looked at the fulltext function. I'm not sure what it does differently from article.text. According to the code base, the function expects article.html rather than your_html. I tested the function with multiple news sites and received no errors. Also, the length of article.text and the length of the fulltext output were the same.

May 09 '21 17:05 johnbumgarner
