How to use an HTML file in newspaper3k the same way it works with a URL page

Open MeetH15 opened this issue 4 years ago • 13 comments

Please help me, @yprez.

MeetH15 avatar Mar 02 '20 16:03 MeetH15

It would be nice if someone could point to an example that shows how to use an HTML file.

animesh-sharama avatar Mar 03 '20 17:03 animesh-sharama

from newspaper import Article

your_html = """
index.html
<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <title></title>
  <meta name="author" content="">
  <meta name="description" content="">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link href="css/normalize.css" rel="stylesheet">
  <link href="css/style.css" rel="stylesheet">
</head>

<body>

  <p>Hello, world!</p>

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
  <script src="js/script.js"></script>
</body>

</html>
"""

article = Article("random_url")
article.download(input_html=your_html)
article = article.parse()

If you only want to get the full text of an article:

from newspaper import fulltext

text = fulltext(your_html, language="your supported language")

iwpnd avatar Mar 04 '20 06:03 iwpnd

What is response.url, @iwpnd?

MeetH15 avatar Mar 05 '20 03:03 MeetH15

Just some random URL. It will not be used, since you provide input_html anyway.

iwpnd avatar Mar 05 '20 06:03 iwpnd

When I run it, it shows 'None' as the output. @iwpnd, can you show the output you get on your screen?

MeetH15 avatar Mar 06 '20 03:03 MeetH15

I was showing you how to use an HTML string, not providing you with a valid HTML news article.

iwpnd avatar Mar 06 '20 05:03 iwpnd
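
The 'None' reported above is most likely the return value of parse() rather than a failed extraction: in newspaper3k, Article.parse() fills in the article object in place and returns nothing, so writing article = article.parse() throws the article away. A minimal sketch of the pattern follows. The helper name extract_fields and its article_factory parameter are hypothetical, introduced here only so the example runs with or without newspaper3k installed.

```python
from typing import Any, Callable, Dict


def extract_fields(html: str, article_factory: Callable[[str], Any]) -> Dict[str, Any]:
    """Parse an HTML string and return a few extracted fields.

    article_factory is a hypothetical parameter: pass newspaper's Article
    class, or any stand-in that offers the same download()/parse() methods.
    """
    # The URL is only a placeholder (the same one used in the thread);
    # the HTML itself is supplied through input_html.
    article = article_factory("random_url")
    article.download(input_html=html)

    # parse() works in place and returns None, so keep the original
    # reference instead of writing `article = article.parse()`.
    article.parse()

    return {
        "title": article.title,
        "text": article.text,
        "publish_date": article.publish_date,
    }
```

With newspaper3k installed, the call would presumably be extract_fields(your_html, Article). On the 'Hello, world!' sample page the fields may still come back empty, since that page has no article body to extract.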

I'm trying to get exactly the same result as the demo URL gives: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2014%2F01%2F12%2Fworld%2Fasia%2Fnorth-korea-charles-smith%2Findex.html

But I'm not getting the same result. Can anyone help?

ashkaushik avatar May 07 '20 17:05 ashkaushik

Do you want to output the extracted content from a news source to an HTML page, as the example shows?

johnbumgarner avatar Oct 12 '20 16:10 johnbumgarner

I want to extract the date, title and text from an article that I passed as HTML. I have tried this:

article = Article("random_url")  # I have also tried with just an empty ""
article.download(input_html=your_html)
article = article.parse()  # I have also tried just article.parse()

But I'm getting the error:

"TypeError: unhashable type: 'slice'"

What should I do?

taga93 avatar Dec 04 '20 18:12 taga93

Look at this section https://github.com/johnbumgarner/newspaper3_usage_overview#extraction-from-offline-html-files of the overview document that I published on using Newspaper.

johnbumgarner avatar Dec 05 '20 00:12 johnbumgarner
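
For the offline case mentioned above, the HTML first has to be read from disk into a string before it can be handed to download(input_html=...). The following is a minimal sketch of that step only, using the standard library; the function name read_offline_html and the UTF-8 default are assumptions, not something taken from the linked overview document.

```python
from pathlib import Path


def read_offline_html(path: str, encoding: str = "utf-8") -> str:
    """Read a saved HTML page from disk and return it as a string."""
    html = Path(path).read_text(encoding=encoding)

    # An empty file would leave the parser with nothing to work on,
    # so fail early with a clear message.
    if not html.strip():
        raise ValueError(f"{path} is empty; there is nothing to parse")

    return html
```

The returned string would then be passed as article.download(input_html=html), followed by a plain article.parse() call, after which article.title, article.text and article.publish_date can be read from the article object.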

@iwpnd, when I run your first code sample, the final value of article is None.

UPDATE:

The 2nd option (fulltext), when applied to your HTML sample, triggers an AttributeError.


AttributeError                            Traceback (most recent call last)
<ipython-input-5-fb793e263c15> in <module>
----> 1 text = fulltext(html, language="en")

/usr/local/lib/python3.8/dist-packages/newspaper/api.py in fulltext(html, language)
     89 
     90     top_node = extractor.calculate_best_node(doc)
---> 91     top_node = extractor.post_cleanup(top_node)
     92     text, article_html = output_formatter.get_formatted(top_node)
     93     return text

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in post_cleanup(self, top_node)
   1038         or paras with no gusto; add adjacent nodes which look contenty
   1039         """
-> 1040         node = self.add_siblings(top_node)
   1041         for e in self.parser.getChildren(node):
   1042             e_tag = self.parser.getTag(e)

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in add_siblings(self, top_node)
    867 
    868     def add_siblings(self, top_node):
--> 869         baseline_score_siblings_para = self.get_siblings_score(top_node)
    870         results = self.walk_siblings(top_node)
    871         for current_node in results:

/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in get_siblings_score(self, top_node)
    924         paragraphs_number = 0
    925         paragraphs_score = 0
--> 926         nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
    927 
    928         for node in nodes_to_check:

/usr/local/lib/python3.8/dist-packages/newspaper/parsers.py in getElementsByTag(cls, node, tag, attr, value, childs, use_regex)
    121                 trans = 'translate(@%s, "%s", "%s")' % (attr, string.ascii_uppercase, string.ascii_lowercase)
    122                 selector = '%s[contains(%s, "%s")]' % (selector, trans, value.lower())
--> 123         elems = node.xpath(selector, namespaces=NS)
    124         # remove the root node
    125         # if we have a selection tag

AttributeError: 'NoneType' object has no attribute 'xpath'

imrek avatar May 08 '21 12:05 imrek
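
The traceback above shows top_node arriving as None in getElementsByTag, which suggests that no candidate article node was found in the tiny sample page and that fulltext does not guard against that case. One possible workaround is a thin wrapper that treats this failure as "no text found". This is only a sketch: safe_fulltext and its extract parameter are hypothetical names, and the extractor is passed in as a callable so the example does not depend on newspaper3k being installed.

```python
from typing import Callable, Optional


def safe_fulltext(html: str, extract: Callable[[str], str]) -> Optional[str]:
    """Return the extracted text, or None when the extractor fails
    because the page has no recognisable article body."""
    try:
        return extract(html)
    except AttributeError:
        # Matches the failure in the traceback:
        # 'NoneType' object has no attribute 'xpath'
        return None
```

With newspaper3k installed, a call might look like safe_fulltext(your_html, lambda h: fulltext(h, language="en")). Catching AttributeError this broadly can hide unrelated bugs, so it is better suited to batch jobs where a skipped page is acceptable.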

@imrek

The first code example doesn't follow the syntax of the code example that I posted in my overview document. Please review my code example for processing offline HTML content.

I have never used fulltext, so I would have to review the Newspaper code to see how this function works.

johnbumgarner avatar May 09 '21 16:05 johnbumgarner

@imrek I also looked at the fulltext function. I'm not sure what it does differently from article.text. According to the code base, the function expects article.html rather than your_html. I tested the function with multiple news sites and received no errors. Also, the length of article.text and the length of the fulltext output were the same.

johnbumgarner avatar May 09 '21 17:05 johnbumgarner