[SITES] www.mprnews.org

Open palfrey opened this issue 7 months ago • 0 comments

First, please check that this is really an issue with the library and not a special case of the website:

[x] There is no paywall
[x] You do not have to be logged in to see the articles
[x] You tried using a common browser user agent in your configuration / call
[x] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.mprnews.org

Some sample URLs that I have tried:

https://www.mprnews.org/story/2024/07/09/new-minnesota-state-fair-foods
https://www.mprnews.org/story/2024/07/14/severe-storms-barrel-across-minnesota-overnight-leaving-thousands-without-power

The exact code I used to test these articles/website:

I made a script called can_parse.py and ran it against current master, passing each of the URLs above as its argument. It might be worth adding to the repository as a test script.

import sys

from newspaper.article import Article

url = sys.argv[1]
article = Article(url, fetch_images=False, follow_meta_refresh=True)
article.download()
article.parse()
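
If this were added to the repository as a test script, one possible shape is sketched below. It assumes the same `Article` calls as can_parse.py; the names `parse_with_newspaper`, `check_urls` and `main` are hypothetical and do not exist in newspaper4k. The parser is passed in as a callable so the loop can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, List


def parse_with_newspaper(url: str) -> None:
    """Download and parse one URL with the same calls as can_parse.py."""
    # Imported lazily so check_urls() can be used without the library installed.
    from newspaper.article import Article

    article = Article(url, fetch_images=False, follow_meta_refresh=True)
    article.download()
    article.parse()


def check_urls(urls: Iterable[str], parse_one: Callable[[str], None]) -> Dict[str, str]:
    """Return a mapping of url -> error text for every URL that fails to parse."""
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            parse_one(url)
        except Exception as exc:  # record any parser error and keep going
            failures[url] = f"{type(exc).__name__}: {exc}"
    return failures


def main(argv: List[str], parse_one: Callable[[str], None] = parse_with_newspaper) -> int:
    """Print one line per failing URL; return 1 if anything failed, else 0."""
    failed = check_urls(argv, parse_one)
    for url, error in failed.items():
        print(f"FAIL {url}: {error}")
    return 1 if failed else 0
```

To use it as a script, call `sys.exit(main(sys.argv[1:]))` at the bottom of the file and pass any number of URLs on the command line.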

Other information, remarks, messages, etc:

Traceback (most recent call last):
  File "/home/palfrey/src/newspaper4k/can_parse.py", line 8, in <module>
    article.parse()
  File "/home/palfrey/src/newspaper4k/newspaper/article.py", line 466, in parse
    authors = self.extractor.get_authors(self.doc)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/content_extractor.py", line 59, in get_authors
    return self.author_extractor.parse(doc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/authors_extractor.py", line 99, in parse
    if "@graph" in script_tag:
       ^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
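
The traceback suggests that `script_tag` is `None` when the `"@graph" in script_tag` check runs. One way this could happen is a JSON-LD script block whose body decodes to JSON `null`, though I have not confirmed that this is what the site serves. A minimal sketch of a defensive guard under that assumption follows; the helper name `iter_jsonld_nodes` is hypothetical and is not newspaper4k's actual code.

```python
import json
from typing import Any, Iterator


def iter_jsonld_nodes(raw: str) -> Iterator[dict]:
    """Yield dict nodes from one JSON-LD script body, skipping unusable input.

    Hypothetical helper for illustration; not taken from newspaper4k.
    """
    try:
        script_tag: Any = json.loads(raw)
    except (TypeError, ValueError):
        # Malformed or non-string content: nothing to extract.
        return

    # json.loads("null") returns None, and `"@graph" in None` raises
    # the TypeError shown in the traceback, so check the type first.
    if isinstance(script_tag, dict):
        if isinstance(script_tag.get("@graph"), list):
            candidates = script_tag["@graph"]
        else:
            candidates = [script_tag]
    elif isinstance(script_tag, list):
        candidates = script_tag
    else:
        return

    for node in candidates:
        if isinstance(node, dict):
            yield node
```

With a guard like this, a `null` or malformed script block yields no nodes instead of raising, so author extraction could fall through to its other strategies.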

Jul 14 '24 16:07 palfrey
