newspaper4k
[SITES] www.mprnews.org
First, please check that this is really an issue with the library and not a special case of the website:
- [x] There is no paywall
- [x] You do not have to be logged in to see the articles
- [x] You tried using a common browser user agent in your configuration / call
- [x] The website is not in the list of well known problematic sites
Your report as follows:
Website that does not parse correctly:
https://www.mprnews.org
Some sample urls that I have tried
https://www.mprnews.org/story/2024/07/09/new-minnesota-state-fair-foods
https://www.mprnews.org/story/2024/07/14/severe-storms-barrel-across-minnesota-overnight-leaving-thousands-without-power
The exact code I used to test these articles/website:
I made a script called can_parse.py and ran it against current master, passing each of the URLs as an argument. It might be worth adding to the repository as a test script.
import sys
from newspaper.article import Article
url = sys.argv[1]
article = Article(url, fetch_images=False, follow_meta_refresh=True)
article.download()
article.parse()
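If the script were to be added to the repository, a batch version might be more convenient. The following is only a sketch: `check_urls`, `parse_with_newspaper`, and `main` are hypothetical names, not part of newspaper4k. It uses the same `Article` calls as above and reports one result per URL instead of stopping at the first traceback.

```python
import sys
from typing import Callable, Dict, Iterable, List, Optional


def parse_with_newspaper(url: str) -> None:
    """Download and parse one article with the same calls as can_parse.py."""
    # Imported lazily so check_urls() can be exercised without the library installed.
    from newspaper.article import Article

    article = Article(url, fetch_images=False, follow_meta_refresh=True)
    article.download()
    article.parse()


def check_urls(
    urls: Iterable[str], parse_fn: Callable[[str], None]
) -> Dict[str, Optional[str]]:
    """Map each URL to None on success, or to a short error string on failure."""
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            parse_fn(url)
            results[url] = None
        except Exception as exc:  # broad on purpose: any parser crash should be reported
            results[url] = f"{type(exc).__name__}: {exc}"
    return results


def main(argv: List[str]) -> int:
    """Print one line per URL and return a non-zero code if any URL failed."""
    outcome = check_urls(argv, parse_with_newspaper)
    for url, error in outcome.items():
        status = "FAIL" if error else "ok"
        print(f"{status:4} {url} {error or ''}")
    return 1 if any(outcome.values()) else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Passing the parse function in as an argument keeps the reporting logic testable without network access.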
Other information, remarks, messages, etc:
Traceback (most recent call last):
  File "/home/palfrey/src/newspaper4k/can_parse.py", line 8, in <module>
    article.parse()
  File "/home/palfrey/src/newspaper4k/newspaper/article.py", line 466, in parse
    authors = self.extractor.get_authors(self.doc)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/content_extractor.py", line 59, in get_authors
    return self.author_extractor.parse(doc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palfrey/src/newspaper4k/newspaper/extractors/authors_extractor.py", line 99, in parse
    if "@graph" in script_tag:
       ^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
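
The traceback shows that `script_tag` is `None` when the `"@graph" in script_tag` check runs. I have not confirmed why, but one plausible cause is a JSON-LD script whose body decodes to `null` (or fails to decode). The sketch below shows the kind of guard that would avoid the crash. It is a standalone illustration, not the library's code: `authors_from_json_ld` is a hypothetical helper, and it assumes the input is a list of raw JSON-LD strings.

```python
import json
from typing import Any, List


def authors_from_json_ld(raw_scripts: List[str]) -> List[str]:
    """Collect author names from JSON-LD payloads, skipping non-object payloads."""
    authors: List[str] = []
    for raw in raw_scripts:
        try:
            data: Any = json.loads(raw)
        except (TypeError, ValueError):
            continue  # missing, empty, or malformed script body
        if not isinstance(data, dict):
            # JSON `null` decodes to None; `"@graph" in None` raises the TypeError above.
            continue
        # Schema.org nodes may sit at the top level or be nested under "@graph".
        nodes = data["@graph"] if "@graph" in data else [data]
        if not isinstance(nodes, list):
            continue
        for node in nodes:
            if not isinstance(node, dict):
                continue
            author = node.get("author")
            candidates = author if isinstance(author, list) else [author]
            for candidate in candidates:
                # An author may be an object with a "name" key or a plain string.
                name = candidate.get("name") if isinstance(candidate, dict) else candidate
                if isinstance(name, str) and name not in authors:
                    authors.append(name)
    return authors
```

With this shape of check, a `null` payload is skipped rather than aborting the whole `parse()` call, so the rest of the article can still be extracted.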