newspaper4k
[SITES] https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/
First please check that it is really an issue with the library, and not some special case of website:
- [x] There is no paywall
- [x] You do not have to be logged in to see the articles
- [x] You tried using a common browser user agent in your configuration / call
- [x] The website is not in the list of well known problematic sites
Your report as follows:
Website that does not parse correctly:
https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/
Some sample urls that I have tried
https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/
The exact code I used to test this article/website
Standard article.parse() just gets a 403. Feeding in the raw HTML via Playwright, on the other hand, raises the error shown under "Other information" (I've checked that page.content() just returns HTML):
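from playwright.sync_api import sync_playwright

# `url` and `article` are set up earlier in my code
with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()
    page.goto(url)
    # Hand the rendered HTML to the article instead of downloading it
    article.html = page.content()
article.parse()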
What parts of the article are missing / not parsed correctly
Everything, because article.parse() raises an exception before returning anything.
Other information, remarks, messages, etc:
<in my code as per the above>
article.parse()
lib/python3.11/site-packages/newspaper/article.py:466: in parse
authors = self.extractor.get_authors(self.doc)
lib/python3.11/site-packages/newspaper/extractors/content_extractor.py:59: in get_authors
return self.author_extractor.parse(doc)
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in parse
authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in <listcomp>
authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pattern = '[\n\t\r\xa0]', repl = ' '
string = {'biography': "<p>First published in 1869, <b><i>Nature</i></b> is the world's leading multidisciplinary science journ.../p>", 'contacts': [], 'contentful_id': '7Ek1B681o6mb6QOBg14RKO', 'mura_id': 'A7F2375E-BB3B-4896-8F706A83EEA765D7', ...}
count = 0, flags = 0
    def sub(pattern, repl, string, count=0, flags=0):
        """Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl. repl can be either a string or a callable;
        if a string, backslash escapes in it are processed. If it is
        a callable, it's passed the Match object and must return
        a replacement string to be used."""
>       return _compile(pattern, flags).sub(repl, string, count)
E       TypeError: expected string or bytes-like object, got 'dict'
lib/python3.11/re/__init__.py:185: TypeError
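
The traceback shows that one entry in the authors list is a dict rather than a string, and re.sub() rejects it. Below is a minimal sketch of a guard that would avoid the crash. It is not the library's actual fix: normalize_authors is a hypothetical helper, and the "name" key is an assumption, since the truncated dict above only shows keys such as 'biography', 'contacts', 'contentful_id' and 'mura_id'.

```python
import re


def normalize_authors(authors):
    """Hypothetical helper: keep only usable string author names.

    Dict entries are reduced to their "name" value when present
    (the key name is an assumption, not confirmed by the traceback);
    anything else that is not a non-empty string is dropped.
    """
    cleaned = []
    for entry in authors:
        if isinstance(entry, dict):
            entry = entry.get("name")  # assumed key
        if not isinstance(entry, str) or not entry:
            continue
        # Same whitespace cleanup as the failing line in authors_extractor.py
        cleaned.append(re.sub("[\n\t\r\xa0]", " ", entry).strip())
    return cleaned


# Illustrative input mixing plain strings and a dict-shaped author record
sample = ["Jane\xa0Doe", {"name": "Nature", "biography": "<p>...</p>"}, {"contacts": []}, ""]
result = normalize_authors(sample)
```

With this guard, the string entry is cleaned as before, the dict entry contributes its name if it has one, and entries without a usable name are skipped instead of raising TypeError.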