[SITES] https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/

Open palfrey opened this issue 8 months ago • 0 comments

First, please check that it is really an issue with the library, and not a special case of the website:

[x] There is no paywall
[x] You do not have to be logged in to see the articles
[x] You tried using a common browser user agent in your configuration / call
[x] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/

Some sample URLs that I have tried

https://www.scientificamerican.com/article/china-has-plans-for-the-worlds-largest-particle-collider/

The exact code I used to test this article/website

The standard article.parse() route just gets a 403. Feeding in the raw HTML fetched with Playwright, on the other hand, raises the error shown under "Other information" (and I've checked that page.content() just returns HTML).

        with sync_playwright() as p:
            browser = p.firefox.launch()
            page = browser.new_page()
            page.goto(url)
            article.html = page.content()
            article.parse()

What parts of the article are missing / not parsed correctly

Everything, because article.parse() raises an exception before returning anything.

Other information, remarks, messages, etc:

<in my code as per the above>
    article.parse()
lib/python3.11/site-packages/newspaper/article.py:466: in parse
    authors = self.extractor.get_authors(self.doc)
lib/python3.11/site-packages/newspaper/extractors/content_extractor.py:59: in get_authors
    return self.author_extractor.parse(doc)
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in parse
    authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
lib/python3.11/site-packages/newspaper/extractors/authors_extractor.py:131: in <listcomp>
    authors = [re.sub("[\n\t\r\xa0]", " ", x) for x in authors if x]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pattern = '[\n\t\r\xa0]', repl = ' '
string = {'biography': "<p>First published in 1869, <b><i>Nature</i></b> is the world's leading multidisciplinary science journ.../p>", 'contacts': [], 'contentful_id': '7Ek1B681o6mb6QOBg14RKO', 'mura_id': 'A7F2375E-BB3B-4896-8F706A83EEA765D7', ...}
count = 0, flags = 0

    def sub(pattern, repl, string, count=0, flags=0):
        """Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the Match object and must return
        a replacement string to be used."""
>       return _compile(pattern, flags).sub(repl, string, count)
E       TypeError: expected string or bytes-like object, got 'dict'

lib/python3.11/re/__init__.py:185: TypeError
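
The traceback suggests that one entry in the authors list is a dict (apparently structured author metadata from the page) rather than a string, and re.sub() rejects it. The following is a minimal sketch that reproduces the TypeError without the library and shows one possible guard. The helper name clean_authors and the "name" key are assumptions made for illustration; neither is taken from newspaper4k or confirmed against the page's data.

```python
import re

WHITESPACE = "[\n\t\r\xa0]"

# Reproduce the failure: re.sub() only accepts str or bytes, not a dict.
try:
    re.sub(WHITESPACE, " ", {"biography": "<p>...</p>", "contacts": []})
except TypeError as exc:
    error_message = str(exc)


def clean_authors(authors):
    """Hypothetical helper: normalise whitespace and skip non-string entries."""
    cleaned = []
    for entry in authors:
        if isinstance(entry, dict):
            # Assumption: a dict entry holds the display name under "name".
            entry = entry.get("name")
        if isinstance(entry, str) and entry:
            cleaned.append(re.sub(WHITESPACE, " ", entry))
    return cleaned


sample = [
    "Jane\xa0Doe",
    {"name": "Nature magazine", "contacts": []},
    {"biography": "<p>...</p>"},  # no usable name, so it is dropped
    None,
]
result = clean_authors(sample)
print(result)
```

With a guard along these lines, a dict without a usable name is dropped instead of aborting the whole parse, so the rest of the article could still be extracted.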

Jun 21 '24 20:06 palfrey
