[SITES] https://calarasipress.ro/

Open TudorAndrei opened this issue 10 months ago • 0 comments

First, please check that it is really an issue with the library and not a special case of the website:

[x] There is no paywall
[x] You do not have to be logged in to see the articles
[x] You tried using a common browser user agent in your configuration / call
[x] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://calarasipress.ro/au-sunat-alarmele-la-calarasi-oamenii-nu-au-stiut-ce-se-intampla/img_3495/

The others work as intended

www.example.com/article1 www.example.com/article2

The exact code I used to test this article/website:

# Load the HTML manually; `html` holds the page source obtained beforehand
from newspaper import Article

at = Article(url=None)
at.download(html, title="")
at.parse()
at.text

What parts of the article are missing / not parsed correctly:

[ ] Title
[x] Text Content
[ ] Publication Date
[ ] Authors
[ ] Images
[ ] Movies

Other information, remarks, messages, etc:

The extractor returns the lines from the "Breaking News" widget as the article text instead of the actual article body. This is not obvious at first: the "Breaking News" content is present in the HTML, but on the rendered page it only becomes visible when the user hovers over the "Breaking News" tab.
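A possible workaround, sketched here under assumptions rather than taken from the library, is to drop the hidden widget from the HTML before handing it to `Article.download`. The class name `breaking-news` below is hypothetical (the real markup on calarasipress.ro was not inspected), and `strip_class` is an illustrative helper, not part of newspaper4k. The sketch uses only the standard library and assumes reasonably well-balanced markup.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while dropping every element that carries a given class."""

    def __init__(self, blocked_class):
        super().__init__(convert_charrefs=False)
        self.blocked_class = blocked_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _is_blocked(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return self.blocked_class in classes

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._is_blocked(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested region.
        if not self.skip_depth and not self._is_blocked(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_class(html_text, blocked_class):
    """Return html_text without elements whose class list has blocked_class."""
    parser = ClassStripper(blocked_class)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


# Hypothetical markup standing in for the real page.
sample = (
    '<html><body><div class="breaking-news"><ul><li>Headline A</li></ul></div>'
    "<article><p>Real article text</p></article></body></html>"
)
cleaned = strip_class(sample, "breaking-news")
print(cleaned)
```

The cleaned string could then be passed to `at.download(cleaned, title="")` in place of the raw `html`. Whether this restores the correct text for this site depends on the actual class or id of the widget, which would need to be checked in the page source.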

Mar 28 '24 13:03 TudorAndrei

newspaper4k