newspaper
Authors and date are not correctly identified on a WordPress website
>>> from newspaper import Article
>>> url = 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html'
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.authors
['Дизайн', 'Миша Гончаров', 'Воплощение']
>>> article.publish_date
In fact, there is only one author: ['Иван Кузнецов']
And the date of publication is: 2021-10-05T12:00:54+00:00
- Some of this information is present in HTML attributes, for example, datetime.
- There is a script tag on the page containing JSON with all the original parameters, including the author and the full publication date with the time.
This is a WordPress site, so I'm surprised that the metadata was determined incorrectly.
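To illustrate the first point, here is a minimal sketch of reading a datetime attribute, assuming the page marks the date with a time element. The sample markup and the TimeAttributeParser class are hypothetical, written for illustration rather than copied from the page, and only the standard library is used to keep the sketch dependency-free:

```python
from html.parser import HTMLParser


class TimeAttributeParser(HTMLParser):
    """Collect the 'datetime' attribute of every <time> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.datetimes.append(value)


# Hypothetical markup for illustration; the real page's tags may differ.
SAMPLE_HTML = (
    '<article><time datetime="2021-10-05T12:00:54+00:00">'
    "5 October 2021</time></article>"
)

parser = TimeAttributeParser()
parser.feed(SAMPLE_HTML)
print(parser.datetimes)
```

On the real page, article.html could presumably be fed to the parser in place of SAMPLE_HTML, provided the date is in fact exposed this way.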
Here is an overview document that I wrote on using newspaper3k. This document outlines how to extract the data elements from your page's structure.
Here is some basic code to get you started:
import json

from bs4 import BeautifulSoup
from newspaper import Article, Config

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

# Send a browser-like user agent and give up on the request after 10 seconds.
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

article = Article("https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html", config=config)
article.download()
article.parse()

# The page embeds its metadata as JSON-LD inside a <script type="application/ld+json"> tag.
soup = BeautifulSoup(article.html, 'html.parser')
script_tag = soup.find("script", {"type": "application/ld+json"})
apple_insider_dictionary = json.loads("".join(script_tag.contents))
print(apple_insider_dictionary)
This produces the following output:
{'@context': 'http://schema.org', '@graph': [{'@type': 'WebSite', 'url': 'https://appleinsider.ru', 'potentialAction': {'@type': 'SearchAction', 'target': 'https://appleinsider.ru?s={s}', 'query-input': 'required name=s', 'query': 'required name=s'}}, {'@type': 'BreadcrumbList', 'name': 'Breadcrumbs', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://appleinsider.ru', 'name': 'AppleInsider.ru'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'https://appleinsider.ru/tags', 'name': 'Темы'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://appleinsider.ru/iphone', 'name': 'iPhone'}}]}, {'@type': 'Article', '@id': 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html', 'name': 'Сколько стоят компоненты iPhone 13 Pro? Спойлер: это не себестоимость', 'headline': 'Сколько стоят компоненты iPhone 13 Pro? Спойлер: это не себестоимость', 'datePublished': '2021-10-05T12:00:54+00:00', 'dateModified': '2021-10-05T09:49:28+00:00', 'author': {'@type': 'Person', 'name': 'Иван Кузнецов'}, 'image': 'https://appleinsider.ru/wp-content/uploads/2021/10/iPhone_13_pro_true_cost-740x416.jpg', 'mainEntityOfPage': 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html', 'publisher': {'@type': 'Organization', 'name': 'AppleInsider.ru', 'logo': {'@type': 'ImageObject', 'url': 'https://appleinsider.ru/wp-content/themes/101media-ai-2015/img/logo_mini.png'}}}]}
You can obtain these data elements from this JSON:
- Author name
- datePublished
- dateModified
- headline
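As a rough sketch of pulling those four elements out, the snippet below walks the '@graph' list and reads the node whose '@type' is 'Article'. The sample string is a trimmed copy of the output shown above, and extract_article_metadata is a hypothetical helper name, not part of newspaper3k. It assumes 'author' is a single object, as on this page; other sites may use a list.

```python
import json
from datetime import datetime

# Trimmed copy of the JSON-LD output shown above (only the fields used here).
SAMPLE_JSON_LD = """
{
  "@context": "http://schema.org",
  "@graph": [
    {"@type": "WebSite", "url": "https://appleinsider.ru"},
    {
      "@type": "Article",
      "headline": "Сколько стоят компоненты iPhone 13 Pro? Спойлер: это не себестоимость",
      "datePublished": "2021-10-05T12:00:54+00:00",
      "dateModified": "2021-10-05T09:49:28+00:00",
      "author": {"@type": "Person", "name": "Иван Кузнецов"}
    }
  ]
}
"""


def extract_article_metadata(json_ld):
    """Return author, dates, and headline from the first Article node in '@graph'."""
    for node in json_ld.get("@graph", []):
        if node.get("@type") == "Article":
            author = node.get("author") or {}
            return {
                "author": author.get("name"),
                # ISO 8601 strings with a UTC offset parse directly.
                "date_published": datetime.fromisoformat(node["datePublished"]),
                "date_modified": datetime.fromisoformat(node["dateModified"]),
                "headline": node.get("headline"),
            }
    return None  # no Article node found


metadata = extract_article_metadata(json.loads(SAMPLE_JSON_LD))
print(metadata["author"])
print(metadata["date_published"].isoformat())
```

With the live page, the dictionary parsed from the script tag (apple_insider_dictionary in the code above) could be passed to the helper instead of the sample string.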
Let me know if you have any questions.
P.S. Please close this issue if my code and document help you.