
Authors and date are not correctly identified on a WordPress website

Open alekssamos opened this issue 2 years ago • 2 comments

>>> from newspaper import Article

>>> url = 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html'
>>> article = Article(url)
>>> article.download()

>>> article.parse()

>>> article.authors
['Дизайн', 'Миша Гончаров', 'Воплощение']

>>> article.publish_date

The returned authors are wrong and publish_date comes back empty. In fact, there is one author, ['Иван Кузнецов'], and the publication date is 2021-10-05T12:00:54+00:00.

alekssamos avatar Feb 24 '22 11:02 alekssamos

  1. Some of the information is present in HTML attributes, for example datetime.
  2. The page also contains a script with JSON that holds all the original parameters, including the author and the full publication date with the time.

This is a WordPress site, so I'm surprised that the metadata was extracted incorrectly.
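The first point above (values carried in HTML attributes such as datetime) could be checked with something like the following minimal sketch. It uses only the standard library. The `TimeAttributeParser` class and the `SAMPLE_HTML` markup are hypothetical illustrations, and the sketch assumes the date sits in the datetime attribute of a `<time>` element, which may not match the real page.

```python
from datetime import datetime
from html.parser import HTMLParser


class TimeAttributeParser(HTMLParser):
    """Collect the datetime attribute of every <time> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.datetimes.append(value)


# Hypothetical markup; the real page may structure this differently.
SAMPLE_HTML = (
    '<article><time datetime="2021-10-05T12:00:54+00:00">'
    "5 October 2021</time></article>"
)

parser = TimeAttributeParser()
parser.feed(SAMPLE_HTML)

# Parse the first value found, if any, into a timezone-aware datetime.
publish_date = (
    datetime.fromisoformat(parser.datetimes[0]) if parser.datetimes else None
)
print(publish_date)
```

With a downloaded article, `article.html` could be fed to the parser in place of `SAMPLE_HTML`.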

alekssamos avatar Feb 24 '22 11:02 alekssamos

Here is an overview document that I wrote on using newspaper3k. This document outlines how to extract the data elements from your page's structure.

Here is some basic code to get you started:

import json
from bs4 import BeautifulSoup
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

article = Article("https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html", config=config)
article.download()
article.parse()

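# Parse the downloaded HTML and locate the first JSON-LD <script> block.
soup = BeautifulSoup(article.html, 'html.parser')
# Join the script's contents into one string and decode it as JSON.
apple_insider_dictionary = json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))
print(apple_insider_dictionary)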

This produces the following output:

{'@context': 'http://schema.org', '@graph': [{'@type': 'WebSite', 'url': 'https://appleinsider.ru', 'potentialAction': {'@type': 'SearchAction', 'target': 'https://appleinsider.ru?s={s}', 'query-input': 'required name=s', 'query': 'required name=s'}}, {'@type': 'BreadcrumbList', 'name': 'Breadcrumbs', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://appleinsider.ru', 'name': 'AppleInsider.ru'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'https://appleinsider.ru/tags', 'name': 'Темы'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://appleinsider.ru/iphone', 'name': 'iPhone'}}]}, {'@type': 'Article', '@id': 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html', 'name': 'Сколько стоят компоненты iPhone 13 Pro? Спойлер: это не себестоимость', 'headline': 'Сколько стоят компоненты iPhone 13 Pro? Спойлер: это не себестоимость', 'datePublished': '2021-10-05T12:00:54+00:00', 'dateModified': '2021-10-05T09:49:28+00:00', 'author': {'@type': 'Person', 'name': 'Иван Кузнецов'}, 'image': 'https://appleinsider.ru/wp-content/uploads/2021/10/iPhone_13_pro_true_cost-740x416.jpg', 'mainEntityOfPage': 'https://appleinsider.ru/iphone/skolko-stoyat-komponenty-iphone-13-pro-spojler-eto-ne-sebestoimost.html', 'publisher': {'@type': 'Organization', 'name': 'AppleInsider.ru', 'logo': {'@type': 'ImageObject', 'url': 'https://appleinsider.ru/wp-content/themes/101media-ai-2015/img/logo_mini.png'}}}]}

You can obtain these data elements from this JSON:

  • Author name
  • datePublished
  • dateModified
  • headline
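As a rough sketch of how those elements could be pulled out of the dictionary, the hypothetical helper `extract_article_fields` below looks for the node whose @type is 'Article'. It assumes the nodes sit under '@graph' as in the output above, or at the top level, and that the author is a single object or a list of objects. The `sample` dictionary is a trimmed copy of that output, not the full data.

```python
def extract_article_fields(json_ld):
    """Return selected fields from the first 'Article' node in a JSON-LD dict."""
    # Some pages wrap nodes in '@graph'; others use a single top-level node.
    nodes = json_ld.get("@graph", [json_ld])
    for node in nodes:
        if node.get("@type") == "Article":
            author = node.get("author") or {}
            # 'author' may be a single object or a list of objects.
            authors = author if isinstance(author, list) else [author]
            return {
                "authors": [
                    a["name"]
                    for a in authors
                    if isinstance(a, dict) and "name" in a
                ],
                "datePublished": node.get("datePublished"),
                "dateModified": node.get("dateModified"),
                "headline": node.get("headline"),
            }
    return None


# Trimmed copy of the output above, kept to the fields of interest.
sample = {
    "@context": "http://schema.org",
    "@graph": [
        {"@type": "WebSite", "url": "https://appleinsider.ru"},
        {
            "@type": "Article",
            "headline": "Сколько стоят компоненты iPhone 13 Pro? "
                        "Спойлер: это не себестоимость",
            "datePublished": "2021-10-05T12:00:54+00:00",
            "dateModified": "2021-10-05T09:49:28+00:00",
            "author": {"@type": "Person", "name": "Иван Кузнецов"},
        },
    ],
}

fields = extract_article_fields(sample)
print(fields["authors"], fields["datePublished"])
```

In the code above, `apple_insider_dictionary` could be passed in place of `sample`.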

Let me know if you have any questions.

P.S. Please close this issue if my code and document help you.

johnbumgarner avatar Aug 07 '22 13:08 johnbumgarner