trafilatura icon indicating copy to clipboard operation
trafilatura copied to clipboard

Removing related links at end of article/sidebar on news websites?

Open rahulbot opened this issue 3 months ago • 2 comments

Over here in the Media Cloud project we're seeing poor performance on the content extraction task for a variety of pages that include links to other "related" stories at the end of article content. Our use case is trying to extract only article content as text. Do you have advice on tweaks to make to improve that performance? This might be the opposite of #518, because we do not want related links as part of content.

Here's sample code with real examples parsed in a way that looks very similar to our usage. The function returns true if the supplied text is included in the extracted content (the erroneous results, in our use case). Each of these incorrectly includes text that is part of a "related links" type callout that appears after article content. Any advice appreciated.

import trafilatura
import requests
MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

def is_text_in_webpage_content(txt:str, url:str) -> bool:
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT},timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False)
    content_text = parsed['text']
    return txt in content_text

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))

rahulbot avatar May 03 '24 17:05 rahulbot