Are there any settings that allow us to make sure that the full article is scraped instead of just the initial part of it?

Open armsp opened this issue 3 years ago • 7 comments

The article in question is this one from The New Yorker. I ran the following code:

from trafilatura import bare_extraction, fetch_url, extract
from trafilatura.settings import use_config

myconfig = use_config('./settings.cfg')

url = "https://www.newyorker.com/magazine/2021/07/05/my-apology"

downloaded = fetch_url(url, config=myconfig)

result = extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)
print(result)

And instead of the whole article I just get a few paragraphs.

If I use no_fallback=False then I get a few more lines, but it's still far from complete.

Is there any advanced usage of the library that ensures the full article is extracted?

armsp avatar Jul 02 '21 15:07 armsp

Hi @armsp,

Strictly speaking, this isn't an extraction problem but a rendering issue, which is why changing the setting doesn't affect the output. The text stops at "you did something wrong.”" because that is where the text ends in the HTML file if you don't execute JavaScript. My guess is that the remaining text elements are loaded after the page is opened.

You could have a look at pyppeteer to download and render the page before passing it to trafilatura.
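A minimal sketch of that approach, assuming pyppeteer and trafilatura are installed and that waiting for the network to go idle is enough for script-loaded text to arrive (this is not verified for the article above). The helper names `render_page` and `extract_rendered` are made up for illustration.

```python
import asyncio


async def render_page(url: str) -> str:
    """Open the URL in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module can be loaded without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        # Assumption: 'networkidle2' is a good enough signal that the
        # JavaScript-loaded text is present; some pages may need more waiting.
        await page.goto(url, {"waitUntil": "networkidle2"})
        return await page.content()
    finally:
        # Always shut the browser down, even if navigation fails.
        await browser.close()


def extract_rendered(url: str):
    """Render the page first, then hand the HTML string to trafilatura."""
    from trafilatura import extract

    html = asyncio.run(render_page(url))
    return extract(html)
```

Calling `extract_rendered(url)` with the article URL would then return the extracted text, or `None` if trafilatura finds nothing. Whether this recovers the full article depends on how the page loads its content.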

adbar avatar Jul 02 '21 17:07 adbar

@adbar I think you are right. To test that hypothesis I used Selenium to scroll to the end of the page and passed the HTML to trafilatura. Unfortunately, I get the same result as before.

import trafilatura
from trafilatura import extract
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.headless = False

driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Users\\---\\chromedriver.exe')

url = 'https://www.newyorker.com/magazine/2021/07/05/my-apology'

def scroll_down():
    global driver
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height


driver.get(url)
scroll_down()
with open("page_source.html", "w+", encoding="utf-8") as f:
    f.write(driver.page_source)

downloaded = trafilatura.load_html(driver.page_source)
data = extract(downloaded, with_metadata=True)
print(data)

You can open the page_source.html file and see that the whole page has been downloaded, yet trafilatura extracts only part of it.
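One way to make this check without reading the file by eye is to look for a sentence from late in the article in both the saved HTML and the extracted text. The sketch below does only that comparison. `diagnose` and its return labels are hypothetical, and the probe sentence has to be copied from the article by hand.

```python
from typing import Optional


def diagnose(raw_html: str, extracted: Optional[str], probe: str) -> str:
    """Report at which stage a sentence from late in the article goes missing."""
    extracted = extracted or ""  # extract() can return None
    if probe not in raw_html:
        return "not downloaded"  # points to a download or rendering problem
    if probe not in extracted:
        return "not extracted"   # points to a content-detection problem
    return "ok"
```

For the script above this would be called as `diagnose(driver.page_source, data, "...")`. A plain substring test can miss sentences that are interrupted by inline tags or HTML entities, so a short probe made of plain words is the safer choice.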

Any suggestions on how to ensure that it gets everything right?

armsp avatar Jul 06 '21 19:07 armsp

Hi @armsp, thanks for looking deeper into it!

According to my tests, Trafilatura outputs too little text and Readability too much. Concerning Trafilatura, the problem has to do with the XPath expressions used to find the main content. For some reason, the webpage uses a <div class="article__body"> tag that doesn't contain the whole article :( I don't want to change the behavior of the software for cases like this, as it would harm precision in general. I'll keep track of the bug and try to improve it in the future.
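To illustrate the kind of failure described here, the sketch below runs an XPath query on a made-up page whose text is spread over two containers with the same class. It uses the standard library's ElementTree rather than Trafilatura's own expressions, and the markup is an assumption, not a copy of the New Yorker page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the article text is split over two containers.
PAGE = """<html><body>
  <div class="article__body"><p>First part.</p></div>
  <div class="ad">Advertisement</div>
  <div class="article__body"><p>Second part.</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
query = ".//div[@class='article__body']"

# All containers that carry article text.
matches = root.findall(query)

# An extractor that stops at the first match sees only the opening part.
first = root.find(query)
first_text = "".join(first.itertext()).strip()

print(len(matches))
print(first_text)
```

Here two containers match, but the first one alone yields only the opening paragraph, which mirrors the truncated output reported above.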

If you want to override the main extraction and still use Trafilatura's functionality, please have a look at the external functions in the docs. The Readability algorithm outputs too much text in this case, so you can decide what to do.

adbar avatar Jul 08 '21 14:07 adbar

Side note, idea by @naftalibeder: check for additional siblings with the same class name as the found article.
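A rough sketch of that idea, using only the standard library and hypothetical markup. It is not the code from any Trafilatura pull request, and `with_same_class_siblings` is a made-up helper. It compares the class attribute exactly, which a real page with several class tokens may not satisfy.

```python
import xml.etree.ElementTree as ET


def with_same_class_siblings(root, node):
    """Return node and its siblings sharing its class, in document order."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(node)
    cls = node.get("class")
    if parent is None or cls is None:
        return [node]
    return [child for child in parent if child.get("class") == cls]


# Hypothetical page: the article continues in a second container.
PAGE = """<html><body>
  <div class="article__body"><p>First part.</p></div>
  <div class="ad">Advertisement</div>
  <div class="article__body"><p>Second part.</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
found = root.find(".//div[@class='article__body']")
parts = with_same_class_siblings(root, found)
text = " ".join("".join(part.itertext()).strip() for part in parts)
print(text)
```

The advertisement block is skipped because its class differs, while both article containers are kept.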

adbar avatar Jan 14 '22 17:01 adbar

@adbar I’m having a go at a PR that handles this case, I’ll share when I have something.

naftalibeder avatar Jan 15 '22 14:01 naftalibeder

Nice, please go ahead!

adbar avatar Jan 17 '22 16:01 adbar

I believe this is fixed in #163. It's interesting to note that at least with New Yorker articles, Readability also frequently fails to get all of the content. Looks like it's not about to get fixed. :)

naftalibeder avatar Jan 20 '22 04:01 naftalibeder

I can confirm that the issue appears to be fixed.

adbar avatar Feb 05 '24 11:02 adbar