trafilatura
Are there any settings that allow us to make sure that the full article is scraped instead of just the initial part of it?
The article in question is this by The New Yorker. I ran the code as follows -
from trafilatura import bare_extraction, fetch_url, extract
from trafilatura.settings import use_config
myconfig = use_config('./settings.cfg')
url = "https://www.newyorker.com/magazine/2021/07/05/my-apology"
downloaded = fetch_url(url, config=myconfig)
result = extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)
print(result)
And instead of the whole article I just get a few paragraphs. If I use no_fallback=False then I get a few more lines, but it's still far from complete.
Is there any advanced usage of the library that can ensure the full article is extracted?
Hi @armsp,
Strictly speaking, this isn't an extraction problem but a rendering issue, which is why changing the settings doesn't affect the output. The text stops at "you did something wrong.”" because that is where the text ends in the HTML file if you don't execute JavaScript. My guess is that the remaining text elements are loaded after the page is opened.
You could have a look at pyppeteer to download and render the page before passing it to trafilatura.
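One way to test this diagnosis before reaching for a headless browser is to check whether a sentence from late in the article appears in the unrendered HTML at all. The following is a minimal sketch using only the standard library; `TextCollector` and `phrase_in_static_html` are hypothetical helper names, not part of trafilatura, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def phrase_in_static_html(html, phrase):
    """Return True if `phrase` occurs in the text of the unrendered HTML."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Normalize whitespace on both sides so line breaks do not matter.
    text = " ".join(" ".join(collector.chunks).split())
    return " ".join(phrase.split()) in text


# Invented sample: the last paragraph only exists inside a script.
sample = (
    "<html><body><p>First paragraph.</p>"
    "<script>var late = 'Final paragraph.';</script>"
    "</body></html>"
)
print(phrase_in_static_html(sample, "First paragraph."))
print(phrase_in_static_html(sample, "Final paragraph."))
```

If a sentence from the end of the article is missing from the downloaded HTML, the page needs rendering; if it is present, the problem lies in the extraction step.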
@adbar I think you are right. To test that hypothesis I used selenium to scroll to the end and pass the html to trafilatura. But unfortunately I get the same result as before.
import trafilatura
from trafilatura import extract
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = False
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Users\\---\\chromedriver.exe')
url = 'https://www.newyorker.com/magazine/2021/07/05/my-apology'
def scroll_down():
    global driver
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
driver.get(url)
scroll_down()
with open("page_source.html", "w+", encoding="utf-8") as f:
    f.write(driver.page_source)
downloaded = trafilatura.load_html(driver.page_source)
data = extract(downloaded, with_metadata=True)
print(data)
Here you can open the page_source.html file and see that the whole page has been downloaded, yet trafilatura extracts only a part of it.
Any suggestions on how to ensure that it gets everything right?
Hi @armsp, thanks for looking deeper into it!
According to my tests Trafilatura outputs too little text and Readability too much... Concerning Trafilatura, the problem has to do with the XPath expressions used to find the main content.
For some reason, the webpage uses a <div class="article__body"> tag that doesn't contain the whole article :(
I don't want to change the behavior of the software for cases like this as it harms precision in general. I'll keep track of the bug and try to improve it in the future.
If you want to override the main extraction and still use Trafilatura's functionality, please have a look at the external functions in the docs. The Readability algorithm outputs too much text in this case, so you can decide what to do.
Side note, idea by @naftalibeder: check for additional siblings with the same class name as the found article.
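The sibling idea can be illustrated with a small sketch. This is not the change made in trafilatura itself; it only shows the principle on an invented, well-formed snippet using the standard library's ElementTree (trafilatura works on lxml trees, where the same logic would apply). The function name `collect_same_class_siblings` is hypothetical.

```python
import xml.etree.ElementTree as ET


def collect_same_class_siblings(root, class_name):
    """Find the first element carrying `class_name` in document order and
    return it together with its siblings that carry the same class."""

    def has_class(element):
        return class_name in element.get("class", "").split()

    # ElementTree has no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    first = next((el for el in root.iter() if has_class(el)), None)
    if first is None:
        return []
    parent = parent_of.get(first)
    if parent is None:  # the match is the root element itself
        return [first]
    return [el for el in parent if has_class(el)]


# Invented sample: the article body is split into two sibling containers.
doc = ET.fromstring(
    "<main>"
    '<div class="article__body"><p>Part one.</p></div>'
    '<div class="ad">Advertisement</div>'
    '<div class="article__body"><p>Part two.</p></div>'
    "</main>"
)
parts = collect_same_class_siblings(doc, "article__body")
text = " ".join("".join(el.itertext()).strip() for el in parts)
print(text)
```

Restricting the search to siblings of the first match, rather than to every element with that class anywhere in the page, is one way to limit the loss of precision mentioned above.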
@adbar I’m having a go at a PR that handles this case, I’ll share when I have something.
Nice, please go ahead!
I believe this is fixed in #163. It's interesting to note that at least with New Yorker articles, Readability also frequently fails to get all of the content. Looks like it's not about to get fixed. :)
I can confirm that the issue appears to be fixed.