langchain
Add an option to WebBaseLoader to handle dynamic content that is loaded via JavaScript.
Feature request
When you request a webpage with a library like requests or aiohttp, you get the initial HTML of the page, but any content that is loaded via JavaScript after the page loads is not included. That's why you may see template tags like {{item.price}} taka instead of the actual values: those tags are placeholders that JavaScript fills in with real data after the page loads.
To handle this, you need a library that can execute JavaScript. A commonly used one is Selenium, but it is heavier than requests or aiohttp because it requires running an actual web browser.
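As a quick illustration of the symptom described above (a stdlib-only sketch, not part of WebBaseLoader; the helper name is hypothetical), you can detect when fetched HTML still contains unrendered template placeholders such as {{item.price}}:

```python
import re

# Matches unrendered client-side template tags such as {{ item.price }}
PLACEHOLDER_RE = re.compile(r"\{\{\s*[\w.]+\s*\}\}")

def has_unrendered_placeholders(html: str) -> bool:
    """Heuristic: True if the HTML still contains template tags,
    i.e. JavaScript has not yet filled in the dynamic content."""
    return bool(PLACEHOLDER_RE.search(html))

# Raw HTML as returned by requests/aiohttp (before any JavaScript runs)
raw_html = "<span class='price'>{{item.price}} taka</span>"
# The same element after a real browser has executed the page's JavaScript
rendered_html = "<span class='price'>499 taka</span>"

print(has_unrendered_placeholders(raw_html))       # True
print(has_unrendered_placeholders(rendered_html))  # False
```

A check like this could decide whether a plain HTTP fetch was enough or whether a JavaScript-capable fetcher is needed.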
But is there another option that doesn't require running a full browser, and that can be used in LangChain without a graphical interface, such as a headless browser tool like pyppeteer (a Python wrapper for Puppeteer)?
Anyway, please consider solving this issue and adding a feature like this. Thanks in advance.
Motivation
To get dynamic content from a webpage while scraping text from a website or webpage.
Your contribution
On my side, I rewrote the _fetch method of your WebBaseLoader class to use pyppeteer instead of aiohttp. It is still not working, but I think it might help a little. Here is my code, where I override the class:
import asyncio
import logging

import pyppeteer
from langchain.document_loaders import WebBaseLoader as BaseWebBaseLoader

logger = logging.getLogger(__name__)

class WebBaseLoader(BaseWebBaseLoader):
    async def _fetch(
        self, url: str, selector: str = 'body', retries: int = 3, cooldown: int = 2, backoff: float = 1.5
    ) -> str:
        for i in range(retries):
            browser = None
            try:
                browser = await pyppeteer.launch()
                page = await browser.newPage()
                await page.goto(url)
                await page.waitForSelector(selector)  # waits for a specific element to be rendered
                await asyncio.sleep(5)  # extra grace period for late-loading content
                # page.content() returns the full HTML, including dynamically loaded content
                return await page.content()
            except Exception as e:
                if i == retries - 1:
                    raise
                logger.warning(
                    f"Error fetching {url} with attempt "
                    f"{i + 1}/{retries}: {e}. Retrying..."
                )
                await asyncio.sleep(cooldown * backoff ** i)
            finally:
                if browser is not None:
                    await browser.close()  # avoid leaking a browser on failure
        raise ValueError("retry count exceeded")
and install these two dependencies:

pip install pyppeteer
pyppeteer-install
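For reference, the retry delay in the _fetch override above grows geometrically as cooldown * backoff ** i, and the final attempt re-raises instead of sleeping. A stdlib-only sketch of that schedule (the helper name is hypothetical):

```python
def backoff_delays(retries: int = 3, cooldown: int = 2, backoff: float = 1.5) -> list[float]:
    """Delay (in seconds) slept after each failed attempt i; the last
    attempt raises instead of sleeping, so there are retries - 1 delays."""
    return [cooldown * backoff ** i for i in range(retries - 1)]

print(backoff_delays())  # [2.0, 3.0]
```

With the defaults, a failing URL is retried after 2 and then 3 seconds before the error is finally raised.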