
Make an option in WebBaseLoader to handle dynamic content that is loaded via JavaScript.

Open · shamspias opened this issue 1 year ago · 0 comments

Feature request

When you request a webpage with a library like requests or aiohttp, you only get the initial HTML of the page; any content that is loaded via JavaScript after the page loads is not included. That is why you might see template tags like {{ item.price }} taka instead of the actual values: those tags are placeholders that JavaScript fills in with real data after the page loads.
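
For example, a quick check with requests makes the gap visible. This is a minimal sketch; the URL and the {{ item.price }} placeholder are illustrative assumptions about a page that renders prices client-side.

import requests

# Hypothetical product page that fills in prices with client-side JavaScript
# (assumed URL, for illustration only).
url = "https://example.com/products"
html = requests.get(url, timeout=10).text

# requests never executes the page's JavaScript, so the raw HTML can still
# contain the unrendered placeholder instead of a real price.
print("{{ item.price }}" in html)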

To handle this, you need a library that can execute JavaScript. A commonly used one is Selenium, but it is heavier than requests or aiohttp because it requires running an actual web browser. Is there another option that does not need a full browser, or that can be used in LangChain without a graphical interface, for example a headless browser tool like pyppeteer (a Python port of Puppeteer)?

Anyway, please consider adding a feature like this. Thanks in advance.

Motivation

To capture dynamically loaded content when scraping text from a website or webpage.

Your contribution

From my side, I rewrote the _fetch method in your WebBaseLoader class to use pyppeteer instead of aiohttp. It is still not working, but I think it might help a little. Here is my code, where I override the class:

import asyncio
import logging

import pyppeteer
from langchain.document_loaders import WebBaseLoader as BaseWebBaseLoader

logger = logging.getLogger(__name__)


class WebBaseLoader(BaseWebBaseLoader):

    async def _fetch(
            self, url: str, selector: str = 'body', retries: int = 3, cooldown: int = 2, backoff: float = 1.5
    ) -> str:
        for i in range(retries):
            browser = None
            try:
                browser = await pyppeteer.launch()
                page = await browser.newPage()
                await page.goto(url)
                await page.waitForSelector(selector)  # wait until the target element exists in the DOM
                await asyncio.sleep(5)  # give dynamically loaded content a few extra seconds to render
                # page.content() returns the fully rendered HTML, including any
                # content injected by JavaScript after the initial load.
                content = await page.content()
                return content
            except Exception as e:
                if i == retries - 1:
                    raise
                logger.warning(
                    f"Error fetching {url} with attempt "
                    f"{i + 1}/{retries}: {e}. Retrying..."
                )
                await asyncio.sleep(cooldown * backoff ** i)
            finally:
                # Close the browser whether the fetch succeeded or failed,
                # so repeated retries do not leak Chromium processes.
                if browser is not None:
                    await browser.close()
        raise ValueError("retry count exceeded")


and install these two dependencies (the second command downloads the Chromium build that pyppeteer uses):

pip install pyppeteer
pyppeteer-install
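
Once the dependencies are installed, a minimal usage sketch might look like this. The URL is a placeholder, and note that, depending on the langchain version, it is the async scraping path (e.g. aload() or fetch_all()) that goes through _fetch, while plain load() may still use the synchronous requests-based path.

# Assumes the WebBaseLoader subclass defined above is in scope.
loader = WebBaseLoader("https://example.com/products")  # hypothetical URL
docs = loader.aload()  # the async path is what ends up calling the overridden _fetch
print(docs[0].page_content[:500])

Launching a fresh headless browser for every fetch and retry is fairly expensive; reusing a single browser instance across URLs would be faster, but the snippet above keeps the structure of the original _fetch.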

shamspias · May 17 '23 06:05