Extracting data from an iframe?

Open ehubb20 opened this issue 1 year ago • 5 comments

What is the best method for extracting data from an iframe using Crawl4ai?

Here is an example of the iframe I am trying to capture:

<div class="list-items new_properties_scroll"><ul><li><div class="list-item-des"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"></a><div class="container-fluid" style="max-width:1399px;"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"></a><div class="row item"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&amp;companyID=13160&amp;source=iframe" target="_blank"><div class="col-md-2 col-sm-2"><div style="background-image: url(https://s3.amazonaws.com/Rently_dev/images/51453851/medium);"></div></div><div class="col-md-4 col-sm-4 col-xs-4 basic-info"><div class="price priceWithTooltip"><h2><span class="amount">$1757</span><span class="unit"> / month</span></h2></div><div class="available-date"><h2>Available: Now</h2></div><span class="mini-address">231 Crestview Way, Dallas, GA, 30132, Un...</span><div class="info"><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/bed.svg"><span><strong style="font-size: 1.5em;">3</strong> Bed(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/shower-head.svg"><span><strong style="font-size: 1.5em;">2.5</strong> Bath(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/cat_dog.svg"><span style="line-height: 2.2;"> Cat + Dog</span></div><div class="col-md-6 col-sm-3 col-xs-3" style="line-height: 30px;"><img class="center" src="/assets/sq_ft.svg"><span>1530 Sq ft</span></div>

ehubb20 avatar Sep 23 '24 18:09 ehubb20

@ehubb20 Let me check and update you.

unclecode avatar Sep 26 '24 07:09 unclecode

Hey @unclecode any update on this? I too am trying to figure out how to parse iframe content

b-sai avatar Oct 09 '24 17:10 b-sai

Any update on this?

shhivam avatar Oct 14 '24 11:10 shhivam

Hello everyone (@ehubb20, @b-sai, @shhivam), sorry for the late reply. We've been very busy shipping new features, and one of them is extracting content from iframes. It's still early days: the feature ships with version 0.3.6, which we plan to release tomorrow. I definitely expect some bugs, so please try it and report any issues you come across so we can fix them right away.

It currently extracts the content of the iframe's "body" and replaces the iframe with a div element in the main page, so the iframe content becomes part of the main page. You can think of it as a form of flattening, where only the body content of the iframe is carried over. We plan to add more options and parameters for extracting these elements.
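To make the flattening idea concrete, here is a minimal sketch in plain Python. It is not Crawl4AI's actual implementation: `flatten_iframes`, the `iframe-content` class name, and the example URL are all hypothetical, and the iframe bodies are assumed to have been fetched already.

```python
import re

# Matches an <iframe> element and captures its src attribute.
# A regex is enough for a sketch; real pages deserve a proper HTML parser.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*\bsrc=["\']([^"\']+)["\'][^>]*>.*?</iframe>',
    re.IGNORECASE | re.DOTALL,
)

def flatten_iframes(page_html: str, iframe_bodies: dict[str, str]) -> str:
    """Replace each known <iframe> with a <div> holding that iframe's body HTML.

    Hypothetical helper: `iframe_bodies` maps an iframe src to body HTML
    that was fetched beforehand.
    """
    def replace(match: re.Match) -> str:
        body = iframe_bodies.get(match.group(1))
        if body is None:
            return match.group(0)  # leave iframes we have no content for untouched
        return f'<div class="iframe-content">{body}</div>'

    return IFRAME_RE.sub(replace, page_html)

page = '<main><h1>Listings</h1><iframe src="https://example.com/widget"></iframe></main>'
bodies = {"https://example.com/widget": '<span class="amount">$1757</span>'}
flat = flatten_iframes(page, bodies)
print(flat)
```

After flattening, selectors and text extraction that run on the main page can see the price element as if it had never been inside an iframe.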

Btw, even without this feature, when you crawl a page you get all of its internal and external links, and you can then scrape those links for iframes as well. That already gives you a lot of options.
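A rough sketch of that manual route, using only the Python standard library: collect the `src` of every iframe in the page HTML and resolve it against the page URL, so each one can be crawled as a page of its own. `IframeSrcCollector`, `iframe_urls`, and the example URLs are hypothetical names for illustration, not part of Crawl4AI.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class IframeSrcCollector(HTMLParser):
    """Collects absolute URLs of all <iframe src="..."> elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:  # skip iframes without a src (for example srcdoc-only frames)
            self.sources.append(urljoin(self.base_url, src))

def iframe_urls(page_html: str, base_url: str) -> list[str]:
    collector = IframeSrcCollector(base_url)
    collector.feed(page_html)
    return collector.sources

html = '<div><iframe src="/embed/listings?companyID=13160"></iframe><iframe></iframe></div>'
urls = iframe_urls(html, "https://example.com/rentals")
print(urls)
```

Each URL in `urls` could then be passed to a separate `crawler.arun(url=...)` call, assuming the iframe page can be loaded on its own.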

Anyway, I've shared a code sample below. Once we update the library, you should be able to use it. I'd appreciate it if you could let us know about any bugs or issues you encounter.

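import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "https://zcgwq2-5000.csb.app"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,     # skip the cache and fetch the page fresh
            process_iframes=True,  # merge iframe body content into the main page
        )

asyncio.run(main())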

I'll keep the issue open in case you run into any errors.

unclecode avatar Oct 14 '24 12:10 unclecode

Thanks @unclecode for the clarification!

shhivam avatar Oct 14 '24 12:10 shhivam

@shhivam The iframe extraction is already available, please check:

async def test_iframe():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "URL-HERE"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            process_iframes=True
        )

unclecode avatar Oct 17 '24 07:10 unclecode

Does the process_iframes=True parameter actually work? I have the same problem scraping content from an iframe. Here is the key part of my code:

import os
from playwright.async_api import BrowserContext, Page

async def after_goto(page: Page, context: BrowserContext, url: str, **kwargs):
    # Wait for the page to finish loading
    await page.wait_for_selector("#me-iframe-container", state="attached")
    frameLocator = page.frame_locator("#me-iframe-container")

    # Note: frame_locator() always returns a locator object, so this check is always true
    if frameLocator:
        print("frameLocator found")
        # "全部订单" is the on-page label meaning "All orders"
        locator = frameLocator.get_by_text("全部订单")
        await locator.click()
    else:
        print("frameLocator not found")
    os.makedirs("screenshots", exist_ok=True)
    await page.screenshot(path=os.path.abspath(os.path.join("screenshots", "screenshot.png")))
    return page

  async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("after_goto", after_goto)   
        crawler.crawler_strategy.set_hook("before_return_html", before_return_html) 
        result = await crawler.arun(
            process_iframes=True,
            config=crawl_config,
            url="https://me.meituan.com/ebooking/merchant/ebIframe?iUrl=%2Febooking%2Forder-eb%2Findex.html%23%2Fall",
        )
        if result.success:
            print("Successfully accessed private data with your identity!")
            print(result.cleaned_html)
            print(len(result.cleaned_html))
        else:
            print("Error:", result.error_message)

I also checked the Crawl4ai documentation, which says to put the parameter in CrawlerRunConfig like this:

config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )

Neither way worked for me.
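For anyone debugging the same setup, one thing that may be worth trying is reading the frames directly through Playwright inside the hook, independent of process_iframes. The sketch below assumes `page` is the Playwright async `Page` that the hook receives; `collect_iframe_html` is a hypothetical helper, not a Crawl4AI API, and the stand-in classes exist only so the example runs without a browser.

```python
import asyncio

async def collect_iframe_html(page):
    """Return (url, html) pairs for every non-main frame of a Playwright Page."""
    collected = []
    for frame in page.frames:
        if frame is page.main_frame:
            continue
        try:
            collected.append((frame.url, await frame.content()))
        except Exception as exc:  # detached or still-navigating frames can fail
            collected.append((frame.url, f"<!-- frame not readable: {exc} -->"))
    return collected

# Stand-ins that mimic the small part of the Playwright API used above.
class FakeFrame:
    def __init__(self, url, html):
        self.url = url
        self._html = html

    async def content(self):
        return self._html

class FakePage:
    def __init__(self):
        self.main_frame = FakeFrame("https://example.com/", "<html>main</html>")
        inner = FakeFrame("https://example.com/embed", "<html>inner</html>")
        self.frames = [self.main_frame, inner]

frames_html = asyncio.run(collect_iframe_html(FakePage()))
print(frames_html)
```

In a real run the call would be `await collect_iframe_html(page)` inside a hook such as after_goto, after waiting for the iframe to attach.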

CrpMihasha avatar Mar 28 '25 09:03 CrpMihasha

It doesn't work for me either. Hi @CrpMihasha, have you found a solution?

JoysKang avatar Apr 29 '25 01:04 JoysKang

Hi, your email has been received.

CrpMihasha avatar Apr 29 '25 01:04 CrpMihasha