crawl4ai
Extracting data from an iframe?
What is the best method for extracting data from an iframe using Crawl4ai?
Here is an example of the iframe I am trying to capture:
<div class="list-items new_properties_scroll"><ul><li><div class="list-item-des"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&companyID=13160&source=iframe" target="_blank"></a><div class="container-fluid" style="max-width:1399px;"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&companyID=13160&source=iframe" target="_blank"></a><div class="row item"><a class="list_image_click" href="https://homes.rently.com/homes-for-rent/properties/4203494?fromsearch=true&companyID=13160&source=iframe" target="_blank"><div class="col-md-2 col-sm-2"><div style="background-image: url(https://s3.amazonaws.com/Rently_dev/images/51453851/medium);"></div></div><div class="col-md-4 col-sm-4 col-xs-4 basic-info"><div class="price priceWithTooltip"><h2><span class="amount">$1757</span><span class="unit"> / month</span></h2></div><div class="available-date"><h2>Available: Now</h2></div><span class="mini-address">231 Crestview Way, Dallas, GA, 30132, Un...</span><div class="info"><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/bed.svg"><span><strong style="font-size: 1.5em;">3</strong> Bed(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/shower-head.svg"><span><strong style="font-size: 1.5em;">2.5</strong> Bath(s)</span></div><div class="col-md-6 col-sm-3 col-xs-3"><img class="center" src="/assets/cat_dog.svg"><span style="line-height: 2.2;"> Cat + Dog</span></div><div class="col-md-6 col-sm-3 col-xs-3" style="line-height: 30px;"><img class="center" src="/assets/sq_ft.svg"><span>1530 Sq ft</span></div>
@ehubb20 Let me check and update you.
Hey @unclecode any update on this? I too am trying to figure out how to parse iframe content
Any update on this?
Hello everyone, @ehubb20 @b-sai @shhivam, sorry for the late reply. We've been very busy shipping a lot of new features, and one of them is extracting exactly this kind of information from iframes. It's still early days; it will ship with the new version 0.3.6, which we're going to release by tomorrow. I definitely expect some bugs, so please use it and report any issues you come across so we can fix them right away.
It currently extracts the content of the iframe's "body" and replaces the iframe with a div element in the main page, so the iframe content becomes part of the main page. You can think of it as a way of flattening, but what we extract is the body content of the iframe. We plan to add more options and parameters for extracting these elements.
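To make the flattening idea concrete, here is a minimal, library-independent sketch. It is not Crawl4AI's implementation: the function name flatten_iframes, the flattened-iframe class, and the iframe_bodies mapping are all hypothetical, and a regex is used only to keep the example short (a real implementation would work on the live DOM in the browser).

```python
import re

# Matches an <iframe> element that has a double-quoted src attribute.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*\bsrc="([^"]*)"[^>]*>.*?</iframe>',
    re.IGNORECASE | re.DOTALL,
)


def flatten_iframes(main_html: str, iframe_bodies: dict[str, str]) -> str:
    """Replace each <iframe> with a <div> holding that iframe's body HTML.

    iframe_bodies maps an iframe's src to the inner HTML of its <body>
    (hypothetical input; in a browser this would come from the frame itself).
    Iframes whose src is not in the mapping are left untouched.
    """

    def replace(match: re.Match) -> str:
        src = match.group(1)
        if src not in iframe_bodies:
            return match.group(0)
        return f'<div class="flattened-iframe">{iframe_bodies[src]}</div>'

    return IFRAME_RE.sub(replace, main_html)


page = '<h1>Listings</h1><iframe src="https://example.com/widget"></iframe>'
bodies = {"https://example.com/widget": '<span class="amount">$1757</span>'}
flattened = flatten_iframes(page, bodies)
print(flattened)
```

After this step the price span sits directly in the main document, so ordinary selectors or markdown conversion can reach it without knowing an iframe was ever there.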
By the way, even without this feature, when you crawl a page you get all of its internal/external links, and you can then scrape those links for iframes in turn. That already gives you a lot of options.
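A related manual route, sketched here as an assumption rather than something the library does for you, is to pull the iframe src URLs out of the page HTML and crawl each one as its own page. The names IframeSrcCollector and find_iframe_urls and the example.com URLs are made up for illustration; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative URLs against the page that embeds the iframe.
            self.sources.append(urljoin(self.base_url, src))


def find_iframe_urls(html: str, base_url: str) -> list[str]:
    collector = IframeSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


html = '<div><iframe src="/listings?companyID=13160"></iframe></div>'
urls = find_iframe_urls(html, "https://example.com/rentals")
print(urls)
```

Each URL in the returned list could then be passed to crawler.arun(url=...) like any other page. This only helps when the iframe has a src that loads on its own; frames that depend on the parent page's session or scripts may not.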
Anyway, I've shared a code sample below. Once we update the library, you should be able to use it. I'd appreciate it if you could let us know about any bugs or issues you encounter.
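import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "https://zcgwq2-5000.csb.app"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            process_iframes=True,  # inline iframe body content into the main page
        )

asyncio.run(main())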
I'll keep the issue open in case you run into any errors.
Thanks @unclecode for the clarification!
@shhivam The iframe extraction is already available, please check:
async def test_iframe():
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        url = "URL-HERE"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            process_iframes=True,
        )
Does the process_iframes=True parameter actually work? I have the same problem scraping content from an iframe. Below is the key part of my code:
import os
from playwright.async_api import BrowserContext, Page

async def after_goto(page: Page, context: BrowserContext, url: str, **kwargs):
    # Wait for the page to finish loading
    await page.wait_for_selector("#me-iframe-container", state="attached")
    frameLocator = page.frame_locator("#me-iframe-container")
    if frameLocator:
        print("frameLocator found")
        locator = frameLocator.get_by_text("全部订单")  # button text, "All orders"
        await locator.click()
    else:
        print("frameLocator not found")
    os.makedirs("screenshots", exist_ok=True)
    await page.screenshot(path=os.path.abspath("screenshots\\screenshot.png"))
    return page

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
    result = await crawler.arun(
        process_iframes=True,
        config=crawl_config,
        url="https://me.meituan.com/ebooking/merchant/ebIframe?iUrl=%2Febooking%2Forder-eb%2Findex.html%23%2Fall",
    )
    if result.success:
        print("Successfully accessed private data with your identity!")
        print(result.cleaned_html)
        print(len(result.cleaned_html))
    else:
        print("Error:", result.error_message)
I also saw that the Crawl4ai documentation says to put the parameter in CrawlerRunConfig, like this:
config = CrawlerRunConfig(
    process_iframes=True,
    remove_overlay_elements=True
)
Neither approach worked for me.
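One way to narrow a case like this down, offered as a debugging sketch rather than a confirmed fix, is to check from inside a hook whether Playwright can read the iframe's document at all. The helper below relies only on page.frames, frame.parent_frame, frame.url and frame.content() from Playwright's async API; the name collect_frame_html and the Fake* stand-in classes are hypothetical and exist only so the sketch runs without a browser.

```python
import asyncio


async def collect_frame_html(page) -> dict[str, str]:
    """Return {frame URL: frame HTML} for every child frame of a page."""
    html_by_url = {}
    for frame in page.frames:
        if frame.parent_frame is None:
            continue  # skip the top-level document
        try:
            html_by_url[frame.url] = await frame.content()
        except Exception as exc:  # e.g. the frame detached while being read
            html_by_url[frame.url] = f"<!-- could not read frame: {exc} -->"
    return html_by_url


# Stand-in objects so the sketch runs without a browser (not Playwright classes).
class FakeFrame:
    def __init__(self, url, html, parent_frame=None):
        self.url = url
        self._html = html
        self.parent_frame = parent_frame

    async def content(self):
        return self._html


class FakePage:
    def __init__(self, frames):
        self.frames = frames


main = FakeFrame("https://example.com/", "<html>main</html>")
child = FakeFrame("https://example.com/orders", "<html>orders</html>", main)
frames_html = asyncio.run(collect_frame_html(FakePage([main, child])))
print(frames_html)
```

With a real page, calling await collect_frame_html(page) inside after_goto and printing the lengths shows whether the frame has loaded any content by that point. If it is empty there, the problem is likely timing or authentication rather than the process_iframes option itself.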
It doesn't work. Hi @CrpMihasha, have you found a solution?
Hi, your email has been received.