
How can we make crawling faster? It is slow for dynamically rendered websites

Open · roshan-sinha-dev opened this issue 1 year ago · 2 comments

I am trying to crawl links from websites, but it either returns empty results or takes too long to retrieve the links. How can I implement a strategy to run it faster, stop redundant processes to save time, or add a retry mechanism to make it more robust?

```python
# Inside an async method of my crawler class; requires:
#   import random
#   from bs4 import BeautifulSoup
#   from urllib.parse import urlparse, urljoin
try:
    result = await crawler.arun(
        url=url,
        bypass_cache=True,
        verbose=True,
        user_agent=random.choice(self.user_agents),
    )

    if hasattr(result, 'error_message') and result.error_message:
        print(f"Error encountered while crawling {url}: {result.error_message}")
        return []

    print(f"Successfully crawled: {result.url}")
    soup = BeautifulSoup(result.html, self.parser)
    links = set()
    base_netloc = urlparse(url).netloc

    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Remove trailing colon from href if present
        if href.endswith(':'):
            href = href.rstrip(':')

        # Resolve root-relative links against the page URL
        if href.startswith('/'):
            full_url = urljoin(url, href)
            links.add(full_url)
        else:
            # Keep only absolute links on the same domain
            href_netloc = urlparse(href).netloc
            if href_netloc == base_netloc or href.startswith(url):
                links.add(href)

    filtered_links = list(links)
    return filtered_links
except Exception as e:
    print(f"Failed to crawl {url}: {e}")
    return []
```
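
For the retry part, a minimal sketch of what a retry mechanism around `crawler.arun` could look like, with exponential backoff. The helper name `crawl_with_retry`, the attempt count, and the delays are placeholders, not part of Crawl4AI's API:

```python
import asyncio
import random

async def crawl_with_retry(crawler, url, user_agents, max_attempts=3, base_delay=2.0):
    """Hypothetical helper: retry crawler.arun() a few times with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await crawler.arun(
                url=url,
                bypass_cache=True,
                verbose=True,
                user_agent=random.choice(user_agents),
            )
            # Treat an explicit error_message or empty HTML as a failed attempt
            if getattr(result, 'error_message', None) or not result.html:
                raise RuntimeError(result.error_message or "empty HTML")
            return result
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Giving up on {url} after {max_attempts} attempts: {exc}")
                return None
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} for {url} failed ({exc}); retrying in {delay:.0f}s")
            await asyncio.sleep(delay)
```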

roshan-sinha-dev · Oct 18, 2024

@roshan-sinha-dev Thanks for using Crawl4AI! Would you please share the URL so I can play around with it? Thanks.

unclecode · Oct 19, 2024

Yeah, sure. The URL is https://poulta.com/. Sometimes it returns an empty array.

roshan-sinha-dev · Oct 20, 2024

@roshan-sinha-dev Roshan, just to let you know, our library already does most of what you are doing in your code: it checks for duplicates, it resolves internal links against the base URL, and the result object has a `links` property containing the internal and external URLs found on the page.

Please check the following code. We have also made some changes to optimize this process; they will ship in the new version 0.3.72, so I suggest you update the library tomorrow or the day after tomorrow.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(headless=False) as crawler:
        url = "https://poulta.com/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            verbose=True
        )

        print(f"Successfully crawled: {result.url}")

        # Clean up any links that end with colons
        cleaned_internal = [link.rstrip(':') for link in result.links['internal']]
        cleaned_external = [link.rstrip(':') for link in result.links['external']]

        print("Internal links:", cleaned_internal[:5])  # First 5 internal links
        print("External links:", cleaned_external[:5])  # First 5 external links

    print("Done")

asyncio.run(main())
```

unclecode · Oct 24, 2024