
How can we make crawling faster? It is slow for dynamically rendered websites

Open · roshan-sinha-dev opened this issue 1 year ago · 2 comments

I am trying to crawl links from websites, but it either returns empty results or takes too long to retrieve the links. How can I implement a strategy to run it faster, stop redundant processes to save time, or add a retry mechanism to make it more robust?

```python
# Inside an async method of my crawler class; requires:
#   import random
#   from bs4 import BeautifulSoup
#   from urllib.parse import urlparse, urljoin
try:
    result = await crawler.arun(
        url=url,
        bypass_cache=True,
        verbose=True,
        user_agent=random.choice(self.user_agents),
    )

    if hasattr(result, 'error_message') and result.error_message:
        print(f"Error encountered while crawling {url}: {result.error_message}")
        return []

    print(f"Successfully crawled: {result.url}")
    soup = BeautifulSoup(result.html, self.parser)
    links = set()
    base_netloc = urlparse(url).netloc

    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Remove trailing colon from href if present
        if href.endswith(':'):
            href = href.rstrip(':')

        # Resolve root-relative links against the page URL
        if href.startswith('/'):
            full_url = urljoin(url, href)
            links.add(full_url)
        else:
            # Keep only absolute links on the same domain
            href_netloc = urlparse(href).netloc
            if href_netloc == base_netloc or href.startswith(url):
                links.add(href)

    filtered_links = list(links)
    return filtered_links
except Exception as e:
    print(f"Failed to crawl {url}: {e}")
    return []
```
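
For the retry part, a minimal sketch of what a retry mechanism around `crawler.arun` could look like, with exponential backoff. The helper name `crawl_with_retry`, the attempt count, and the delays are placeholders, not part of Crawl4AI's API:

```python
import asyncio
import random

async def crawl_with_retry(crawler, url, user_agents, max_attempts=3, base_delay=2.0):
    """Hypothetical helper: retry crawler.arun() a few times with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await crawler.arun(
                url=url,
                bypass_cache=True,
                verbose=True,
                user_agent=random.choice(user_agents),
            )
            # Treat an explicit error_message or empty HTML as a failed attempt
            if getattr(result, 'error_message', None) or not result.html:
                raise RuntimeError(result.error_message or "empty HTML")
            return result
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Giving up on {url} after {max_attempts} attempts: {exc}")
                return None
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} for {url} failed ({exc}); retrying in {delay:.0f}s")
            await asyncio.sleep(delay)
```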

roshan-sinha-dev · Oct 18, 2024

@roshan-sinha-dev Thanks for using Crawl4AI! Would you please share the URL so I can play around with it? Thanks.

unclecode · Oct 19, 2024

Yeah, sure. The URL is https://poulta.com/. Sometimes it returns an empty array.

roshan-sinha-dev · Oct 20, 2024

@roshan-sinha-dev Roshan, just to let you know, our library already does most of what you are doing in your code: it checks for duplicates, it resolves internal links against the base URL, and the result object has a `links` property containing the internal and external URLs found on the page.

Please check the following code. We have also made some changes to optimize this process; they will ship in the new version 0.3.72, so I suggest you update the library tomorrow or the day after tomorrow.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(headless=False) as crawler:
        url = "https://poulta.com/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            verbose=True
        )

        print(f"Successfully crawled: {result.url}")

        # Clean up any links that end with colons
        cleaned_internal = [link.rstrip(':') for link in result.links['internal']]
        cleaned_external = [link.rstrip(':') for link in result.links['external']]

        print("Internal links:", cleaned_internal[:5])  # First 5 internal links
        print("External links:", cleaned_external[:5])  # First 5 external links

    print("Done")

asyncio.run(main())
```

unclecode · Oct 24, 2024