crawl4ai
How can we make crawling faster when it gets slow for dynamically rendered websites?
I am trying to crawl links from websites, but it is either returning empty results or taking too long to retrieve the links. How can I implement a strategy to run it faster, stop redundant processes to save time, or add a retry mechanism to make it foolproof?
```python
# (fragment from an async method of my crawler class; imports shown for context)
import random
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

try:
    result = await crawler.arun(
        url=url,
        bypass_cache=True,
        verbose=True,
        user_agent=random.choice(self.user_agents),
    )
    if hasattr(result, 'error_message') and result.error_message:
        print(f"Error encountered while crawling {url}: {result.error_message}")
        return []
    print(f"Successfully crawled: {result.url}")
    soup = BeautifulSoup(result.html, self.parser)
    links = set()
    base_netloc = urlparse(url).netloc
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Remove trailing colon from href if present
        if href.endswith(':'):
            href = href.rstrip(':')
        if href.startswith('/'):
            # Relative link: resolve against the page URL
            full_url = urljoin(url, href)
            links.add(full_url)
        else:
            # Absolute link: keep it only if it belongs to the same site
            href_netloc = urlparse(href).netloc
            if href_netloc == base_netloc or href.startswith(url):
                links.add(href)
    filtered_links = list(links)
    return filtered_links
except Exception as e:
    print(f"Exception while crawling {url}: {e}")
    return []
```
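For the retry part of the question, one option is to wrap the `arun` call in a small helper that retries with exponential backoff when the crawl errors out or comes back empty. This is only a rough sketch, not a Crawl4ai feature; the `crawl_with_retry` name and the `max_retries` / `base_delay` parameters are made up for illustration.

```python
import asyncio


async def crawl_with_retry(crawler, url, max_retries=3, base_delay=1.0, **kwargs):
    """Retry a single crawl with exponential backoff (hypothetical helper, not part of Crawl4ai)."""
    for attempt in range(1, max_retries + 1):
        try:
            result = await crawler.arun(url=url, **kwargs)
            # Treat an explicit error message or empty HTML as a failed attempt
            if getattr(result, 'error_message', None) or not result.html:
                raise RuntimeError(result.error_message or "empty response")
            return result
        except Exception as exc:
            if attempt == max_retries:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} for {url} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```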
@roshan-sinha-dev Thx for using Crawl4ai! Would you please share the URL so I can play around with it? Thx
Yeah sure, the URL is https://poulta.com/. Sometimes it returns an empty array.
@roshan-sinha-dev Roshan, just to let you know, our library already does most of what you tried to do in your code: we check for duplicates, we add the base URL to internal links, and the result object has a property called links that contains the internal and external URLs from the page.
Please check the following code. We have also made some changes to optimize this process, which will ship in the new version 0.3.72, so I suggest you update your library tomorrow or the day after.
```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler(headless=False) as crawler:
        url = "https://poulta.com/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            verbose=True
        )
        print(f"Successfully crawled: {result.url}")

        # Clean up any links that end with colons
        cleaned_internal = [link.rstrip(':') if link.endswith(':') else link for link in result.links['internal']]
        cleaned_external = [link.rstrip(':') if link.endswith(':') else link for link in result.links['external']]

        print("Internal links:", cleaned_internal[:5])  # First 5 internal links
        print("External links:", cleaned_external[:5])  # First 5 external links
        print("Done")


if __name__ == "__main__":
    asyncio.run(main())
```