crawl4ai
Crawling error
Hey, and thanks for this nice package! I'm having the following issue: some websites are randomly not scraped, while others get scraped correctly. Which websites fail varies randomly from run to run. For the failed websites I get the following error: [ERROR] 🚫 Failed to crawl https://random-website.com, error: 'NoneType' object has no attribute 'get'.
I save the .html files after scraping, and the websites affected by this bug end up as an HTML file containing only ['', None].
I tried updating all packages and also set up a new conda environment, but it didn't fix the issue. I am using WebCrawler, not the AsyncWebCrawler.
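(For anyone hitting this: a minimal guard, sketched below, avoids saving those ['', None] placeholder files by checking result.success before writing. It assumes the synchronous WebCrawler API with warmup()/run() and a result object carrying .success, .html, and .error_message, as seen in the result dumps later in this thread.)

# Sketch only: skip failed results instead of saving placeholder HTML.
# Assumes the synchronous WebCrawler API (warmup/run) and a result object
# with .success / .html / .error_message, matching the dumps in this thread.
from crawl4ai import WebCrawler

urls = ["https://de.wikipedia.org/wiki/Aral", "https://www.aldi-nord.de/"]

crawler = WebCrawler()
crawler.warmup()

for url in urls:
    result = crawler.run(url=url)
    if not result.success or not result.html:
        print(f"Skipping {url}: {result.error_message}")
        continue
    # Derive a filename from the URL and save only successful crawls
    filename = url.split("//", 1)[1].strip("/").replace("/", "_") + ".html"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(result.html)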
@BlackChila Would you please share a link to a page where you experienced this?
Hey unclecode, thanks for your answer! I experience this on multiple pages, and I can access the pages manually, so my IP is not blocked. Whether crawl4ai can access a page also varies randomly: on some runs it succeeds, on others it doesn't, so I don't think it's an issue with the pages themselves. It affects e.g. https://de.wikipedia.org/wiki/Aral or https://www.aldi-nord.de/, but not every time. In total it affects 20-30% of my crawled websites.
@BlackChila Thx for sharing. Please do us a favor and try the asynchronous method; let's see if you get something similar with it or not. If you still face issues, then we'll start stress testing by crawling a set of links and websites to see when such things happen. However, let's just try it asynchronously first and let me know. Thank you.
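(For reference, a minimal switch to the asynchronous API would look roughly like the sketch below, using one of the URLs reported above; arun and bypass_cache match the snippets posted later in this thread.)

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Same URL as in the report above; bypass_cache forces a fresh fetch.
        result = await crawler.arun(url="https://de.wikipedia.org/wiki/Aral", bypass_cache=True)
        print(result.success, result.error_message)

asyncio.run(main())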
@unclecode Getting the same NoneType error. Here are the logs:
INFO: Started server process [14039]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
Delaying for 10 seconds...
Resuming...
[LOG] 🕸️ Crawling https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ successfully!
[LOG] 🚀 Crawling done for https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, success: True, time taken: 0.74 seconds
[ERROR] 🚫 Failed to crawl https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str
url='https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str' session_id=None responser_headers=None status_code=None
INFO: 127.0.0.1:62431 - "GET / HTTP/1.1" 200 OK
from fastapi import FastAPI, HTTPException
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from dotenv import load_dotenv
import asyncio
import json

load_dotenv()
app = FastAPI()

@app.get("/")
async def crawl(url: str = "https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/"):
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            # Introduce delay
            print("Delaying for 10 seconds...")
            await asyncio.sleep(10)
            print("Resuming...")
            # Extract data
            result = await crawler.arun(url=url, bypass_cache=True)
            # Return data
            print(result)
            return result.dict()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Thanks @RhonnieAl for posting the same issue with the asynchronous method here!
Getting the same error when trying to crawl a Notion site:
[ERROR] 🚫 Failed to crawl, error: Failed to extract content from the website, error: can only concatenate str (not "NoneType") to str
+1. I am encountering this same issue.
@unclecode Thanks for the amazing library. I'm having a bit of trouble understanding the error I'm getting: the library fails to crawl a news website given a specific news topic. I have used the Gemini API for authentication. Here is the error output:
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.51 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 0
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 1
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 2
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 4.68 seconds. Number of related items extracted: 0
[]
I have a few questions in mind:
- Why does this error occur with every news site? (I am building a project that extracts news topics from given news sites.)
- On which sites has it worked previously? I tried the example from the library docs, "https://www.nbcnews.com/business", and the message was the same.
I have the same error using WebCrawler.
The problem is in the file utils.py:

File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\crawl4ai\utils.py", line 694, in get_content_of_website_optimized
    src = img.get('src', '')
          ^^^^^^^^^^^^^^^^^^
File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\bs4\element.py", line 1547, in get
    return self.attrs.get(key, default)
AttributeError: 'NoneType' object has no attribute 'get'
So I patched it with a try/except:

try:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
except Exception as e:
    pass
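(A narrower variant of that workaround would skip only the offending img tag instead of abandoning the whole loop on the first bad element. The sketch below is self-contained and illustrative only: imgs and base64_pattern stand in for the corresponding names in utils.py, and the attrs-is-None guard targets exactly the condition the traceback shows.)

import re
from bs4 import BeautifulSoup

# Stand-ins for the names used inside crawl4ai's utils.py (assumptions):
base64_pattern = re.compile(r"data:image/[^;]+;base64,")
soup = BeautifulSoup('<img src="data:image/png;base64,AAAA"><p>text</p><img>', "html.parser")
imgs = soup.find_all("img")

for img in imgs:
    # bs4 raises AttributeError on .get() when a tag's .attrs is None,
    # which is the failure in the traceback above; skip just that tag.
    if img.attrs is None:
        continue
    src = img.get("src", "")
    if src and base64_pattern.match(src):
        # Replace base64 data with empty string (as in the original patch)
        img["src"] = base64_pattern.sub("", src)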
@RhonnieAl Sorry for my delayed response. The links you are trying to crawl have very strong bot detection, which is why the browser won't navigate to the page. As for the error message, we made some adjustments in the new version, 0.3.7, so it is a bit more informative; I expect to release it within a day or two, and after updating you'll get a better message. One thing you can always do is set headless to false so you can see what's happening in the browser and get an understanding of what's going on. Here's a screenshot of what's happening. FYI, you can apply scripts and techniques using the hooks in our library before navigating to a page to work around some of these issues. However, if you use the new version, the error message contains some useful information for you to try on different websites. Anyway, hopefully this is helpful for you.
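(A minimal sketch of that debugging suggestion, assuming headless is accepted as a constructor keyword, as a later snippet in this thread passes headless=True the same way:)

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # A visible browser window makes bot checks and consent walls obvious.
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            bypass_cache=True,
        )
        print(result.success, result.status_code, result.error_message)

asyncio.run(main())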
@DhrubojyotiDey Hi, would you please share your code snippet with me so I can check it for you?
@mobyds This is interesting, would you please share the URL that caused this issue? Thx
https://chantepie.fr/
@mobyds It works for me; perhaps you can share your code as well as your system specs.
import asyncio
import base64
import os

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://chantepie.fr/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )
        # Save screenshot to file (__data is an output directory defined elsewhere)
        with open(os.path.join(__data, "chantepie.png"), "wb") as f:
            f.write(base64.b64decode(result.screenshot))
        print(result.markdown)

asyncio.run(main())
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://chantepie.fr/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://chantepie.fr/ successfully!
[LOG] 🚀 Crawling done for https://chantepie.fr/, success: True, time taken: 5.08 seconds
[LOG] 🚀 Content extracted for https://chantepie.fr/, success: True, time taken: 0.29 seconds
[LOG] 🔥 Extracting semantic blocks for https://chantepie.fr/, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://chantepie.fr/, time taken: 0.32 seconds.
It was with WebCrawler, not with AsyncWebCrawler.
@unclecode Hi
Below is the code snippet I used for extraction. I find the issue most common with hindustantimes and NDTV. The news block is not getting extracted completely.
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from google.colab import userdata  # userdata comes from Colab in this snippet

url1 = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
url2 = "https://www.hindustantimes.com/world-news/israelhamas-war-live-updates-palestine-israel-latest-news-hamas-militant-group-attack-101696723677129.html"
urls = [url1, url2]

related_content = []
os.environ['GEMINI_API_KEY'] = userdata.get('my_key')

async def process_urls():
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            # Perform extraction for each URL
            result = await crawler.arun(
                url=url,  # crawl each URL in the list
                bypass_cache=True,  # bypass_cache is an arun() argument, not a strategy argument
                extraction_strategy=LLMExtractionStrategy(
                    provider="gemini/gemini-pro",
                    api_token=os.environ['GEMINI_API_KEY'],
                    instruction="Extract only content related to Israel and hamas war and extract URL if available"
                ),
            )
            if result.extracted_content is not None:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    related_content.extend(extracted_data)  # Append extracted data for each URL
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON for {url}: {e}")
                    print(f"Raw extracted content: {result.extracted_content}")  # Debug raw content
            else:
                print(f"No content extracted by the LLM for {url}")

# Execute the asynchronous function
asyncio.run(process_urls())

print(f"Number of related items extracted: {len(related_content)}")
combined_data = [item.get('content') for item in related_content]
print(combined_data)
@mobyds Oh, I see. Yes, I think it's better to switch to async because I very soon plan to remove the synchronous version. Additionally, I want to cut the dependency on Selenium and stick with Playwright. So, anyway, if there are any other issues, don't hesitate to reach out. Thank you for trying our library.
@DhrubojyotiDey I followed the first link you shared here. The page is actually very long. Let me explain how the LLM extraction strategy works. By default, there is a chunking stage: when you pass the content in, it is broken into smaller chunks, and each chunk is sent to the language model in parallel. This design suits smaller language models, which may not have a long context window, so we can make the most of them this way. If you're using a model that supports a long context window, such as Gemini in your code, the best way to handle it is either to turn this feature off or to use a very long chunk length. Here's an example of both approaches; in my case, they work perfectly. I hope this is helpful for you.
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        apply_chunking=False,  # approach 1: turn chunking off entirely
        # chunk_token_threshold=2 ** 14,  # approach 2: keep chunking, 16k tokens per chunk
        instruction="""Extract only content related to Israel and hamas war and extract URL if available"""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        extracted_content = json.loads(result.extracted_content)
        print(extracted_content)
        print("Done")

asyncio.run(main())
OK, and thanks a lot for this very useful lib
You're welcome @mobyds