
Crawling error

Open BlackChila opened this issue 1 year ago • 19 comments

Hey, and thanks for this nice package! I'm having the following issue: some websites are randomly not scraped, while others get scraped correctly. Which websites are scraped varies randomly on each run of the code. For the websites that are not scraped I get the following error: [ERROR] 🚫 Failed to crawl https://random-website.com, error: 'NoneType' object has no attribute 'get'.

I save the .html files after scraping, and the websites affected by this bug end up in an html file containing just ['', None].

I tried updating all packages and also set up a new conda environment, but that didn't fix the issue. I am using WebCrawler, not AsyncWebCrawler.
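For reference, this is roughly how I'm calling it (a minimal sketch of the synchronous API, warmup()/run() as shown in the project README; the URL is just one of my examples):

import random  # not needed for the crawl; only here if you want to shuffle test URLs
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()  # prepare the crawler once before the first run

# Intermittently fails with: 'NoneType' object has no attribute 'get'
result = crawler.run(url="https://de.wikipedia.org/wiki/Aral")
print(result.success, len(result.html or ""))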

BlackChila avatar Sep 27 '24 13:09 BlackChila

@BlackChila Would you please share the link of the page you experienced this?

unclecode avatar Sep 28 '24 00:09 unclecode

Hey unclecode, thanks for your answer! I experience this on multiple pages, and I can access the pages manually, so my IP is not blocked. Whether crawl4ai can access a page also varies randomly: on some runs it succeeds, on others it doesn't, so I don't think it's an issue with the page itself. It affects e.g. https://de.wikipedia.org/wiki/Aral or https://www.aldi-nord.de/, but not every time. In total it affects 20-30% of my crawled websites.

BlackChila avatar Sep 28 '24 10:09 BlackChila

@BlackChila Thx for sharing. Please do us a favor and try the asynchronous method, and let's see whether you get something similar with it. If you still face issues, we'll start a stress test, crawling a set of links and websites to see when such things happen. But let's try it with the asynchronous version first; let me know how it goes. Thank you.
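Something like this minimal sketch (same AsyncWebCrawler API that comes up later in this thread; the Wikipedia URL is just one of the examples you mentioned):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Same crawl as with the synchronous WebCrawler, via the async API
        result = await crawler.arun(url="https://de.wikipedia.org/wiki/Aral", bypass_cache=True)
        print(result.success, result.error_message)

asyncio.run(main())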

unclecode avatar Sep 28 '24 15:09 unclecode

@unclecode Getting the same NoneType error. Here are the logs:

INFO:     Started server process [14039]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
Delaying for 10 seconds...
Resuming...
[LOG] 🕸️ Crawling https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/ successfully!
[LOG] 🚀 Crawling done for https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, success: True, time taken: 0.74 seconds
[ERROR] 🚫 Failed to crawl https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str
url='https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to extract content from the website: https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/, error: can only concatenate str (not "NoneType") to str' session_id=None responser_headers=None status_code=None
INFO:     127.0.0.1:62431 - "GET / HTTP/1.1" 200 OK

from fastapi import FastAPI, HTTPException
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from dotenv import load_dotenv
import asyncio
import json


load_dotenv() 

app = FastAPI()

@app.get("/")
async def crawl(url: str = "https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/"):
    try:
        async with AsyncWebCrawler(verbose=True) as crawler:
            
            # Introduce Delay
            print("Delaying for 10 seconds...")
            await asyncio.sleep(10)
            print("Resuming...")

            # Extract data
            result = await crawler.arun(url=url, bypass_cache=True)
            
            # Return data
            print(result)
            return result.dict()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)



RhonnieAl avatar Sep 28 '24 18:09 RhonnieAl

Thanks @RhonnieAl for posting the same issue with the asynchronous method here!

BlackChila avatar Sep 30 '24 08:09 BlackChila

Getting the same [ERROR] 🚫 Failed to crawl error: Failed to extract content from the website: error: can only concatenate str (not "NoneType") to str

When trying to crawl a Notion site

ojaros avatar Sep 30 '24 19:09 ojaros

+1. I am encountering this same issue.

xansrnitu avatar Oct 02 '24 14:10 xansrnitu

@unclecode Thanks for the amazing library. I'm having a bit of trouble understanding the error I'm getting: the library fails to crawl a news website for a given news topic. I used the Gemini API for authentication. The output is below:

[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.51 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 0
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 1
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 2
[LOG] Call LLM for https://www.nbcnews.com/business - block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 3
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 0
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 1
[LOG] Extracted 0 blocks from URL: https://www.nbcnews.com/business block index: 2
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 4.68 seconds. Number of related items extracted: 0
[]

I have a few questions in mind.

  1. Why does this error occur with every news site? (I am building a project that extracts news topics from given news sites.)
  2. Which are the sites where it has worked previously? I tried the example from the library's site, "https://www.nbcnews.com/business", and the message was the same.

DhrubojyotiDey avatar Oct 13 '24 16:10 DhrubojyotiDey

I have the same error with the use of WebCrawler

The problem is in the file utils.py:

  File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\crawl4ai\utils.py", line 694, in get_content_of_website_optimized
    src = img.get('src', '')
          ^^^^^^^^^^^^^^^^^^
  File "C:\Dev\GIT\civic-crawler\.venv\Lib\site-packages\bs4\element.py", line 1547, in get
    return self.attrs.get(key, default)
AttributeError: 'NoneType' object has no attribute 'get'

So I patched it with a try/except:

    try:
        for img in imgs:
            src = img.get('src', '')
            if base64_pattern.match(src):
                # Replace base64 data with empty string
                img['src'] = base64_pattern.sub('', src)
    except Exception as e:
        pass
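A narrower variant would skip only the broken node instead of abandoning the whole loop on the first error (a sketch against the same utils.py code; imgs and base64_pattern come from the surrounding function):

    for img in imgs:
        # The traceback shows img.attrs can be None here, so guard each node
        if getattr(img, 'attrs', None) is None:
            continue
        src = img.get('src', '')
        if src and base64_pattern.match(src):
            # Replace base64 data with an empty string
            img['src'] = base64_pattern.sub('', src)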

mobyds avatar Oct 17 '24 09:10 mobyds

(screenshot attachment)

@RhonnieAl Sorry for my delayed response. The links you are trying to crawl have very strong bot detection, which is why the crawler won't navigate to the page. As for the error message, we've made some adjustments in the new version, 0.3.7, so it is a bit more informative; I think I'll release it within a day or two, and after updating you'll get a better message. One thing you can always do is set headless to False, so you can see what's happening and get an understanding of what's going on; the screenshot above shows what I get. FYI, you can also apply scripts and techniques through the hooks in our library before navigating to a page, to work around some of these issues. And with the new version, the error message contains useful information for you to try on different websites. Anyway, hopefully this is helpful for you.
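For example, a minimal sketch with a visible browser window (same API as elsewhere in this thread; the Reuters URL is from your logs):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser so you can watch the navigation,
    # e.g. see the bot-detection page appear instead of the article
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/",
            bypass_cache=True,
        )
        print(result.success, result.error_message)

asyncio.run(main())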

unclecode avatar Oct 17 '24 13:10 unclecode

(Quoting @DhrubojyotiDey's comment and questions above.)

Hi, would you please share your code snippet with me so I can check it for you?

unclecode avatar Oct 17 '24 13:10 unclecode

(Quoting @mobyds's comment and patch above.)

@mobyds This is interesting. Would you please share the URL that caused this issue? Thx

unclecode avatar Oct 17 '24 13:10 unclecode

https://chantepie.fr/

mobyds avatar Oct 17 '24 13:10 mobyds

@mobyds It works for me; perhaps you can share your code as well as your system specs.

(screenshot attachment)

import asyncio
import base64
import os

from crawl4ai import AsyncWebCrawler

__data = "output"  # output directory (defined elsewhere in my script)

async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://chantepie.fr/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )

        # Save screenshot to file
        with open(os.path.join(__data, "chantepie.png"), "wb") as f:
            f.write(base64.b64decode(result.screenshot))

        print(result.markdown)

asyncio.run(main())
[LOG] 🌤️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://chantepie.fr/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://chantepie.fr/ successfully!
[LOG] 🚀 Crawling done for https://chantepie.fr/, success: True, time taken: 5.08 seconds
[LOG] 🚀 Content extracted for https://chantepie.fr/, success: True, time taken: 0.29 seconds
[LOG] 🔥 Extracting semantic blocks for https://chantepie.fr/, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://chantepie.fr/, time taken: 0.32 seconds.

unclecode avatar Oct 17 '24 13:10 unclecode

It was with WebCrawler, not with AsyncWebCrawler

mobyds avatar Oct 17 '24 15:10 mobyds

@unclecode Hi

(Quoting the exchange above: my earlier comment and questions, and your request for a code snippet.)

Below is the code snippet I used for extraction. I find the issue most common with Hindustan Times and NDTV: the news block is not getting extracted completely.

import os, json, asyncio
from google.colab import userdata  # running in Google Colab
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

url1 = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
url2 = "https://www.hindustantimes.com/world-news/israelhamas-war-live-updates-palestine-israel-latest-news-hamas-militant-group-attack-101696723677129.html"
urls = [url1, url2]

related_content = []
os.environ['GEMINI_API_KEY'] = userdata.get('my_key')

async def process_urls():
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            # Perform extraction for each URL
            result = await crawler.arun(
                url=url,
                extraction_strategy=LLMExtractionStrategy(
                    provider="gemini/gemini-pro",
                    bypass_cache=True,
                    api_token=os.environ['GEMINI_API_KEY'],
                    instruction="Extract only content related to Israel and hamas war and extract URL if available"
                ),
            )

            if result.extracted_content is not None:
                try:
                    extracted_data = json.loads(result.extracted_content)
                    related_content.extend(extracted_data)  # Append extracted data for each URL
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON for {url}: {e}")
                    print(f"Raw extracted content: {result.extracted_content}")  # Debug raw content
            else:
                print(f"No content extracted by the LLM for {url}")

# Execute the asynchronous function
asyncio.run(process_urls())

print(f"Number of related items extracted: {len(related_content)}")
combined_data = [item.get('content') for item in related_content]
print(combined_data)

DhrubojyotiDey avatar Oct 18 '24 03:10 DhrubojyotiDey

It was with WebCrawler, not with AsyncWebCrawler

@mobyds Oh, I see. Yes, I think it's better to switch to async, because I plan to remove the synchronous version very soon. Additionally, I want to cut the dependency on Selenium and stick with Playwright. Anyway, if there are any other issues, don't hesitate to reach out. Thank you for trying our library.

unclecode avatar Oct 18 '24 08:10 unclecode

@DhrubojyotiDey I followed the first link you shared; the page is actually very long. Let me explain how the LLM extraction strategy works. By default there is a chunking stage: when you pass the content in, it is broken into smaller chunks, and each chunk is sent to the language model in parallel. This is designed to suit small language models, which may not have a long context window, so we can make the most of them this way. If you're using a model that supports a long context window, such as the Gemini model in your code, the best way to handle it is to either turn this feature off or use a very large chunk length. Here's an example of both approaches (the second one is the commented-out line). In my case they work perfectly. I hope this is helpful for you.

import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        apply_chunking=False,              # Approach 1: turn chunking off entirely
        # chunk_token_threshold=2 ** 14,   # Approach 2: keep chunking, with 16k-token chunks
        instruction="""Extract only content related to Israel and hamas war and extract URL if available"""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        extracted_content = json.loads(result.extracted_content)
        print(extracted_content)

    print("Done")

asyncio.run(main())

unclecode avatar Oct 18 '24 10:10 unclecode

(Quoting the exchange above about switching from WebCrawler to AsyncWebCrawler.)

OK, and thanks a lot for this very useful lib

mobyds avatar Oct 21 '24 10:10 mobyds

You're welcome @mobyds

unclecode avatar Oct 24 '24 12:10 unclecode