
Crawl4AI Error: This page is not fully supported.

Open · Olliejp opened this issue 1 year ago · 4 comments

I was wondering if you could help with a recurrent issue for which I can find no repeatable solution. Take this URL as an example: https://www.newcleo.com/. I have tried many combinations of wait_for and various js_code strategies, but cannot access the actual page. I don't see any significant anti-bot measures in Chrome, but I do notice that for a split second a .gif animation pops up before the page renders. If I use crawl4ai without a delay, I can basically scrape this URL; if I add a delay, I see the following error.

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(always_by_pass_cache=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.newcleo.com/",
            magic=True,
            headless=True,
            # delay_before_return_html=5.0
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Crawl4AI Error: This page is not fully supported. Possible reasons:
1. The page may have restrictions that prevent crawling.
2. The page might not be fully loaded.
Suggestions:
- Try calling the crawl function with these parameters: magic=True
- Set headless=False to visualize what's happening on the page.
If the issue persists, please check the page's structure and any potential anti-crawling measures.

Thanks for any help!

Olliejp · Nov 20 '24

@Olliejp Please try the following code:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(
            headless=False,  # Set to False to see what is happening
            verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://www.newcleo.com/",
            cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            wait_for="css:.background-video"
        )
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())

Pay attention to this code. Typically, when I face an issue, the first thing I do is set headless to False so I can watch the browser and understand what is going on. In this case you will notice that in headless mode the initial GIF animation takes a little more time, and without waiting we get nothing but the content of that text animation. That is why I use wait_for: I wait for a specific class, background-video, which is the important one here. There is also another way: you can pass a delay before the HTML is returned and simply wait that long instead. Let me know if you have any other questions.
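
A minimal sketch of that delay-based variant, assuming the same AsyncWebCrawler API and the delay_before_return_html parameter used later in this thread:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.newcleo.com/",
            cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            # Instead of waiting for a site-specific CSS class, wait a fixed
            # number of seconds before the final HTML is captured.
            delay_before_return_html=5.0,
        )
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())

This trades precision for generality: it does not depend on any site-specific selector, at the cost of a fixed wait on every page.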

unclecode · Nov 20 '24

Thank you very much @unclecode!

Your code works for me. Since, when scraping many sites, I wouldn't know in advance which element to wait for, I just added a delay of 5 seconds instead, and that works as well. Interestingly, however, the root issue actually seems to be the use of magic=True. For example, if I run my code like this:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
            headless=False,  # Set to False to see what is happening
            verbose=True,
            always_by_pass_cache=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.newcleo.com/",
            magic=True,
            # cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            # wait_for="css:.background-video",
            delay_before_return_html=5.0
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

I get this error and see just a blank screen in the browser (I am using version 0.3.731):

Crawl4AI Error: This page is not fully supported. Possible reasons:
1. The page may have restrictions that prevent crawling.
2. The page might not be fully loaded.
Suggestions:
- Try calling the crawl function with these parameters: magic=True
- Set headless=False to visualize what's happening on the page.
If the issue persists, please check the page's structure and any potential anti-crawling measures.

However, if I run this (the only change is magic=False), I clearly see the page content in the browser and the scrape works fine:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
            headless=False,  # Set to False to see what is happening
            verbose=True,
            always_by_pass_cache=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.newcleo.com/",
            magic=False,
            # cache_mode=CacheMode.BYPASS,
            remove_overlay_elements=True,
            # wait_for="css:.background-video",
            delay_before_return_html=5.0
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Olliejp · Nov 20 '24

Just FYI, I see exactly the same behavior with this URL:

https://nanonuclearenergy.com/

With magic=True I get this error:

[ERROR] 🚫 arun(): Failed to crawl https://nanonuclearenergy.com/, error: [ERROR] 🚫 crawl(): Failed to crawl https://nanonuclearenergy.com/: Page.evaluate: Execution context was destroyed, most likely because of a navigation

and with magic=False, I get no error at all.

Olliejp · Nov 20 '24

@Olliejp You are right that you can use a delay before returning the final HTML; it is absolutely a good way to avoid depending on site-specific HTML elements. I just released a new version, and with it, when I set magic=True, I don't get the error you are getting, so I assume recent changes fixed this issue, whatever it was. Please check that version and let me know.

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(
            headless=False,  # Set to False to see what is happening
            verbose=True,
    ) as crawler:
        result = await crawler.arun(
            # url="https://www.newcleo.com/",
            url="https://nanonuclearenergy.com/",
            cache_mode=CacheMode.BYPASS,
            magic=True,
            remove_overlay_elements=True,
            # wait_for="css:.background-video",
            delay_before_return_html=5.0
        )
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())

Output:

[INIT].... → Crawl4AI 0.3.74
[FETCH]... ↓ https://nanonuclearenergy.com/... | Status: True | Time: 7.63s
[SCRAPE].. ◆ Processed https://nanonuclearenergy.com/... | Time: 78ms
[COMPLETE] ● https://nanonuclearenergy.com/... | Status: True | Total: 7.73s
13303

unclecode · Nov 23 '24