
NotImplementedError in asyncio.create_subprocess_exec on Windows

Open KaifAhmad1 opened this issue 1 year ago • 10 comments

Issues with crawl4ai Library

1. NotImplementedError in asyncio.create_subprocess_exec on Windows

Description

The crawl4ai library uses asyncio.create_subprocess_exec to start the Playwright browser, which is not supported on Windows. This results in a NotImplementedError, preventing the AsyncWebCrawler from functioning correctly on Windows platforms.

Steps to Reproduce

  1. Run the code on a Windows platform.
  2. Use the AsyncWebCrawler to crawl a URL.
  3. Observe the NotImplementedError when starting the Playwright browser.

Expected Behavior

The AsyncWebCrawler should start the Playwright browser and crawl the URL without raising a NotImplementedError.

Actual Behavior

The code raises a NotImplementedError when attempting to start the Playwright browser using asyncio.create_subprocess_exec.
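For context, the error depends on which asyncio event loop class is active: on Windows, the SelectorEventLoop does not implement subprocess transports, while the ProactorEventLoop does. A minimal diagnostic sketch (not from the original report) to check which loop a script runs on:

```python
import asyncio

async def which_loop() -> str:
    # Report the class of the running event loop. On Windows, spawning
    # subprocesses (as Playwright does) requires a ProactorEventLoop;
    # on a SelectorEventLoop, subprocess_exec raises NotImplementedError.
    return type(asyncio.get_running_loop()).__name__

loop_name = asyncio.run(which_loop())
print(loop_name)  # e.g. "ProactorEventLoop" on Windows
```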

KaifAhmad1 avatar Nov 21 '24 07:11 KaifAhmad1

@KaifAhmad1 Thanks for sharing this. I will check on a Windows machine and update you. In the meantime, if you share the code snippet that you use, that would help: I can check whether anything is missed in the way you call Crawl4AI. Please share that as well.

unclecode avatar Nov 23 '24 10:11 unclecode

@unclecode Any updates on this issue? I'm facing a similar issue myself, and the behaviour is slightly confusing to me. When I just use it as a script to test, it works fine and crawls as expected. But when I tried using it with FastAPI in my POST route for my use case, it started throwing the same NotImplementedError.

I'm using a Windows 10 machine with crawl4ai version 0.3.71. (The latest version does not work even for the script; 0.3.71 looked more stable, as the latest version was throwing some Playwright-related errors, and through GitHub issues I found this version seems stable.)

Aniket1026 avatar Dec 08 '24 12:12 Aniket1026

@Aniket1026 Hello, thank you for using Crawl4AI! Could you please share your exact code or snippet with me?

When I tried it on my Windows machine, I didn’t encounter this issue, so I’d like to investigate further. Please also share the specs of your operating system, Python version, and any other details you think might help. Once I have that, I’ll try to replicate the error on my end.

unclecode avatar Dec 08 '24 12:12 unclecode

This is what the script I used to test looks like, and it works fine with crawl4ai==0.3.71. With the latest version, 0.4.0, it does not even work. You can test this script using the URL of any Amazon product. Command to run: python filename

import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import ExtractionStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

product_schema = {
    "name": "Product Details",
    "baseSelector": "div#centerCol",
    "fields": [
        {
            "name": "title",
            "selector": "span#productTitle",
            "type": "text",
        }
    ],
}

class UnableToCrawlProductError(Exception):
    pass

async def extract(url: str, extraction_strategy: ExtractionStrategy) -> list[dict]:
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
            verbose=False,
        )
        assert result.success, "Failed to crawl the page"

        return json.loads(result.extracted_content)

url = input("Enter the product URL: ")

try:
    details: list = asyncio.run(
        extract(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(schema=product_schema),
        )
    )

    print("My product Detail : ", details)
except UnableToCrawlProductError as e:
    print(e)
    exit(1)

Then, for my use case, I tried using it with FastAPI in my POST route, which looks like the code below. You can start the uvicorn server and send a POST request with raw JSON, using "url" as the key and any Amazon product URL as the value. Command to start the server: uvicorn filename:app --reload

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel, HttpUrl
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.extraction_strategy import ExtractionStrategy
import json

product_schema = {
    "name": "Product Details",
    "baseSelector": "div#centerCol",
    "fields": [
        {
            "name": "title",
            "selector": "span#productTitle",
            "type": "text",
        }
    ],
}


app = FastAPI()

class ProductUrl(BaseModel):
    url: HttpUrl


async def extract(url: str, extraction_strategy: ExtractionStrategy) -> list[dict]:
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
            verbose=False,
        )
        assert result.success, "Failed to crawl the page"

        return json.loads(result.extracted_content)

@app.get("/")
async def root():
    return {"message": "Hello test"}

@app.post("/get-details")
async def compare_product(product_url: ProductUrl):
    url = product_url.url
    try:
        product_info = await extract(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(product_schema),
        )


        return {"product_info": product_info}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
    

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000, reload=True)

With this code, I get the error below when I hit the POST endpoint via Postman:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<Connection.run() done, defined at C:\....\product-compare\backend\venv\lib\site-packages\playwright\_impl\_connection.py:265> exception=NotImplementedError()>
Traceback (most recent call last):
  File "C:\.....\backend\venv\lib\site-packages\playwright\_impl\_connection.py", line 272, in run
    await self._transport.connect()
  File "C:\.....\backend\venv\lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
    raise exc
  File "C:\...\backend\venv\lib\site-packages\playwright\_impl\_transport.py", line 120, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\.s\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\...\Python\Python310\lib\asyncio\base_events.py", line 1667, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\....\Python\Python310\lib\asyncio\base_events.py", line 498, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError

my dependencies looks like this

fastapi==0.115.6
requests==2.32.3
crawl4ai==0.3.71
validators==0.34.0
uvicorn==0.32.1

  • OS: Windows 10
  • Python: 3.10.4
  • pip: 22.0.4

If there's anything more I can provide, or anything I can help with, please let me know. I hope this will help you reproduce the same error.

Aniket1026 avatar Dec 08 '24 13:12 Aniket1026

Hi @Aniket1026, cc @unclecode

I couldn’t reproduce the exact behavior you mentioned, but when following your approach, I encountered the following error:

error: Page.goto: Object of type HttpUrl is not JSON serializable  

This issue seems to be resolved by modifying the code as shown below:

class ProductUrl(BaseModel):  
    url: str  

You can handle URL validations afterward to ensure correctness.
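Such a validation could be done with a small stdlib helper after switching the field to a plain str; `validate_url` below is a hypothetical sketch, not part of crawl4ai or the code in this thread:

```python
from urllib.parse import urlparse

def validate_url(url: str) -> str:
    # Hypothetical helper: accept only absolute http(s) URLs.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid product URL: {url!r}")
    return url

print(validate_url("https://www.amazon.com/dp/0062316095"))
```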

My Environment:

  • OS: macOS Sequoia (15.1.1)

This might be more of a Windows-specific issue, though. If someone using Windows could confirm, that would be helpful! 🙇🏼

Thanks!

hitesh22rana avatar Dec 08 '24 14:12 hitesh22rana

Hi @hitesh22rana, thank you for trying it out and coming up with a suggestion. But even with the suggested changes I'm still encountering the same issue. I believe you have a different OS, and that could be the reason you don't see the same behaviour.

Aniket1026 avatar Dec 08 '24 14:12 Aniket1026

@Aniket1026,

This seems to be related to an issue with uvicorn. I found the following reference that might be helpful: GitHub Issue #964

The error occurs because FastAPI uses uvloop, and asyncio doesn’t automatically recognize this without explicitly setting a policy. There’s a helpful answer that outlines hooks to achieve this: StackOverflow thread

Please try to set the following in your code:

import asyncio
import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

Hope this helps! 🙇🏼

hitesh22rana avatar Dec 08 '24 14:12 hitesh22rana

@Aniket1026 Regarding the first part, your script works well on my side and this is the output:

[INIT].... → Crawl4AI 0.4.1
[FETCH]... ↓ https://www.amazon.com/Sapiens-Humankind-Yuval-Noa... | Status: True | Time: 15.55s
[SCRAPE].. ◆ Processed https://www.amazon.com/Sapiens-Humankind-Yuval-Noa... | Time: 714ms
[EXTRACT]. ■ Completed for https://www.amazon.com/Sapiens-Humankind-Yuval-Noa... | Time: 0.5135177089832723s
[COMPLETE] ● https://www.amazon.com/Sapiens-Humankind-Yuval-Noa... | Status: True | Total: 16.78s
My product Detail :  [{'title': 'Sapiens: A Brief History of Humankind'}]

For the FastAPI server, remember that the url you pass to the arun() function should be a string, and you are passing an HttpUrl. Following is the full code after a few modifications; it works fine on my machine. Please try it and let me know. @hitesh22rana thx for the support.

import os, sys
from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel, HttpUrl
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.extraction_strategy import ExtractionStrategy
import json

product_schema = {
    "name": "Product Details",
    "baseSelector": "div#centerCol",
    "fields": [
        {
            "name": "title",
            "selector": "span#productTitle",
            "type": "text",
        }
    ],
}

class UnableToCrawlProductError(Exception):
    pass

async def extract(url: str, extraction_strategy: ExtractionStrategy) -> list[dict]:
    async with AsyncWebCrawler(verbose=True,headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=extraction_strategy,
            cache_mode=CacheMode.BYPASS,
        )
        assert result.success, "Failed to crawl the page"

        return json.loads(result.extracted_content)

# url = input("Enter the product URL: ")
# url = "https://www.amazon.com/Sapiens-Humankind-Yuval-Noah-Harari/dp/0062316095"

app = FastAPI()

class ProductUrl(BaseModel):
    url: HttpUrl

@app.get("/")
async def root():
    return {"message": "Hello test"}

@app.post("/get-details")
async def compare_product(product_url: ProductUrl):
    url = product_url.url
    try:
        product_info = await extract(
            url=str(url),
            extraction_strategy=JsonCssExtractionStrategy(product_schema),
        )


        return {"product_info": product_info}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
    

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8081)

I tested like this:

$ curl -X POST "http://localhost:8081/get-details" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com/Sapiens-Humankind-Yuval-Noah-Harari/dp/0062316095"}'

In 0.4.1, I am adding a step to raise an error if the url is not a string. I saw a similar issue in another case, so it is better to make it explicit.

unclecode avatar Dec 09 '24 05:12 unclecode

@unclecode Thanks for looking into the issue. I tried your given solution and the same NotImplementedError still occurs. I tried with both crawl4ai 0.4.0 and 0.3.71 (which seemed a bit more stable when testing the script). As @hitesh22rana mentioned, the issue actually lies with uvicorn and how it starts subprocesses on Windows.

But even the fix he provided, while useful on other machines, doesn't work in my case: uvloop neither comes with uvicorn nor can I install it via pip, as it isn't supported on Windows either, lol.

What I finally found is that when using uvicorn to start the server, I can avoid the --reload flag, which actually fixes the problem. But then I have to restart the server after every change, since without --reload it doesn't watch for changes.
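For reference, the difference is only the flag (a sketch; filename is a placeholder for the module name):

```shell
# Starts a reloader parent that spawns the app in a child process;
# on Windows this combination triggered the NotImplementedError here:
#   uvicorn filename:app --reload

# Running without --reload worked in this setup (but requires manual restarts):
uvicorn filename:app --host 0.0.0.0 --port 8000
```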

For now I've switched to using nodemon in my development environment. The workaround I found only works with crawl4ai==0.3.71; I tried it with 0.4.0, but even without the --reload flag the NotImplementedError persists.

For now I'll stick with 0.3.71, as it seems more stable for my use case. Thanks to both of you for looking into the issue, @unclecode @hitesh22rana.

Aniket1026 avatar Dec 09 '24 09:12 Aniket1026

@Aniket1026 You’re very welcome, no worries! I definitely want to ensure you get a good response with 0.4.x. Please share as many details about your platform as you can, so I can try to simulate it and reproduce the issue myself.

unclecode avatar Dec 09 '24 13:12 unclecode

I am also facing the same issue on Windows.

psychicDivine avatar Dec 17 '24 17:12 psychicDivine

@psychicDivine Would you please share your code snippet as well? Thx

unclecode avatar Dec 23 '24 13:12 unclecode

Hey, I solved the NotImplementedError on Windows:

import asyncio
from asyncio import ProactorEventLoop
from crawl4ai import AsyncWebCrawler, CacheMode
import streamlit as st

asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

st.title("crawl4AI")
if "crawl_result" not in st.session_state:
    st.session_state["crawl_result"] = ""

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.inven.co.kr/board/lostark/6271?my=chuchu")
        print(result.markdown)
        st.session_state["crawl_result"] = result.markdown
        st.write(st.session_state["crawl_result"])

if __name__ == "__main__":
    asyncio.run(main())

The fix is changing your event loop policy to WindowsProactorEventLoopPolicy().

rech4210 avatar Dec 31 '24 07:12 rech4210

@rech4210 I will try to handle this in the code by checking the environment.

unclecode avatar Jan 01 '25 12:01 unclecode

Hi,

I am experiencing the same issue on Windows. When using the AsyncWebCrawler to crawl a URL, the code raises a NotImplementedError in asyncio.create_subprocess_exec.

Here are my environment details:

  • Windows Version: Windows 11 24H2
  • Python Version: 3.11.7
  • crawl4ai Version: 0.4.246

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://cbr.ru/news"
        )
        print(result.markdown)

asyncio.run(main())

On macOS, it worked fine.

Xtreemrus avatar Jan 04 '25 18:01 Xtreemrus

@Xtreemrus Please try asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy()) at the beginning of your code and let me know if it helps.
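That policy call can also be wrapped in a platform guard so the same script keeps working on macOS and Linux; a sketch, not from the thread:

```python
import asyncio
import sys

# On Windows, force the Proactor loop (which supports subprocesses);
# other platforms keep their default policy.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

async def ping() -> str:
    # Stand-in for the real crawl coroutine.
    await asyncio.sleep(0)
    return "loop ok"

print(asyncio.run(ping()))  # → loop ok
```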

unclecode avatar Jan 05 '25 11:01 unclecode

I am also facing the same issue on Windows.

Running the script directly works fine, but calling it through a FastAPI endpoint raises the error. After many attempts, I found that uvicorn cannot be started with the --reload parameter; if you remove it, the endpoint can be invoked correctly. I don't know why.

skywolf123 avatar Jan 17 '25 10:01 skywolf123

And using asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy()) at the beginning of the code does not work for me.

skywolf123 avatar Jan 17 '25 10:01 skywolf123

@skywolf123 Would you please explain where you removed the --reload flag?

unclecode avatar Jan 17 '25 13:01 unclecode

@unclecode In my opinion, what @skywolf123 meant was the use of the --reload flag while running the FastAPI app via uvicorn.

E.g.

uvicorn app.main:app --reload

This seems to be similar to the issue shared earlier by @Aniket1026: https://github.com/unclecode/crawl4ai/issues/282#issuecomment-2527408422

hitesh22rana avatar Jan 18 '25 14:01 hitesh22rana

@hitesh22rana Yes, I understood that part. I just want to know from which file or part of Crawl4AI he found this. I want to check if I missed removing --reload from somewhere after debugging, because I can’t find it in the FastAPI server within the Docker setup.

unclecode avatar Jan 18 '25 15:01 unclecode

Oh, I meant that @skywolf123 might be running their own FastAPI server locally with the reload flag enabled. From there, they could be invoking a function via an endpoint, which in turn calls crawl4ai.

This doesn't seem to have anything to do with crawl4ai directly, though, as the --reload flag is likely part of their local setup rather than the Dockerized FastAPI server.

hitesh22rana avatar Jan 19 '25 08:01 hitesh22rana

I see, yes, most likely you are right. Thanks for the explanation, @hitesh22rana.

unclecode avatar Jan 19 '25 10:01 unclecode

Thanks @unclecode and @rech4210. I was having the NotImplementedError problem as well when using Streamlit, and thanks to asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy()) I was able to solve it. I would appreciate it if you could tell me what the underlying problem is and how changing the event loop policy fixes it. Thanks again!!

VAIBHAVAGARWAL12 avatar Mar 17 '25 05:03 VAIBHAVAGARWAL12

Ran into the same error when executing crawl4ai_quickstart.ipynb on:

  • windows 11
  • python 3.11 and 3.12
  • crawl4ai 0.5.0.post4

It was solved by running the code as a script instead of in the Jupyter notebook: https://stackoverflow.com/questions/44633458/why-am-i-getting-notimplementederror-with-async-and-await-on-windows/76981596#76981596
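When running as a script is not an option, one possible workaround (a sketch, assuming the notebook's own event loop cannot be replaced) is to run the coroutine on a fresh loop in a worker thread; `run_in_thread` is a hypothetical helper, not a crawl4ai API:

```python
import asyncio
import sys
import threading

def run_in_thread(coro):
    # Run a coroutine on a brand-new event loop in a worker thread.
    # The worker thread has no running loop, so asyncio.run() works even
    # inside Jupyter; on Windows the Proactor policy set below makes the
    # new loop support subprocess spawning (needed by Playwright).
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
    box = {}

    def target():
        box["result"] = asyncio.run(coro)

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return box["result"]

async def demo():
    # Stand-in for a crawler.arun(...) call.
    await asyncio.sleep(0)
    return 42

print(run_in_thread(demo()))  # → 42
```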

Einengutenmorgen avatar Mar 18 '25 10:03 Einengutenmorgen

@Aniket1026

you can do this in windows

main.py

import asyncio
import sys
import os

# NOTE: the event loop policy must be set before importing any other module
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

from fastapi import FastAPI
import uvicorn

# Replace with your actual routing module name
from your_router_module import router  

app = FastAPI()
app.include_router(router)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

your_router.py


from fastapi import APIRouter, HTTPException, status
from fastapi.responses import JSONResponse

from crawl4ai import *


router = APIRouter(
    prefix="/crawl4ai",
)

async def test_crawl4ai_async():
    """test crawl """
    try:
        async with AsyncWebCrawler(
            headless=True,
            verbose=False
        ) as crawler:
            result = await crawler.arun(
                url="https://www.nbcnews.com/business",
                word_count_threshold=10,
                bypass_cache=True
            )
            
            return JSONResponse({
                "status": "success",
                "method": "async",
                "data": result.markdown[:1000] + "..." if len(result.markdown) > 1000 else result.markdown,
                "full_length": len(result.markdown)
            }, status_code=status.HTTP_200_OK)
            
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"error: {str(e)}"
        )

tuonizhysg avatar Jun 04 '25 07:06 tuonizhysg