httpx logs at INFO level are printed after importing crawl4ai

[Open] uamin-qlu opened this issue 11 months ago • 9 comments

When I import crawl4ai and then make any httpx request, the INFO-level logs for that request are printed to the terminal. I have to suppress the httpx logger explicitly to stop them from printing:

import asyncio
import logging

import httpx
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

# Workaround: explicitly silence the httpx logger, which starts emitting
# INFO records after crawl4ai is imported.
logging.getLogger("httpx").setLevel(logging.WARNING)
# logging.getLogger("httpcore").setLevel(logging.WARNING)
# logging.basicConfig(level=logging.WARNING)


async def main():
    browser_config = BrowserConfig(
        verbose=False,
        text_mode=True
    )

    md_generator = DefaultMarkdownGenerator(
        options={
            "ignore_links": True,
            "escape_html": False,
            "skip_internal_links": True
        }
    )

    run_config = CrawlerRunConfig(
        markdown_generator=md_generator,
        excluded_tags=['form', 'header', 'input', 'button', 'nav', 'footer', 'a'],
        exclude_external_links=True,
        remove_overlay_elements=True,
        cache_mode=CacheMode.BYPASS,  # was cache_mode=False; CrawlerRunConfig expects a CacheMode
        verbose=False
    )

    async with AsyncWebCrawler(config=browser_config, verbose=False) as crawler:
        result = await crawler.arun(
            url="http://www.example.com",
            config=run_config
        )

        if result.success:
            # Print the length of the cleaned markdown content
            print(f"LEN CONTENT = {len(result.markdown)}")
            # print("Content:", result.markdown)
            return result.markdown
        else:
            print(f"Crawl failed: {result.error_message}")


data = asyncio.run(main())  # was `data = await main()`, which only works in a notebook
httpx.get("https://www.google.com")

uamin-qlu avatar Jan 14 '25 06:01 uamin-qlu

Hi @uamin-qlu,

Could you please try the code that I attached as a text file to this thread?

Please share your feedback after running it.

Thank you! Attachment: import_asyncio.txt

devatbosch avatar Jan 15 '25 07:01 devatbosch

I'm facing a similar issue here, and not just with the httpx request logger. After importing crawl4ai, almost all of my internal dependencies' loggers, like the pika adapter logger, the Azure cloud SDK logger, and more, print output that I don't need. Something is definitely wrong; my guess is that it is caused by these settings (a minimal demonstration of the mechanism follows the links below):

  • https://github.com/unclecode/crawl4ai/blob/8878b3d032fb21ce3567b34db128bfa64687198a/main.py#L38
  • https://github.com/unclecode/crawl4ai/blob/8878b3d032fb21ce3567b34db128bfa64687198a/crawl4ai/async_database.py#L18
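
A minimal sketch of the suspected mechanism (illustrative only, not crawl4ai's actual code): calling logging.basicConfig at import time attaches a handler to the root logger and lowers its level, so every library logger in the process starts printing.

import logging

# Before any configuration, the root logger has level WARNING and no
# handlers, so this INFO record is dropped silently.
logging.getLogger("somelib").info("not printed")

# What an importable library should not do, but what the linked lines
# appear to do: configure the process-wide root logger at import time.
logging.basicConfig(level=logging.INFO)

# Every logger now inherits INFO and the root handler, so httpx, pika,
# the Azure SDKs, and everything else start printing.
logging.getLogger("somelib").info("now printed")  # -> INFO:somelib:now printed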

PrabuDzak avatar Jan 15 '25 08:01 PrabuDzak

@devatbosch, thanks for getting in touch. I tried running the code in the text file; it still gives me logs if I remove the statements that explicitly set the httpx logger to WARNING and above. I am hoping for a solution where I do not need to change any logging settings on my server. Please let me know if that is possible.
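
For completeness, a sketch of a workaround that preserves the existing logging settings rather than changing them (assuming the import-time basicConfig calls linked above are the cause): snapshot the root logger before the import and restore it afterwards.

import logging

# Save the root logger's configuration before importing crawl4ai ...
root = logging.getLogger()
saved_level = root.level
saved_handlers = root.handlers[:]

import crawl4ai  # noqa: E402  (deliberately imported after the snapshot)

# ... and restore it afterwards, undoing any root-logger changes made
# at import time.
root.setLevel(saved_level)
root.handlers = saved_handlers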

uamin-qlu avatar Jan 15 '25 08:01 uamin-qlu

@uamin-qlu Can you share the output log you received? I don't see anything on my end. @PrabuDzak, the same goes for you; could you please share a code snippet and a sample of the output logs? On my end, even without any change to the httpx log level, the only thing I see is "LEN CONTENT = 170".

unclecode avatar Jan 15 '25 14:01 unclecode

@unclecode something like this should be enough to isolate and replicate the issue:

#!/usr/bin/env python

import crawl4ai
import httpx


def main():
    httpx.get("https://www.google.com")
    httpx.get("https://www.github.com")


if __name__ == "__main__":
    main()

No logs should be printed when the script above is executed, but with crawl4ai imported, httpx's internal logger prints an INFO record for every request made, something like this:

INFO:httpx:HTTP Request: GET https://www.google.com "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://www.github.com "HTTP/1.1 301 Moved Permanently"

When the crawl4ai import is commented out or removed, the internal httpx logs are gone. The example above uses httpx, but this affects the internal logging of almost every other module.

This issue creates unnecessary noise when trying to debug by reading live environment logs.
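
To confirm the guess above, a quick diagnostic sketch: inspect the root logger right after the import.

import logging

import crawl4ai  # noqa: F401

root = logging.getLogger()
# If crawl4ai configured the root logger at import time, this prints an
# INFO level and at least one handler attached by logging.basicConfig.
print("root level:", logging.getLevelName(root.level))
print("root handlers:", root.handlers)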

PrabuDzak avatar Jan 15 '25 15:01 PrabuDzak

@PrabuDzak I don't receive such logs on my output. Can you create a new virtual environment and install only Crawl4ai? Then try it and let me know if this issue occurs or not. I guess this issue comes from another library on your machine that conflicts with some of the dependencies that Crawl4ai has, but it doesn't originate directly from Crawl4ai.

import crawl4ai
import httpx


def main():
    print("before")
    httpx.get("https://www.google.com")
    httpx.get("https://www.github.com")
    print("after")


if __name__ == "__main__":
    main()

Output

before
after

unclecode avatar Jan 16 '25 12:01 unclecode

@unclecode I tried running in a clean venv with just crawl4ai and httpx installed, and the logs are still printed.

I also tried running in a clean Docker container, and they are still printed.

main.py

import crawl4ai
import httpx


def main():
    print("before")
    httpx.get("https://www.google.com")
    httpx.get("https://www.github.com")
    print("after")


if __name__ == "__main__":
    main()

Dockerfile

FROM python:3.10-slim

RUN pip install crawl4ai httpx

ADD main.py main.py

CMD ["python", "main.py"]

Docker run

$ docker build -t crawl4ailog .
[+] Building 1.9s (8/8) FINISHED                                                                                                                                 docker:default
 => [internal] load build definition from Dockerfile                                                                                                                       0.0s
 => => transferring dockerfile: 139B                                                                                                                                       0.0s
 => [internal] load metadata for docker.io/library/python:3.10-slim                                                                                                        1.7s
 => [internal] load .dockerignore                                                                                                                                          0.0s
 => => transferring context: 2B                                                                                                                                            0.0s
 => [1/3] FROM docker.io/library/python:3.10-slim@sha256:a636f5aafba3654ac4d04d7c234a75b77fa26646fe0dafe4654b731bc413b02f                                                  0.0s
 => => resolve docker.io/library/python:3.10-slim@sha256:a636f5aafba3654ac4d04d7c234a75b77fa26646fe0dafe4654b731bc413b02f                                                  0.0s
 => [internal] load build context                                                                                                                                          0.0s
 => => transferring context: 238B                                                                                                                                          0.0s
 => CACHED [2/3] RUN pip install crawl4ai httpx                                                                                                                            0.0s
 => [3/3] ADD main.py main.py                                                                                                                                              0.0s
 => exporting to image                                                                                                                                                     0.0s
 => => exporting layers                                                                                                                                                    0.0s
 => => writing image sha256:dd925e8d891a16420bc2e6980dccb4bb4d3d5c71530f3cd1d84ef72db3505867                                                                               0.0s
 => => naming to docker.io/library/crawl4ailog                                                                                                                             0.0s
$ docker run crawl4ailog
before
INFO:httpx:HTTP Request: GET https://www.google.com "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://www.github.com "HTTP/1.1 301 Moved Permanently"
after
$ 

PrabuDzak avatar Jan 16 '25 12:01 PrabuDzak

Image: new venv, installed only crawl4ai

uamin-qlu avatar Jan 16 '25 13:01 uamin-qlu

@PrabuDzak are you using it in an interactive notebook?

unclecode avatar Jan 16 '25 13:01 unclecode

@unclecode no, I'm not. I'm running it in production on Docker.

PrabuDzak avatar Jan 16 '25 14:01 PrabuDzak

@PrabuDzak Because when I look at the image, it looks like Colab or an interactive notebook in VS Code, and that may actually be the reason; I haven't tested in that situation.
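
For what it's worth, one standard-library detail that could make the same import behave differently across environments (a hypothesis, not something verified in this thread): logging.basicConfig is a no-op when the root logger already has handlers, and some notebook environments pre-attach one. A sketch:

import logging

# Simulate an environment that pre-attaches a root handler, as some
# notebooks do on startup.
logging.getLogger().addHandler(logging.NullHandler())

logging.basicConfig(level=logging.INFO)  # silently does nothing here

# The root level is still the default WARNING, so INFO records stay hidden.
print(logging.getLevelName(logging.getLogger().level))  # -> WARNING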

unclecode avatar Jan 16 '25 14:01 unclecode

Hi @unclecode, thanks for looking into this. I am facing the same issue; here's my reproduction:

Testing with the Python debugger; Python 3.12.8; Windows 11. It is NOT an interactive notebook/shell. Crawl4AI==0.4.247, httpx==0.27.2.

Importing crawl4ai: Image

Disabling crawl4ai by commenting it out: Image

And here's without the Python debugger, just for good measure:

Image

mjoalameen avatar Jan 16 '25 15:01 mjoalameen

@mjoalameen Thanks for the details, really helpful. I found the issue and resolved it; the fix will be in the next version, which I'll release soon.
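
For reference, a sketch of the conventional pattern for library logging (an illustration, not the actual patch): configure a package-specific logger instead of calling logging.basicConfig, which mutates the process-wide root logger.

import logging

# Configure only the package's own logger; leave the root logger alone.
logger = logging.getLogger("crawl4ai")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))
logger.addHandler(handler)

# Prevent records from also bubbling up to the root logger's handlers.
logger.propagate = False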

unclecode avatar Jan 17 '25 09:01 unclecode