
[Bug]: Wrong URL variable used for extraction of raw html

Open djl0 opened this issue 6 months ago • 6 comments

crawl4ai version

0.6.3

Expected Behavior

Extraction sends the preferred content (in my case, markdown) and the URL to the LLM.

Current Behavior

When using raw HTML, the url variable at this line contains the full HTML content: https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L610

Above that line, the _url variable is assigned, and it does take raw HTML into account. I suspect that _url should be used here instead of url.

This issue causes much more content (in my case 5x) to be sent to the LLM.
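To illustrate the suspected pattern, here is a minimal sketch. It is not the actual crawl4ai code: `display_url`, `build_prompt`, the "Raw HTML" label, and the accepted prefixes are assumptions made up for this example.

```python
# Hypothetical illustration of the url vs. _url mix-up (not crawl4ai's real code).
RAW_PREFIXES = ("raw://", "raw:")  # assumed prefixes marking raw HTML input


def display_url(url: str) -> str:
    """Return a short label for raw HTML input, or the URL unchanged."""
    return "Raw HTML" if url.startswith(RAW_PREFIXES) else url


def build_prompt(url: str, content: str) -> str:
    """Stand-in for whatever assembles the extraction prompt."""
    return f"URL: {url}\n\nCONTENT:\n{content}"


# With raw input, the "url" argument is really the whole HTML document.
url = "raw://<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"
markdown = "hello\n" * 10  # the much smaller generated markdown

_url = display_url(url)  # safe label, analogous to the _url variable

buggy_prompt = build_prompt(url, markdown)   # full HTML leaks into the prompt
fixed_prompt = build_prompt(_url, markdown)  # only the label and the markdown

print(len(buggy_prompt), len(fixed_prompt))
```

In the buggy variant the markdown is still sent, but the full HTML rides along in the URL field, which would explain the inflated prompt size.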

Is this reproducible?

Yes

Inputs Causing the Bug

Use a relatively large raw HTML string as input.

Steps to Reproduce

Compare the LLM token usage against the markdown size (I did a rough calculation of string length / 4).
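The rough calculation can be sketched as follows. The 4-characters-per-token ratio is only a rule of thumb, and the function name and sample strings are placeholders invented for illustration.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: string length divided by ~4 characters per token."""
    return len(text) // chars_per_token


# Placeholder inputs; substitute the real generated markdown and raw HTML.
markdown = "# Title\n\nSome extracted text.\n" * 50
raw_html = "<div class='x'><p>Some extracted text.</p></div>\n" * 200

expected = estimate_tokens(markdown)
with_raw_html = expected + estimate_tokens(raw_html)

print(f"expected ~{expected} prompt tokens, with raw HTML ~{with_raw_html}")
```

If the prompt token count reported by the LLM provider is far above the markdown-only estimate, the raw HTML is probably being sent as well.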

Code snippets


OS

Linux

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

djl0 • May 14 '25 23:05

Related to the above, I think the following line would also need _url instead of url: https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L478

At that point in the function (aprocess_html), is there any valid reason to use the url variable, which contains the full raw HTML when raw input is used? The full HTML is already in the html variable, yet the function uses url (as opposed to _url) in plenty of places.

djl0 • May 15 '25 00:05

I'm having this problem as well. There is essentially no way to pass in raw:// HTML and have the LLM use only the generated markdown.

Here is the token count when I ensure fit markdown is being used by passing a URL instead of raw HTML:

=== Token Usage Summary ===
Type                   Count
------------------------------
Completion               365
Prompt                 2,629
Total                  2,994

Then, when I pass in the raw HTML, these are the token counts I get:

=== Token Usage Summary ===
Type                   Count
------------------------------
Completion               637
Prompt                38,368
Total                 39,005
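For scale, the arithmetic on the two summaries above can be worked out as below. The variable names are illustrative; the numbers are copied from the prompt rows of the two tables.

```python
# Prompt token counts copied from the two summaries above.
prompt_with_url = 2_629        # URL input, fit markdown used
prompt_with_raw_html = 38_368  # raw HTML input

extra_tokens = prompt_with_raw_html - prompt_with_url
ratio = prompt_with_raw_html / prompt_with_url

print(f"extra prompt tokens: {extra_tokens}")
print(f"inflation factor: {ratio:.1f}x")
```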

stevefusaro • May 18 '25 18:05

@stevefusaro In the meantime, feel free to use my fork until this is addressed here: https://github.com/djl0/crawl4ai (pip install git+https://github.com/djl0/crawl4ai.git)

And just to clarify something from your comment: I believe it actually does pass the markdown content. However, it ALSO sends the full raw HTML in the url variable, which obviously wastes a ton of tokens. I also imagine LLMs handle markdown better anyway.

djl0 • May 19 '25 01:05

I’ve opened a PR to fix this: #1447

rbushri • Aug 28 '25 08:08