[Bug]: Wrong URL variable used for extraction of raw html
crawl4ai version
0.6.3
Expected Behavior
Extraction sends the preferred content (in my case markdown) and url for extraction.
Current Behavior
https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L610 When raw HTML is used as input, the url variable at this line contains the full HTML content.
Just above that line, the _url variable is assigned in a way that accounts for raw HTML. I suspect that _url should be used here instead of url.
As a result, much more content (in my case about 5x) is sent to the LLM.
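To illustrate the suspected problem, here is a minimal sketch, not the actual crawl4ai code. The helper names (safe_url, build_prompt), the "Raw HTML" placeholder, and the prompt template are all hypothetical; the only thing taken from this thread is that raw input carries the whole HTML document in the url value.

```python
# Hypothetical sketch: none of these names come from crawl4ai itself.
RAW_PREFIXES = ("raw://", "raw:")  # assumed prefixes marking raw HTML input


def safe_url(url: str, placeholder: str = "Raw HTML") -> str:
    """Return a short label instead of the whole document for raw input."""
    return placeholder if url.startswith(RAW_PREFIXES) else url


def build_prompt(url: str, markdown: str) -> str:
    """Assumed prompt template that embeds both the URL and the content."""
    return f"URL: {url}\n\nContent:\n{markdown}"


# Raw input: the "url" value is really the full HTML document.
raw_input = "raw:<html><body>" + "<p>hello world</p>" * 1000 + "</body></html>"
markdown = "hello world\n" * 1000

buggy_prompt = build_prompt(raw_input, markdown)            # HTML leaks into the prompt
fixed_prompt = build_prompt(safe_url(raw_input), markdown)  # only the placeholder is sent

print(len(buggy_prompt), len(fixed_prompt))
```

With a real URL as input, safe_url returns it unchanged, so only the raw case is affected.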
Is this reproducible?
Yes
Inputs Causing the Bug
Use a relatively large raw HTML string as input.
Steps to Reproduce
Compare the LLM token usage with the markdown size (I estimated tokens roughly as string length / 4).
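The rough estimate mentioned above can be sketched as follows. The 4-characters-per-token figure is only a rule of thumb, and the function name estimate_tokens is made up for illustration; a real tokenizer will give different numbers.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: string length divided by an assumed chars-per-token ratio."""
    return len(text) // chars_per_token


# Hypothetical sample sizes, only to show the comparison.
markdown = "x" * 10_000   # stands in for the generated markdown
raw_html = "x" * 50_000   # stands in for the raw HTML input

print(estimate_tokens(markdown), estimate_tokens(raw_html))
```

If the reported prompt tokens are far above the markdown estimate and closer to the estimate for markdown plus raw HTML, the HTML is probably being sent as well.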
Code snippets
OS
Linux
Python version
3.11
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
Related to above, I think the following line would also need _url instead of url
https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py#L478
At that point in the function (aprocess_html), is there any valid reason to use the url variable, which contains the full raw HTML when raw input is used? The full HTML is already in the html variable, yet the function uses url (as opposed to _url) in plenty of places.
I'm having this problem as well. There is essentially no way to pass in raw:// html and have the LLM use the generated markdown.
Here is the token count when I ensure fit markdown is being used by passing a url instead of raw html:
=== Token Usage Summary ===
Type            Count
------------------------------
Completion        365
Prompt          2,629
Total           2,994
When I pass in the raw HTML instead, these are the token counts I get:
=== Token Usage Summary ===
Type            Count
------------------------------
Completion        637
Prompt         38,368
Total          39,005
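For scale, the two summaries above can be compared directly. This is plain arithmetic on the reported numbers; the variable names are only illustrative.

```python
# Token counts copied from the two summaries above.
url_run = {"completion": 365, "prompt": 2_629, "total": 2_994}
raw_run = {"completion": 637, "prompt": 38_368, "total": 39_005}

prompt_ratio = raw_run["prompt"] / url_run["prompt"]
extra_prompt_tokens = raw_run["prompt"] - url_run["prompt"]

# Prints: prompt tokens: 14.6x, 35739 extra
print(f"prompt tokens: {prompt_ratio:.1f}x, {extra_prompt_tokens} extra")
```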
@stevefusaro in the meantime, feel free to use my fork until this is addressed here: https://github.com/djl0/crawl4ai (pip install git+https://github.com/djl0/crawl4ai.git)
And just to clarify something from your comment: I believe it actually does pass the markdown content, but it ALSO sends the full raw HTML in the url variable (which obviously wastes a ton of tokens, and I imagine LLMs handle markdown better anyway).
I’ve opened a PR to fix this: #1447