
cannot access local variable 'filtered_html'

Open valtahomes opened this issue 1 year ago • 9 comments

Hi,

New here. Can't run the sample code with the error:

code:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())

Got the following error:

[INIT].... → Crawl4AI 0.3.741
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 0.02s
[SCRAPE].. ◆ Processed https://www.nbcnews.com/business... | Time: 39ms
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 0.07s
Error using new markdown generation strategy: cannot access local variable 'filtered_html' where it is not associated with a value

Any idea? Thanks.
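For readers unfamiliar with this message: it is the wording Python 3.11+ uses for an UnboundLocalError, which is raised when a function reads a local name on a code path where it was never assigned. The sketch below reproduces that class of error in isolation. The functions `generate_markdown` and `generate_markdown_fixed` and the `content_filter` parameter are hypothetical stand-ins chosen for illustration; this is not crawl4ai's actual code, and the "fixed" variant is not necessarily how the library was patched.

```python
def generate_markdown(html, content_filter=None):
    """Hypothetical stand-in, NOT crawl4ai's real implementation."""
    if content_filter is not None:
        # filtered_html is bound only on this branch
        filtered_html = content_filter(html)
    # With no filter, the name was never assigned, so this read fails
    return filtered_html


def generate_markdown_fixed(html, content_filter=None):
    """Same hypothetical function with the name bound on every path."""
    filtered_html = html  # default value before any branch
    if content_filter is not None:
        filtered_html = content_filter(html)
    return filtered_html


def demo():
    """Trigger the error and return its type name and message."""
    try:
        generate_markdown("<p>hi</p>")
    except UnboundLocalError as exc:
        return type(exc).__name__, str(exc)
    return None


if __name__ == "__main__":
    print(demo())
    print(generate_markdown_fixed("<p>hi</p>"))
```

Calling the first function with a filter works, which is why this kind of bug can go unnoticed until someone runs the default configuration.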

valtahomes avatar Nov 24 '24 04:11 valtahomes

Plus one here

didntpay avatar Nov 24 '24 05:11 didntpay

I am also facing the same issue.

Krish-Goyani avatar Nov 24 '24 13:11 Krish-Goyani

+1

OctAg0nO avatar Nov 24 '24 18:11 OctAg0nO

I have a PR to fix this, but it has not been merged yet. In the meantime, either of these workarounds should help:

Use an older version

pip install --force-reinstall -v "crawl4ai==0.3.731"

Fix locally

  1. git clone the repository and apply the fix locally, as I did
  2. Run pip install -e . to reinstall the package with your local change

leonson avatar Nov 24 '24 20:11 leonson

+1

vetharupini avatar Nov 25 '24 10:11 vetharupini

+1

marioguima avatar Nov 25 '24 19:11 marioguima

Similarly having this issue

Ches-ctrl avatar Nov 26 '24 16:11 Ches-ctrl

+1

lexang avatar Nov 27 '24 05:11 lexang

Hey everyone, sorry for the inconvenience. I've already merged the PR, thanks @leonson. Version 0.3.743 will include the patch; I'll release it tonight. For a detailed explanation, please see my comment on this issue: https://github.com/unclecode/crawl4ai/issues/287#issuecomment-2503669235
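To check whether an environment already has the patched release, something like the following sketch may help. It relies only on the standard library's `importlib.metadata`. The helper names `parse_version`, `has_patch`, and `installed_has_patch` are made up for this example, and the parser assumes a plain dotted-integer version string such as 0.3.741 (pre-release suffixes are not handled). Note that it only answers "is this at least 0.3.743?"; the older 0.3.731 mentioned above as a workaround would report False even though it was said to work.

```python
from importlib.metadata import PackageNotFoundError, version

# First release said in this thread to carry the fix
PATCHED = (0, 3, 743)


def parse_version(text):
    """Turn '0.3.741' into (0, 3, 741); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def has_patch(text):
    """True if the given version string is at or above the patched release."""
    return parse_version(text) >= PATCHED


def installed_has_patch(package="crawl4ai"):
    """True/False for the installed package, or None if it is not installed."""
    try:
        return has_patch(version(package))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_has_patch())
```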

Make sure to update to the latest version by tomorrow, then try this:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="URL",
            cache_mode=CacheMode.BYPASS,
        )
        print(len(result.markdown_v2.raw_markdown))

        # For compatibility with previous versions, you can still use:
        # print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())
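If the same script has to run against both older and newer releases, one option is a small helper that prefers `result.markdown_v2.raw_markdown` and falls back to `result.markdown`. This is only a sketch: `raw_markdown_of` is a hypothetical helper, not part of crawl4ai, and the `SimpleNamespace` objects below are stand-ins for a real crawl result so the snippet runs without a network call.

```python
from types import SimpleNamespace


def raw_markdown_of(result):
    """Prefer result.markdown_v2.raw_markdown, else fall back to result.markdown."""
    v2 = getattr(result, "markdown_v2", None)
    text = getattr(v2, "raw_markdown", None)
    if text is None:
        text = getattr(result, "markdown", None)
    # Normalize a missing or empty value to an empty string
    return text or ""


if __name__ == "__main__":
    # Stand-in objects, NOT real crawl4ai results
    new_style = SimpleNamespace(
        markdown_v2=SimpleNamespace(raw_markdown="# Title"),
        markdown="# Title",
    )
    old_style = SimpleNamespace(markdown="# Old")
    print(len(raw_markdown_of(new_style)))
    print(len(raw_markdown_of(old_style)))
```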

@valtahomes @didntpay @Krish-Goyani @OctAg0nO @vetharupini @marioguima @Ches-ctrl @lexang

unclecode avatar Nov 27 '24 12:11 unclecode

Closing this issue, since the patch is already released!

aravindkarnam avatar Feb 14 '25 04:02 aravindkarnam