
Version 0.3.74 - Output of scraped website to markdown returns an error

Open kevintanhongann opened this issue 1 year ago • 5 comments

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            strategy="markdown"  # Use html2text strategy instead of default markdown
        )
        with open("micronaut_docs.md", "w", encoding="utf-8") as f:
            f.write(result.markdown)  # write the generated markdown to a file

if __name__ == "__main__":
    asyncio.run(main())

I was scraping this documentation site and it returned this error:

Error using new markdown generation strategy: cannot access local variable 'filtered_html' where it is not associated with a value

Is there a workaround for this? Thanks.
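For context, this is the wording Python 3.11+ uses for UnboundLocalError, raised when a variable is assigned only on some code paths. A minimal sketch of the failure mode, independent of crawl4ai's actual internals:

def build_result(use_filter: bool):
    if use_filter:
        filtered_html = "<p>filtered</p>"
    # When use_filter is False, filtered_html is never bound, so the next
    # line raises: UnboundLocalError: cannot access local variable
    # 'filtered_html' where it is not associated with a value
    return filtered_html

build_result(False)  # raises UnboundLocalError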

kevintanhongann · Nov 23 '24 14:11

+1 facing same issue

b-sai · Nov 23 '24 17:11

I encountered this too; it's an easy fix, so I made a pull request. To work around it locally, I believe you can clone the repository and make a one-line change in markdown_generation_strategy.py at line 104, changing

fit_html=filtered_html

to be

fit_html=filtered_html or None

Then in the cloned repository folder, do

pip install -e .

This installs your local crawl4ai with the fix (once crawl4ai releases an updated package, you should reinstall the official version).
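More generally, the robust fix for this class of error is to bind the variable before the conditional; a hypothetical sketch of the pattern, not the library's actual code:

def generate(content_filter=None):
    filtered_html = None  # bound up front, so later reads can never be unbound
    if content_filter is not None:
        filtered_html = "<p>filtered</p>"  # stand-in for the real filtering step
    return filtered_html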

leonson · Nov 23 '24 19:11

Getting this error as well.

chanmathew · Nov 24 '24 02:11

Facing this error too +1

adam-pb · Nov 25 '24 20:11

@kevintanhongann @chanmathew @adam-pb @leonson @b-sai Hello everybody, I made some changes and the code now runs without any issues. I released the new version tonight: 0.3.743. Please use the code below and pay attention to it; some of the code you guys shared is not correct.

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
        )
        print(len(result.markdown_v2.raw_markdown))

        # For compatibility with previous versions, still you can have it like below:
        # print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())

As you can see, you do not need to pass anything. Btw, I suggest checking result.markdown_v2.markdown_with_citations and result.markdown_v2.references_markdown.
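For example, to peek at those fields (assuming they are plain strings, as the examples above suggest):

# Citation-style variants of the output, per the note above:
print(result.markdown_v2.markdown_with_citations[:300])
print(result.markdown_v2.references_markdown[:300])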

To set the markdown generator strategy, you can follow this code:

result = await crawler.arun(
    url="https://docs.micronaut.io/4.7.6/guide/",
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator()
)

One more thing: if you want to try the experimental feature we're working on, called Fit Markdown, it produces a subset of the main markdown with less noise, removing whatever is not relevant to the main purpose of the page. To activate it, follow the code below, but remember, this is experimental.

# Using the same imports as the snippet above:
async def main():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                # BM25-based relevance filter used by the experimental Fit Markdown
                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
        )
        print(len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
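To mirror the original snippet and save the output to disk, a write along these lines (placed inside main(), after arun returns) should work:

# Persist the generated markdown, as in the original report;
# use .fit_markdown instead for the filtered subset.
with open("micronaut_docs.md", "w", encoding="utf-8") as f:
    f.write(result.markdown_v2.raw_markdown)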

By the way, such a long document 😅. The extracted markdown is 1,166,105 characters long, and the scraping procedure took around 20 seconds, which is pretty fast for a document of this size. Anyway, let me know if you guys have any issues.

unclecode · Nov 27 '24 11:11

Issue resolved in newer versions.

aravindkarnam · Jan 31 '25 18:01