(Question) How to retain specific HTML tags (e.g., <span class="entity-embed">) in HTML-to-Markdown conversion without converting them?

Open truonghoangnguyen opened this issue 1 year ago • 1 comments

I'm working on a web crawling project where I need to convert HTML content into Markdown. However, I want certain HTML tags, like ..., to remain in their original HTML form in the Markdown output, without being converted.

Currently, when I run the conversion, all tags are transformed into Markdown, which removes specific structures I need to keep intact. Is there a way to retain specific tags or classes during the HTML-to-Markdown conversion?
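
To make the request concrete, here is a minimal sketch of the behaviour being asked for, using only Python's standard library. It is not crawl4ai's implementation: `PreservingConverter` and `to_markdown` are hypothetical names, and the converter only handles headings, which is enough to show a listed tag passing through as raw HTML while everything else is converted.

```python
from html.parser import HTMLParser


class PreservingConverter(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical, not crawl4ai's code).

    Headings h1-h3 become '#'-style Markdown unless their tag name is
    listed in preserve_tags, in which case the tag is emitted verbatim.
    All other tags are dropped and only their text is kept.
    """

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self, preserve_tags=()):
        super().__init__(convert_charrefs=True)
        self.preserve_tags = set(preserve_tags)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.preserve_tags:
            # Re-emit the original start tag text, attributes included.
            self.out.append(self.get_starttag_text())
        elif tag in self.HEADINGS:
            self.out.append("\n" + self.HEADINGS[tag])

    def handle_endtag(self, tag):
        if tag in self.preserve_tags:
            self.out.append(f"</{tag}>")
        elif tag in self.HEADINGS:
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(html, preserve_tags=()):
    """Convert html to Markdown, keeping preserve_tags as raw HTML."""
    parser = PreservingConverter(preserve_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

With `preserve_tags=["span"]`, a heading is still converted while `<span class="entity-embed">` survives untouched; with an empty list, every tag is converted or stripped.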

truonghoangnguyen • Oct 30 '24 02:10

Hi @truonghoangnguyen, thanks for the suggestion. I took it into consideration and updated the code. It's now available in branch 0.3.73 and will soon be pushed to the main branch. Thanks again.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        sleep_on_close=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com",
            bypass_cache=True,
            html2text={
                'preserve_tags': ['h2']
            },
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

unclecode • Nov 03 '24 16:11

I tried this in the latest version, but it did not take effect.

ddnomber • Dec 27 '24 17:12

@ddnomber check out my code (it keeps the video tag):

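import asyncio
from crawl4ai import AsyncWebCrawler

URL = "https://crawl4ai.com"  # placeholder from the example above; use your own URL

# Keep <video> tags as raw HTML instead of converting them
preserve_tags = {"html2text": {"preserve_tags": ["video"]}}
kw = {"url": URL}  # or build this dict with your own config helper
kw.update(preserve_tags)

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(**kw, bypass_cache=True)
        print(result.markdown)

asyncio.run(main())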

truonghoangnguyen • Dec 28 '24 02:12

Here's my code. The h1 and h2 tags are still converted to Markdown format, and I'm not sure what the problem is:

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        url=url,
        word_count_threshold=40,
        excluded_tags=['nav', 'footer', 'aside'],
        remove_overlay_elements=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        cache_mode=CacheMode.ENABLED,
        html2text={
            'preserve_tags': ['h1', 'h2']
        },
        exclude_social_media_links=True,
        exclude_social_media_domains=[
            "facebook.com", "twitter.com", "instagram.com", "linkedin.com", "youtube.com", "tiktok.com",
        ],
    )
    print(result.markdown)

ddnomber • Dec 28 '24 08:12

@truonghoangnguyen @ddnomber and everyone else interested in this: here is the new way to do it in 0.4.x:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            options={
                'preserve_tags': ['h2']
            }
        ),
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url='https://crawl4ai.com',
            config=crawl_config
        )
        if result.success:
            markdown = result.markdown_v2.raw_markdown
            assert '<h2>' in markdown
            assert '</h2>' in markdown

if __name__ == "__main__":
    asyncio.run(main())

unclecode • Dec 28 '24 12:12
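
The assertions in the 0.4.x example check for a single bare `<h2>` pair. When preserving several tags (such as `h1` and `h2` earlier in the thread), a small helper can report which ones did not survive. This is a sketch only: `missing_preserved_tags` is a hypothetical name, not part of crawl4ai, and it assumes a preserved tag shows up in the Markdown as ordinary `<tag ...>` and `</tag>` text.

```python
import re


def missing_preserved_tags(markdown, tags):
    """Return the tags from `tags` that do not appear as a raw HTML
    open/close pair anywhere in `markdown`."""
    missing = []
    for tag in tags:
        # Match <tag> or <tag attr="..."> but not longer names such as <h20>.
        has_open = re.search(rf"<{re.escape(tag)}(\s[^>]*)?>", markdown)
        has_close = f"</{tag}>" in markdown
        if not (has_open and has_close):
            missing.append(tag)
    return missing
```

Calling it on the crawl result's Markdown string with `["h1", "h2"]` returns an empty list when both tags were kept, and otherwise names the tags that were still converted.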