(Question) How to retain specific HTML tags (e.g., <span class="entity-embed">) in HTML-to-Markdown conversion without converting them?
I'm working on a web crawling project where I need to convert HTML content into Markdown. However, I want certain HTML tags, such as <span class="entity-embed">, to remain in their original HTML form in the Markdown output, without being converted.
Currently, when I run the conversion, all tags are transformed into Markdown, which removes specific structures I need to keep intact. Is there a way to retain specific tags or classes during the HTML-to-Markdown conversion?
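As background, the usual trick, independent of any particular library, is to swap the tags you want to keep for placeholder tokens before conversion and splice the raw HTML back in afterwards. Below is a minimal sketch of that idea, assuming beautifulsoup4 and markdownify are installed; the helper name and selector are illustrative only, not crawl4ai's internals.

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def html_to_md_preserving(html, keep_selector="span.entity-embed"):
    """Convert HTML to Markdown while keeping matched tags as raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    preserved = {}
    for i, tag in enumerate(soup.select(keep_selector)):
        token = f"@@PRESERVE{i}@@"
        preserved[token] = str(tag)   # remember the original HTML
        tag.replace_with(token)       # shield it from the converter
    markdown = md(str(soup))          # convert everything else
    for token, raw_html in preserved.items():
        markdown = markdown.replace(token, raw_html)  # restore raw tags
    return markdown
```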
Hi @truonghoangnguyen, thanks for the suggestion. I took it into consideration and updated the code. It's now available in branch 0.3.73 and will soon be pushed to the main branch. Thanks again.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        sleep_on_close=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com",
            bypass_cache=True,
            html2text={
                'preserve_tags': ['h2']
            },
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
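To confirm the option took effect, a quick sanity check is to look for the preserved tags in the raw Markdown output, e.g. inside main() right after the crawl. This check is illustrative and not part of the original snippet.

```python
# Sanity check inside main(): preserved tags should appear verbatim
# in the Markdown output rather than being converted to '## ...'.
if "<h2>" in result.markdown and "</h2>" in result.markdown:
    print("h2 tags preserved")
else:
    print("h2 tags were converted -- preserve_tags did not take effect")
```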
I tried this usage in the latest version, but it did not take effect.
@ddnomber check out my code (it keeps the video tag):
```python
preserve_tags = {"html2text": {"preserve_tags": ["video"]}}
kw = {}  # crawl4ai_config(url=URL)
kw.update(preserve_tags)

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        **kw,  # $"#main > article",
        bypass_cache=True
    )
```
Here's my code; the h1 and h2 tags are still converted to Markdown format, and I'm not sure what the problem is.
```python
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        url=url,
        word_count_threshold=40,
        excluded_tags=['nav', 'footer', 'aside'],
        remove_overlay_elements=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        cache_mode=CacheMode.ENABLED,
        html2text={
            'preserve_tags': ['h1', 'h2']
        },
        exclude_social_media_links=True,
        exclude_social_media_domains=[
            "facebook.com", "twitter.com", "instagram.com",
            "linkedin.com", "youtube.com", "tiktok.com",
        ]
    )
    print(result.markdown)
```
@truonghoangnguyen @ddnomber and everyone else interested in this, here is the new way to use this in 0.4.x:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            options={
                'preserve_tags': ['h2']
            }
        )
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url='https://crawl4ai.com',
            config=crawl_config
        )
        if result.success:
            markdown = result.markdown_v2.raw_markdown
            assert '<h2>' in markdown
            assert '</h2>' in markdown

if __name__ == "__main__":
    asyncio.run(main())
```
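Tying this back to the original question, the same 0.4.x configuration should, by the same mechanism, keep <span> elements such as <span class="entity-embed"> intact. The sketch below reuses the options shown above; note that preserve_tags as shown in this thread works at the tag level, so every <span> is kept, not only those with a given class.

```python
# Sketch: same 0.4.x pattern, preserving <span> so that elements such as
# <span class="entity-embed"> survive conversion. Tag-level only: every
# <span> is kept, not just those with a particular class.
span_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator(
        options={'preserve_tags': ['span']}
    )
)
```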