[Bug]: Markdown output has incorrect spacing.
crawl4ai version
0.4.247
Expected Behavior
I'm trying to scrape a page from the Blender manual at https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html
The markdown should look a little more like this (scraped with jina-ai):
Notice the spacing between paragraphs.
Current Behavior
Instead it messes up the spacing like so:
Notice that the spacing between paragraphs is messed up. LLMs can pick up this paragraph proximity.
Is there any option in CrawlerRunConfig that I should know about that can fix this? @aravindkarnam @unclecode
Is this reproducible?
Yes
Inputs Causing the Bug
https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html
Steps to Reproduce
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector="#furo-main-content",
        excluded_selector=".toc-drawer, a.headerlink"
    )  # Default crawl run configuration
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html",
            config=run_config
        )
        # Export to markdown file
        with open('output.md', 'w', encoding='utf-8') as f:
            f.write(result.markdown)  # Write markdown content to file

if __name__ == "__main__":
    asyncio.run(main())
OS
macos
Python version
3.11.9
Browser
Edge
Browser version
No response
Error logs & Screenshots (if applicable)
No response
RCA:
The HTML tags corresponding to the sections in question are dl, dt, and dd (description list).
Currently these are handled by the handle_tag function in the custom html2text package (crawl4ai/html2text/__init__.py). The exact code causing this problem is as follows:
if tag == "dl" and start:
    self.p()
if tag == "dt" and not start:
    self.pbr()
if tag == "dd" and start:
    self.o(" ")
if tag == "dd" and not start:
    self.pbr()
The issue occurs because:
- After each dt ends, we add a line break (self.pbr())
- When dd starts, we only add indentation with no spacing control
- After each dd ends, we add another line break
- The self.p_p counter that controls paragraph breaks isn't being properly managed between terms and definitions
Fix Suggestions
Courtesy of Claude Sonnet, the following changes fix the markdown as expected (dt and dd, the term and its corresponding description, are positioned together rather than with the preceding or following descriptions):
if tag == "dl" and start:
    self.p()  # Add paragraph break before list starts
    self.p_p = 0  # Reset paragraph state
elif tag == "dt" and start:
    if self.p_p == 0:  # If not first term
        self.o("\n\n")  # Add spacing before new term-definition pair
        self.p_p = 0  # Reset paragraph state
elif tag == "dt" and not start:
    self.o("\n")  # Single newline between term and definition
elif tag == "dd" and start:
    self.o(" ")  # Indent definition
elif tag == "dd" and not start:
    self.p_p = 0
I have verified that the above produces the expected output. It just needs a little more refinement and testing.
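For anyone picking this up, the target layout can be pinned down with a small standalone sketch before touching handle_tag. This is not crawl4ai's html2text code: the class DefinitionListRenderer, the helper render_definition_list, the four-space indent, and the sample HTML are all made up for illustration, using only the standard library's html.parser. It only shows the grouping we want (each term directly above its definition, one blank line between pairs), which could serve as a reference when writing a regression test.

```python
from html.parser import HTMLParser


class DefinitionListRenderer(HTMLParser):
    """Toy renderer (hypothetical, not crawl4ai code): keeps each <dt>
    directly above its <dd> and separates pairs with one blank line."""

    def __init__(self):
        super().__init__()
        self.lines = []         # finished output lines
        self.current = None     # tag being collected: "dt", "dd", or None
        self.buffer = []        # text fragments of the current tag
        self.seen_term = False  # True once the first <dt> has been emitted

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return  # ignore inline tags nested inside <dt>/<dd>
        # Collapse runs of whitespace so source formatting does not leak through
        text = " ".join("".join(self.buffer).split())
        if tag == "dt":
            if self.seen_term:
                self.lines.append("")  # blank line before every pair but the first
            self.lines.append(text)
            self.seen_term = True
        else:
            self.lines.append("    " + text)  # assumed indent for the definition
        self.current = None


def render_definition_list(html):
    """Render the <dt>/<dd> pairs found in `html` as plain text."""
    renderer = DefinitionListRenderer()
    renderer.feed(html)
    renderer.close()
    return "\n".join(renderer.lines)


# Invented sample input, not copied from the Blender manual
sample = (
    "<dl>"
    "<dt>Display Mode</dt><dd>Choose what is shown.</dd>"
    "<dt>Filter</dt><dd>Restrict the <em>listed</em> items.</dd>"
    "</dl>"
)
print(render_definition_list(sample))
```

The sketch deliberately ignores nested description lists and block-level content inside dd; the real fix has to handle those through html2text's own p_p state.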
Call for contributors
We are on the lookout for talented open source contributors. This one is a simple fix. If you are a beginner, you can bag your first open source contribution (and we want that for you 😉). Comment "Interested" below and the issue will be assigned to you.
interested @aravindkarnam
@tautikAg Thanks for showing interest. Next release is by Feb-15th, so plan to raise a PR 2-3 days in advance.
@tautikAg Hi. Were you able to make progress on this?
Hey @aravindkarnam, I am testing right now. Will tag you on the PR soon (in a few hours).