[Bug]: Markdown output has incorrect spacing.

Open dkampien opened this issue 11 months ago • 5 comments

crawl4ai version

0.4.247

Expected Behavior

I'm trying to scrape a page from the Blender manual at https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html

The markdown should look a little more like this (scraped with jina-ai):

Image

Notice the spacing between paragraphs.

Current Behavior

Instead it messes up the spacing like so:

Image

Notice that the spacing between paragraphs is wrong. LLMs can pick up on this paragraph proximity and misread which text belongs together.

Is there any setting in CrawlerRunConfig that I should know about that can fix this? @aravindkarnam @unclecode

Is this reproducible?

Yes

Inputs Causing the Bug

https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html

Steps to Reproduce


Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig


async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector="#furo-main-content",
        excluded_selector=".toc-drawer, a.headerlink"
    )   # Crawl run configuration: bypass the cache and scope to the main content


    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html",
            config=run_config
        )
        
        # Export to markdown file
        with open('output.md', 'w', encoding='utf-8') as f:
            f.write(result.markdown)  # Write markdown content to file

if __name__ == "__main__":
    asyncio.run(main())

OS

macOS

Python version

3.11.9

Browser

Edge

Browser version

No response

Error logs & Screenshots (if applicable)

No response

dkampien avatar Feb 01 '25 01:02 dkampien

RCA:

The HTML tags corresponding to the sections in question are the dl, dt and dd tags (a description list).

Image
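For readers unfamiliar with description lists, here is a minimal sketch of how dt (term) and dd (description) pair up inside a dl. It uses only Python's standard library; the sample HTML and the `DescriptionListParser` name are made up for illustration and are not taken from the Blender page or from crawl4ai.

```python
from html.parser import HTMLParser

# Made-up sample markup, only to illustrate the dl/dt/dd structure.
SAMPLE_HTML = """
<dl>
  <dt>Term one</dt>
  <dd><p>Description of term one.</p></dd>
  <dt>Term two</dt>
  <dd><p>Description of term two.</p></dd>
</dl>
"""


class DescriptionListParser(HTMLParser):
    """Collect (term, description) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # "dt", "dd", or None when outside both
        self._term = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current = tag
            self._buf = []

    def handle_data(self, data):
        if self._current:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # ignore nested tags such as </p>
        text = "".join(self._buf).strip()
        if tag == "dt":
            self._term = text
        else:
            self.pairs.append((self._term, text))
        self._current = None


parser = DescriptionListParser()
parser.feed(SAMPLE_HTML)
for term, description in parser.pairs:
    print(f"{term}: {description}")
```

Each description belongs to the term immediately before it, which is the grouping the Markdown output should preserve.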

These are currently handled by the handle_tag function of the custom html2text package (crawl4ai/html2text/__init__.py). The exact code causing the problem is as follows:

        if tag == "dl" and start:
            self.p()
        if tag == "dt" and not start:
            self.pbr()
        if tag == "dd" and start:
            self.o("    ")
        if tag == "dd" and not start:
            self.pbr()

The issue occurs because:

  1. After each dt ends, we add a line break (self.pbr())
  2. When dd starts, we only add indentation with no spacing control
  3. After each dd ends, we add another line break
  4. The self.p_p counter that controls paragraph breaks isn't being properly managed between terms and definitions
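The interaction described above can be sketched with a toy model. This is not crawl4ai's real html2text code: `ToyWriter`, `render` and `EVENTS` are hypothetical names, and the model assumes that p() requests two newlines, pbr() requests one newline only if none is pending, and o() flushes pending newlines before writing. It also assumes each dd wraps its text in a p element, which may or may not match the page in question.

```python
class ToyWriter:
    """Toy model of a pending-newline counter (NOT the real html2text code)."""

    def __init__(self):
        self.p_p = 0      # newlines owed before the next piece of output
        self.parts = []

    def p(self):
        # Request a paragraph break (two newlines).
        self.p_p = 2

    def pbr(self):
        # Request a single line break unless a break is already pending.
        if self.p_p == 0:
            self.p_p = 1

    def o(self, data):
        # Flush the owed newlines, then write the data.
        self.parts.append("\n" * self.p_p + data)
        self.p_p = 0

    def handle_tag(self, tag, start):
        # Same dl/dt/dd branches as the snippet quoted above.
        if tag == "dl" and start:
            self.p()
        if tag == "dt" and not start:
            self.pbr()
        if tag == "dd" and start:
            self.o("    ")
        if tag == "dd" and not start:
            self.pbr()
        # Assumption: <p> start and end both request a paragraph break.
        if tag == "p":
            self.p()


def render(events):
    """Feed (tag, is_start) tuples and text strings through the toy writer."""
    writer = ToyWriter()
    for event in events:
        if isinstance(event, tuple):
            writer.handle_tag(*event)
        else:
            writer.o(event)
    return "".join(writer.parts)


# Events for: <dl><dt>Term A</dt><dd><p>About A.</p></dd>
#                 <dt>Term B</dt><dd><p>About B.</p></dd></dl>
EVENTS = [
    ("dl", True),
    ("dt", True), "Term A", ("dt", False),
    ("dd", True), ("p", True), "About A.", ("p", False), ("dd", False),
    ("dt", True), "Term B", ("dt", False),
    ("dd", True), ("p", True), "About B.", ("p", False), ("dd", False),
    ("dl", False),
]

print(repr(render(EVENTS)))
```

In this toy model the indentation is written on a line of its own, and each description ends up separated from its term by the same kind of blank line that separates it from the next term.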

Fix Suggestions

Courtesy of Claude Sonnet, the following changes fix the Markdown as expected (each dt and its dd, i.e. the term and its description, are positioned together rather than next to the preceding or following description):

        if tag == "dl" and start:
            self.p()  # Add paragraph break before list starts
            self.p_p = 0  # Reset paragraph state
        
        elif tag == "dt" and start:
            if self.p_p == 0:  # If not first term
                self.o("\n\n")  # Add spacing before new term-definition pair
            self.p_p = 0  # Reset paragraph state
        
        elif tag == "dt" and not start:
            self.o("\n")  # Single newline between term and definition
        
        elif tag == "dd" and start:
            self.o("    ")  # Indent definition
        
        elif tag == "dd" and not start:
            self.p_p = 0

I have verified that the code above produces the expected output. It just needs a little more refinement and testing.
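As a starting point for that testing, here is a small sketch of a check that could be run against the generated Markdown. The helper name `find_detached_definitions` and the sample strings are hypothetical, and the check assumes that descriptions are rendered as lines indented by four spaces directly under their term.

```python
def find_detached_definitions(markdown, indent="    "):
    """Return 1-based line numbers of indented description lines that are
    preceded by a blank line (or nothing) instead of a term or another
    description line."""
    lines = markdown.splitlines()
    detached = []
    for number, line in enumerate(lines, start=1):
        is_definition = line.startswith(indent) and line.strip() != ""
        if not is_definition:
            continue
        previous = lines[number - 2] if number > 1 else ""
        if previous.strip() == "":
            detached.append(number)
    return detached


# Made-up samples: GOOD keeps each description under its term, BAD does not.
GOOD = "Term A\n    About A.\n\nTerm B\n    About B.\n"
BAD = "Term A\n\n    About A.\nTerm B\n\n    About B.\n"

print(find_detached_definitions(GOOD))
print(find_detached_definitions(BAD))
```

One limitation: Markdown also uses four-space indentation for code blocks, so a page containing those would need a stricter check.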


Call for contributors

We are on the lookout for talented open source contributors. This one is a simple fix. If you are a beginner, you can bag your first open source contribution (and we want that for you 😉). Comment "Interested" below and the issue will be assigned to you.

aravindkarnam avatar Feb 02 '25 16:02 aravindkarnam

interested @aravindkarnam

tautik avatar Feb 02 '25 23:02 tautik

@tautikAg Thanks for showing interest. The next release is due by Feb 15th, so please plan to raise a PR 2-3 days in advance.

aravindkarnam avatar Feb 03 '25 00:02 aravindkarnam

@tautikAg Hi. Were you able to make progress on this?

aravindkarnam avatar Feb 10 '25 04:02 aravindkarnam

Hey @aravindkarnam, I am testing right now. Will tag you on the PR soon (in a few hours).

tautik avatar Feb 11 '25 07:02 tautik