crawl4ai

fix: Add newline before pre codeblock start

Open jtanningbed opened this issue 11 months ago • 2 comments

Sometimes there is additional text immediately before the 'pre' code block begins: image

This results in malformed generated Markdown: image

This change just adds a newline before the start of the code block, which should be fairly inconsequential.

New result: image

There's another issue with handling whitespace that is defined like <span class="w"> </span>. It seems the parser sends the entire content to handle_data when the code block looks something like this:

<code>
    "python"
    <span class="w"> </span>
    "main.py"
</code>

The HTML will be parsed with handle_data("pythonmain.py"), so the whitespace is not preserved. Fixing that seems like it would require updating or overriding the underlying HTMLParser logic, so I didn't try to address it myself. I just wanted to raise it in case it wasn't known. Example: image
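For comparison, Python's standard-library `html.parser.HTMLParser` reports the text on either side of an inline tag in separate `handle_data` calls, so a whitespace-only span arrives as its own chunk. The sketch below is not crawl4ai's parser: `CodeTextCollector` and `code_text` are hypothetical names, and it assumes the space is lost in a later stripping step rather than in the tokenizer. It only illustrates that keeping whitespace-only chunks inside `<code>` would preserve the separator.

```python
from html.parser import HTMLParser


class CodeTextCollector(HTMLParser):
    """Hypothetical helper: collects text inside <code>, whitespace included."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # nesting depth of <code> elements
        self.chunks = []  # every handle_data payload seen inside <code>

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Do not strip or skip whitespace-only data while inside <code>.
        if self.depth:
            self.chunks.append(data)


def code_text(html):
    """Return the concatenated text found inside <code> elements."""
    parser = CodeTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Calling `code_text('<code>python<span class="w"> </span>main.py</code>')` keeps the space between the two tokens, which suggests that an override in the converter's own `handle_data` path might be enough.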

jtanningbed, Jan 15 '25 22:01

I would like to propose a more robust solution:

if tag == 'pre':
    if start:
        # Always start a new code block with a new line. Otherwise, it may not be rendered correctly.
        if not self.lastWasNL:
            self.o('\n')
        self.o('```\n')  # Markdown code block start
        self.inside_pre = True
    else:
        # Avoid adding unnecessary new lines at the end of the code block.
        if not self.lastWasNL:
            self.o('\n')
        self.o('```\n')  # Markdown code block end
        self.inside_pre = False
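
The branch logic above can be exercised in isolation with a toy writer. This is a minimal sketch, not html2text's or crawl4ai's actual class: `FenceWriter`, `handle_pre`, `render`, and `last_was_nl` are hypothetical stand-ins for the `self.o` and `self.lastWasNL` members used in the snippet, and an empty buffer is assumed to count as being at the start of a line.

```python
FENCE = "`" * 3  # Markdown code fence, built this way to avoid a literal fence


class FenceWriter:
    """Toy output buffer that tracks whether the last write ended a line."""

    def __init__(self):
        self.parts = []
        self.last_was_nl = True  # assumption: empty output is at line start
        self.inside_pre = False

    def o(self, text):
        # Append text and remember whether it ended with a newline.
        if text:
            self.parts.append(text)
            self.last_was_nl = text.endswith("\n")

    def handle_pre(self, start):
        # Emit a newline only when the fence would otherwise share a line.
        if not self.last_was_nl:
            self.o("\n")
        self.o(FENCE + "\n")
        self.inside_pre = start

    def result(self):
        return "".join(self.parts)


def render(prefix, code):
    """Write prefix text, then a fenced block containing code."""
    writer = FenceWriter()
    writer.o(prefix)
    writer.handle_pre(start=True)
    writer.o(code)
    writer.handle_pre(start=False)
    return writer.result()
```

With stray text such as `render("Copy", "print(1)")`, the opening fence lands on its own line; when the prefix or the code already ends with a newline, no extra blank line is added.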

dmurat, Jan 24 '25 08:01

@jtanningbed @dmurat Thanks for sharing. This seems important to look into. I will check it very soon.

unclecode, Jan 25 '25 12:01