[Bug]: Incorrect format of crawled `import xxx` code

Open haoyang9804 opened this issue 6 months ago • 2 comments

crawl4ai version

0.5.0.post2

Expected Behavior

When crawling code blocks from the Triton tutorial page (https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py), the space between `import` and the package name is dropped.

The webpage contains several import statements:

import torch

import triton
import triton.language as tl

The crawled result should contain exactly the same code snippet.

Current Behavior

The crawled output for the import statements is:

importtorch
importtriton
importtriton.languageastl
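
To make the difference between the expected and current output easy to check, here is a minimal sketch of a detector for fused import lines. The helper name `find_fused_imports` and the regular expression are illustrative assumptions, not part of crawl4ai, and the pattern is only a heuristic: it flags any line consisting of `import` glued directly to an identifier.

```python
import re

# Heuristic (assumption): a line made up of "import" immediately followed by
# an identifier, with no whitespace in between, is treated as a fused import.
FUSED_IMPORT = re.compile(r"^\s*import[A-Za-z_][\w.]*$")


def find_fused_imports(markdown: str) -> list[str]:
    """Return the stripped lines that look like imports with missing spaces."""
    return [
        line.strip()
        for line in markdown.splitlines()
        if FUSED_IMPORT.match(line)
    ]


# The snippet as it appears on the webpage.
expected = "import torch\n\nimport triton\nimport triton.language as tl\n"
# The snippet as reported in the crawled output.
crawled = "importtorch\nimporttriton\nimporttriton.languageastl\n"

print(find_fused_imports(expected))
print(find_fused_imports(crawled))
```

The same helper could be applied to `result.markdown` from the reproduction script; a non-empty return value would indicate the bug is present.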

Is this reproducible?

Yes

Inputs Causing the Bug

A simple reproducible script:


import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py")

        # Print the extracted content
        print(result.markdown)

asyncio.run(main())

Steps to Reproduce

Run the script and inspect the printed markdown; the `import` statements appear without spaces.

Code snippets


OS

macOS

Python version

3.11.11

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

haoyang9804, Jun 04 '25 03:06