crawl4ai
[Bug]: Crawled code blocks lose the space in `import xxx` statements
crawl4ai version
0.5.0.post2
Expected Behavior
When crawling code blocks from the Triton tutorial page https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py, the space between `import` and the package name is omitted.
The webpage contains several import statements:
import torch
import triton
import triton.language as tl
The crawled results should contain exactly the same code snippet.
Current Behavior
The crawled import statements come out as:
importtorch
importtriton
importtriton.languageastl
Is this reproducible?
Yes
Inputs Causing the Bug
A minimal script that reproduces the problem:
import asyncio
from crawl4ai import *

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py")
        # Print the extracted content
        print(result.markdown)

asyncio.run(main())
Steps to Reproduce
Run the script and look at the printed markdown: the `import` lines appear with the spaces removed.
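To make the check less dependent on eyeballing the output, the printed markdown could be scanned for the three import lines listed above. The following is only a sketch: `find_missing_imports` and `find_fused_imports` are hypothetical helper names, not part of crawl4ai, and the fused-import pattern is a rough heuristic that may also flag unrelated lines such as `importlib.reload(x)`.

```python
import re

# Import lines the tutorial page contains, as listed in this report.
EXPECTED_IMPORTS = [
    "import torch",
    "import triton",
    "import triton.language as tl",
]

# Heuristic: a whole line where `import` is glued to the next name.
_FUSED_IMPORT = re.compile(r"import[A-Za-z_]\S*")


def find_missing_imports(markdown, expected=EXPECTED_IMPORTS):
    """Return expected import statements that never appear as a full line."""
    lines = {line.strip() for line in markdown.splitlines()}
    return [stmt for stmt in expected if stmt not in lines]


def find_fused_imports(markdown):
    """Return lines such as `importtorch` where the space was dropped."""
    stripped = (line.strip() for line in markdown.splitlines())
    return [line for line in stripped if _FUSED_IMPORT.fullmatch(line)]
```

In the script above, `print(result.markdown)` could be followed by `print(find_missing_imports(result.markdown))`; an empty list would mean all three statements survived the crawl intact.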
Code snippets
OS
macOS
Python version
3.11.11
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response