Issues in image processing logic in content_scraping_strategy.py

Open dmurat opened this issue 1 year ago • 1 comment

Issue 1: Index Out of Bounds for Relative Image URLs

When processing images with exclude_external_images=True, the code splits each image URL on '/' to compare domains, but this fails for relative URLs (e.g., 'assets/logo.svg'). The current implementation assumes every URL has at least 3 segments when split by '/', so reading index 2 of a relative URL raises an IndexError (list index out of range).

Current code: line 398 in content_scraping_strategy.py

    src_url_base = src.split('/')[2]  # Fails for relative URLs like 'assets/logo.svg'

Steps to reproduce:

1. Use crawl4ai to scrape a page containing relative image URLs
2. Set exclude_external_images=True
3. Observe the error in logs: "Error processing element: exceptions must derive from BaseException"

For example:

    ...
    async with AsyncWebCrawler(verbose=True) as crawler:
        crawl_result = await crawler.arun(
            url="https://docs.astral.sh/uv/",
            exclude_external_links=True,
            exclude_external_images=True, 
            magic=True,
            cache_mode=CacheMode.BYPASS,
            verbose=True,
        )
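
One possible way to avoid the index error is to resolve every image URL against the page URL before comparing hosts. The following is only a minimal sketch using the standard library, not the library's actual fix. The helper names image_host and is_external_image are hypothetical and do not exist in crawl4ai.

```python
from urllib.parse import urljoin, urlparse


def image_host(src: str, page_url: str) -> str:
    """Hypothetical helper: host an image would be fetched from."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against the page URL, so no positional indexing into segments is needed.
    return urlparse(urljoin(page_url, src)).netloc


def is_external_image(src: str, page_url: str) -> bool:
    """Hypothetical helper: True when the image host differs from the page host."""
    return image_host(src, page_url) != urlparse(page_url).netloc


page = "https://docs.astral.sh/uv/"
print(is_external_image("assets/logo.svg", page))            # relative URL
print(is_external_image("https://example.com/a.png", page))  # absolute URL, other host
```

With this approach a relative src such as 'assets/logo.svg' resolves to the page's own host and is treated as internal instead of raising.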

Issue 2: Incorrect Exception Raising

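The error handling code raises a string instead of an exception object. In Python 3 this is invalid: raise only accepts an instance or subclass of BaseException, so the raise statement itself fails with a TypeError: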

Line 418 in content_scraping_strategy.py

    except Exception as e:
        raise "Error processing images"

This results in the log error message "[SCRAPE].. ◆ Error processing element: exceptions must derive from BaseException", which hides the original error (the IndexError from Issue 1).
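
A minimal sketch of what a corrected handler could look like, assuming the intent is to report a failure while keeping the original cause. The function process_images and the choice of RuntimeError are illustrative assumptions, not the project's code.

```python
def process_images(images):
    """Hypothetical stand-in for the image-processing step."""
    try:
        return [img["src"] for img in images]
    except Exception as e:
        # Raise an exception instance and chain it to the original error,
        # so the real cause stays visible in the traceback.
        raise RuntimeError("Error processing images") from e


# Raising a plain string fails with TypeError before the message is ever used.
try:
    raise "Error processing images"
except TypeError as err:
    print(err)

# The corrected handler surfaces both the message and the underlying cause.
try:
    process_images([{"alt": "no src key"}])
except RuntimeError as err:
    print(f"{err} (caused by {err.__cause__!r})")
```

Chaining with "from e" keeps the underlying exception attached as __cause__, which would have made the IndexError from Issue 1 visible in the logs.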

I'm a Python newbie, so my observations may be wrong. Please take this into account. Thanks.

Dec 13 '24 10:12 dmurat

@dmurat Thank you for your close attention to the code base and your speculation. I'm going to add that to the backlog, and by tomorrow, we'll definitely check it and see what's wrong. Thanks for sharing your code sample as well.

Dec 13 '24 13:12 unclecode