crawl4ai Base64 image format not parsed

Open ZhangTianrong opened this issue 1 year ago • 1 comments

In WebScrappingStrategy, images are extracted based on their score from score_image_for_usefulness, which offers very few ways to earn points. Having a preferred format name is probably the easiest of them, but base64 images are excluded because their format names are never parsed. A simple change like

                image_src = img.get('src','')
                if "data:image/" in image_src:
                    image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
                else:
                    image_format = os.path.splitext(image_src)[1].lower()

seems enough to fix this.
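
For illustration, here is a minimal standalone sketch of the same idea. The helper name guess_image_format is hypothetical and not part of crawl4ai. It assumes the src value is either a data URI of the form data:image/<subtype>[;params],<payload> or an ordinary URL or file path. Unlike the snippet above, it also strips the leading dot and any query string or fragment, so that both branches return the same kind of value; whether crawl4ai itself needs that depends on the surrounding code, which is not shown here.

```python
import os


def guess_image_format(image_src: str) -> str:
    """Return a lowercase format name such as 'png', or '' if none is found.

    Hypothetical helper for illustration; not a crawl4ai API.
    """
    if image_src.startswith("data:image/"):
        # Data URI layout: data:image/<subtype>[;param...][;base64],<payload>
        header = image_src.split(",", 1)[0]       # 'data:image/png;base64'
        media_type = header.split(";", 1)[0]      # 'data:image/png'
        return media_type.split("/", 1)[1].lower()

    # Ordinary URL or path: drop query string and fragment, then read the extension.
    path = image_src.split("?", 1)[0].split("#", 1)[0]
    return os.path.splitext(path)[1].lower().lstrip(".")


if __name__ == "__main__":
    for src in (
        "data:image/png;base64,iVBORw0KGgo=",
        "https://example.com/a/photo.JPG?w=200",
        "",
    ):
        print(repr(src[:30]), "->", repr(guess_image_format(src)))
```

Note that a data URI can report a subtype such as svg+xml, so any list of preferred formats would have to account for names that do not match a file extension exactly.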

Background: I would like crawl4ai to be able to both process local HTML files and crawl with Playwright, the way Wallabag or Omnivore can. I have local HTML files downloaded with SingleFile, which keeps a copy of whatever is rendered in the browser in a WYSIWYG manner and encodes images in Base64 to keep the resulting file portable. However, crawl4ai won't extract images that are in base64.

Oct 19 '24 04:10 ZhangTianrong

@ZhangTianrong Thanks for using our library. You're absolutely right; this is something we missed. Thank you for the suggestion, it's a very good one. I have already added it to the library, and it will go out in the new version 0.3.72 very soon, hopefully by tonight or tomorrow. Thank you so much.

Oct 20 '24 11:10 unclecode

The version that supports base64 images is now released.

Jan 22 '25 12:01 aravindkarnam
