[Bug]: cleaned_html returned without classes and ids for elements.

Open igoralentyev opened this issue 1 month ago • 2 comments

crawl4ai version

0.7.4

Expected Behavior

Expected cleaned_html to keep class and id attributes on elements; without them I can't use the HTML for further analysis.

Current Behavior

No classes/ids

Is this reproducible?

Yes

Inputs Causing the Bug

from crawl4ai import CacheMode, CrawlerRunConfig

# excluded_selector is a CSS selector string defined elsewhere (not shown in this report)
crawler_config = CrawlerRunConfig(
    exclude_all_images=True,
    excluded_tags=['header', 'footer', 'meta', 'script', 'style'],
    excluded_selector=excluded_selector,
    remove_overlay_elements=False,
    keep_data_attributes=True,
    # Wait 5 seconds before the HTML is captured (log first, then return the promise)
    wait_for="js:() => { console.log('Waiting for 5 seconds'); return new Promise(resolve => setTimeout(() => resolve(true), 5000)); }",
    # delay_before_return_html=3,
    locale="en-US",
    magic=True,
    cache_mode=CacheMode.DISABLED,
)

Steps to Reproduce


Code snippets


OS

Linux

Python version

3.12

Browser

Default browser; none specified

Browser version

No response

Error logs & Screenshots (if applicable)

No response

igoralentyev, Nov 10 '25 13:11

Actually, it's pretty easy to fix/patch.

You just need to change IMPORTANT_ATTRS at line 50 of the library's config.py file.

Like this:

IMPORTANT_ATTRS = ["src", "href", "class", "id"] # Modified: removed alt, title, width, height - added class, id

Result: this fix, paired with keep_data_attributes=False, returns really clean HTML with classes and ids.
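
To illustrate why the whitelist change matters, here is a minimal standalone sketch of attribute filtering using only the Python standard library. It is not crawl4ai's implementation: AttrFilter and filter_attrs are hypothetical names, and DEFAULT_ATTRS is inferred from the comment on the patched line above rather than read from the library. It only assumes that the cleaner drops every attribute not in IMPORTANT_ATTRS, optionally keeping data-* attributes.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists: the default is inferred from the patch comment above,
# not copied from crawl4ai's config.py.
DEFAULT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
PATCHED_ATTRS = ["src", "href", "class", "id"]


class AttrFilter(HTMLParser):
    """Re-emit HTML, keeping only whitelisted attributes on each tag."""

    def __init__(self, allowed, keep_data_attributes=False):
        super().__init__(convert_charrefs=False)
        self.allowed = set(allowed)
        self.keep_data = keep_data_attributes
        self.out = []

    def _keep(self, name):
        # An attribute survives if it is whitelisted, or if it is a data-*
        # attribute and keep_data_attributes is enabled.
        return name in self.allowed or (self.keep_data and name.startswith("data-"))

    def handle_starttag(self, tag, attrs):
        kept = "".join(
            f' {k}="{escape(v or "", quote=True)}"' for k, v in attrs if self._keep(k)
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def filter_attrs(html, allowed, keep_data_attributes=False):
    """Return html with every non-whitelisted attribute removed."""
    parser = AttrFilter(allowed, keep_data_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div id="main" class="card" data-x="1" style="color:red">'
    '<a href="/a" title="t">link</a></div>'
)
print(filter_attrs(sample, DEFAULT_ATTRS))
print(filter_attrs(sample, PATCHED_ATTRS))
```

With the assumed default list, the div loses its id and class, which matches the behavior reported in this issue; with the patched list they are kept, while title and style are dropped.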

igoralentyev, Nov 10 '25 14:11

Hello @igoralentyev, could you fix it and send a pull request, if possible?

Ahmed-Tawfik94, Nov 12 '25 02:11