crawl4ai
[Bug]: cleaned_html returned without classes and ids for elements.
crawl4ai version
0.7.4
Expected Behavior
I expected cleaned_html to keep the class and id attributes on elements; without them I can't use the HTML for further analysis.
Current Behavior
cleaned_html contains no class or id attributes.
Is this reproducible?
Yes
Inputs Causing the Bug
crawler_config = CrawlerRunConfig(
    exclude_all_images=True,
    excluded_tags=['header', 'footer', 'meta', 'script', 'style'],
    excluded_selector=excluded_selector,  # Add excluded_selector support
    remove_overlay_elements=False,
    keep_data_attributes=True,
    wait_for="js:() => { return new Promise(resolve => setTimeout(() => resolve(true), 5000)); console.log('Waiting for 5 seconds'); }",
    # delay_before_return_html=3,
    locale="en-US",
    magic=True,
    cache_mode=CacheMode.DISABLED,
)
Steps to Reproduce
Code snippets
OS
Linux
Python version
3.12
Browser
Default (none specified)
Browser version
No response
Error logs & Screenshots (if applicable)
No response
Actually, it's pretty easy to patch.
You just need to change IMPORTANT_ATTRS at line 50 of the library's config.py file.
Like this:
IMPORTANT_ATTRS = ["src", "href", "class", "id"] # Modified: removed alt, title, width, height - added class, id
Result: this fix, paired with keep_data_attributes=False, returns really clean HTML with classes and ids.
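To illustrate why changing the whitelist has this effect, here is a minimal standalone sketch of attribute filtering. It uses only the Python standard library; the AttrFilter class and filter_attributes helper are hypothetical names made up for this example, and the filtering logic is an assumption about how such a whitelist is typically applied, not crawl4ai's actual code. The two lists come from the patch above (the default is inferred from its "removed alt, title, width, height" comment).

```python
from html import escape
from html.parser import HTMLParser

# Whitelists taken from the patch above. The default list is inferred from
# the "removed alt, title, width, height" comment on the patched line.
DEFAULT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
PATCHED_ATTRS = ["src", "href", "class", "id"]


class AttrFilter(HTMLParser):
    """Hypothetical helper: re-emits HTML, dropping non-whitelisted attributes."""

    def __init__(self, keep_attrs, keep_data_attributes=False):
        super().__init__(convert_charrefs=True)
        self.keep_attrs = set(keep_attrs)
        self.keep_data_attributes = keep_data_attributes
        self.parts = []

    def _render_attrs(self, attrs):
        # Keep whitelisted attributes, plus data-* ones when requested.
        kept = []
        for name, value in attrs:
            is_data = self.keep_data_attributes and name.startswith("data-")
            if name in self.keep_attrs or is_data:
                kept.append(f' {name}="{escape(value or "", quote=True)}"')
        return "".join(kept)

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}{self._render_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def filter_attributes(html, keep_attrs, keep_data_attributes=False):
    parser = AttrFilter(keep_attrs, keep_data_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div id="main" class="card" data-x="1" title="t">'
    '<a href="/a" class="link">go</a></div>'
)

print(filter_attributes(sample, DEFAULT_ATTRS))
# <div title="t"><a href="/a">go</a></div>
print(filter_attributes(sample, PATCHED_ATTRS))
# <div id="main" class="card"><a href="/a" class="link">go</a></div>
```

With the default list, id and class are stripped, which matches the behavior reported in this issue; with the patched list they survive, and data-* attributes are dropped unless keep_data_attributes is enabled.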
Hello @igoralentyev, could you fix this and send a pull request, if possible?