scrapely

safehtml omits some important (all) attributes of tags

Open SirbitoX opened this issue 9 years ago • 2 comments

Suppose someone (like me) wants to keep an img tag, so the src attribute of that tag is important to them. But the safehtml() function omits all attributes of the tags it keeps. I think it would be better to keep the attributes of allowed_tags, or to add another parameter named allowed_attributes to specify which attributes to keep.

SirbitoX Aug 06 '15 06:08

Hi @SirbitoX. I was having a discussion about this last week, and we were thinking about adding a new, less strict version of safe html. The new type would sit somewhere between raw html and safe html, keeping img tags and possibly other tags too.

Other than img tags, what other tags would you add? Would you mind explaining your specific use case? Are you extracting articles, products, or leads?

ruairif Aug 06 '15 14:08

Hi @ruairif, I'm extracting articles and I keep all the images in the description of the scraped article, so I need the src attribute, and possibly the height and width attributes, of the img tag. I will probably want to keep embedded videos in the description too. But that wouldn't be an issue if we supported something like allowed_attributes.

SirbitoX Aug 06 '15 16:08
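
For illustration, the allowed_attributes idea discussed in this thread could look roughly like the sketch below. It is not scrapely's safehtml() and does not reuse its code: filter_html, ALLOWED_TAGS, ALLOWED_ATTRIBUTES, VOID_TAGS and PURGE_TAGS are hypothetical names, and the tag lists are assumptions chosen for the article use case. It relies only on html.parser from the Python standard library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical defaults for illustration; not scrapely's actual configuration.
ALLOWED_TAGS = {"p", "br", "strong", "em", "img"}
ALLOWED_ATTRIBUTES = {"img": {"src", "width", "height"}}  # per-tag whitelist
VOID_TAGS = {"br", "img"}          # tags that never get a closing tag
PURGE_TAGS = {"script", "style"}   # tags dropped together with their content


class _AttributeFilter(HTMLParser):
    def __init__(self, allowed_tags, allowed_attributes):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = allowed_tags
        self.allowed_attributes = allowed_attributes
        self.parts = []
        self._skip = 0  # > 0 while inside a purged tag

    def handle_starttag(self, tag, attrs):
        if tag in PURGE_TAGS:
            self._skip += 1
            return
        if self._skip or tag not in self.allowed_tags:
            return  # drop the tag itself, but keep any text inside it
        keep = self.allowed_attributes.get(tag, set())
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in keep
        )
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> the same way as <img ...>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in PURGE_TAGS:
            if self._skip:
                self._skip -= 1
            return
        if self._skip:
            return
        if tag in self.allowed_tags and tag not in VOID_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(escape(data, quote=False))


def filter_html(html, allowed_tags=ALLOWED_TAGS,
                allowed_attributes=ALLOWED_ATTRIBUTES):
    """Keep only allowed tags, and only the whitelisted attributes on them."""
    parser = _AttributeFilter(allowed_tags, allowed_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With these defaults, an img tag keeps src, width and height while attributes such as onclick or class are dropped, and tags outside the whitelist lose their markup but keep their text. Supporting embedded videos would mean adding the relevant tags and attributes to the two whitelists. The sketch does not validate attribute values (for example javascript: URLs), so it should not be read as a complete sanitizer.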