udata
Bleach domain parsing in linkify faulty for some emails
The Bleach library used to sanitize Markdown seems to detect parts of a domain name as links in their own right.
Example: https://www.data.gouv.fr/fr/datasets/lignes-souterraines-du-reseau-rte-sur-le-territoire-de-la-mel/
Within the domain "rte-france.com", "-france.com" is seen as a valid link.
This should be fixed once https://github.com/mozilla/bleach/issues/60 is resolved; putting this on hold for now.
Reproduce in udata shell:
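>>> from functools import partial
>>> import bleach
>>> from bleach.linkifier import LinkifyFilter
>>> from flask import current_app
>>> html = '<p>Pour tous renseignements complémentaires sur ce jeu de données, écrivez à : rte-inspire-infos@rte-france.com</p>\n'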
>>> cleaner = bleach.Cleaner(
...     tags=current_app.config['MD_ALLOWED_TAGS'],
...     attributes=current_app.config['MD_ALLOWED_ATTRIBUTES'],
...     styles=current_app.config['MD_ALLOWED_STYLES'],
...     protocols=current_app.config['MD_ALLOWED_PROTOCOLS'],
...     strip_comments=False,
...     filters=[partial(LinkifyFilter, skip_tags=['pre'], parse_email=False,
...                      callbacks=callbacks)]
... )
>>> cleaner.clean(html)
'<p>Pour tous renseignements complémentaires sur ce jeu de données, écrivez à : rte-inspire-infos@rte<a href="http://-france.com">-france.com</a></p>\n'
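To spot this kind of regression without reading the HTML by eye, something like the following could be used. It is only a rough sketch using the standard library: `partial_domain_links`, `HrefCollector` and `EMAIL_RE` are hypothetical names, the email pattern is deliberately loose, and the sample address is the one that can be read back from the output above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Loose email pattern, for illustration only (not RFC 5322 compliant).
# Group 1 captures the domain part.
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def partial_domain_links(source_text, linkified_html):
    """Return hrefs whose host is only a trailing fragment of an email domain."""
    domains = EMAIL_RE.findall(source_text)
    collector = HrefCollector()
    collector.feed(linkified_html)
    suspicious = []
    for href in collector.hrefs:
        host = urlsplit(href).netloc
        # Flag hosts that are a strict suffix of a domain found in an email.
        if host and any(d != host and d.endswith(host) for d in domains):
            suspicious.append(href)
    return suspicious


# Input and output taken from the session above.
source = "écrivez à : rte-inspire-infos@rte-france.com"
output = (
    '<p>écrivez à : rte-inspire-infos@rte'
    '<a href="http://-france.com">-france.com</a></p>'
)
print(partial_domain_links(source, output))
```

A check like this only detects the symptom (a link built from the tail of an email domain); it says nothing about why Bleach produces it.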
So the first thing I noticed is that the version of Bleach we currently use is 3.1.0; we could try upgrading to the current 3.1.5, I guess.
I think they tried to fix this behaviour in Bleach here: https://github.com/sedrubal/bleach/commit/b6537008a61bee98a03eda309e6d26f77af34f9b
Some issues for later reading:
- https://github.com/mozilla/bleach/issues/60
- https://github.com/mozilla/bleach/issues/300
Seems relevant indeed :) . Can you try upgrading it locally?
I'm testing it locally as we speak... It also sounds fun from a learning standpoint: it could help me understand the Docker process a bit better.
I made a test page to check udata's behaviour with various ways of writing email addresses...
I also referenced some new issues I discovered while debugging this topic, all of which seem related to the way udata bleaches Markdown content (md -> html): #2496 #2497
Link with https://github.com/opendatateam/udata/issues/2498