Bleach domain parsing in linkify faulty for some emails

Open quaxsze opened this issue 5 years ago • 7 comments

The Bleach library used to sanitize Markdown seems to detect parts of domain names as links in their own right.

Example: https://www.data.gouv.fr/fr/datasets/lignes-souterraines-du-reseau-rte-sur-le-territoire-de-la-mel/

Within the domain "rte-france.com", "-france.com" is seen as a valid link.

This will be fixed once https://github.com/mozilla/bleach/issues/60 is resolved, so I'm putting this on hold for now.

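Reproduce in the udata shell:

>>> import bleach
>>> from functools import partial
>>> from bleach.linkifier import LinkifyFilter
>>> from flask import current_app
>>> # `callbacks` (the linkify callback list) must already be defined in the session
>>> html = '<p>Pour tous renseignements complémentaires sur ce jeu de données, écrivez à : rte-inspire-infos@rte-france.com</p>\n'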
>>> cleaner = bleach.Cleaner(
...     tags=current_app.config['MD_ALLOWED_TAGS'],
...     attributes=current_app.config['MD_ALLOWED_ATTRIBUTES'],
...     styles=current_app.config['MD_ALLOWED_STYLES'],
...     protocols=current_app.config['MD_ALLOWED_PROTOCOLS'],
...     strip_comments=False,
...     filters=[partial(LinkifyFilter, skip_tags=['pre'], parse_email=False,
...                         callbacks=callbacks)]
... )
>>> cleaner.clean(html)
'<p>Pour tous renseignements complémentaires sur ce jeu de données, écrivez à : rte-inspire-infos@rte<a href="http://-france.com">-france.com</a></p>\n'
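
For illustration only, here is a minimal sketch of how a word-boundary-based domain pattern can produce this kind of match. The patterns are simplified assumptions written for this example, not Bleach's actual regex, and `NAIVE_DOMAIN`, `STRICT_DOMAIN` and `find_domains` are made-up names.

```python
import re

# Hypothetical, simplified domain matcher (NOT Bleach's real pattern):
# start at a word boundary, refuse to start right after "@" or ".",
# then accept labels that may contain hyphens.
NAIVE_DOMAIN = re.compile(r"\b(?<![@.])(?:[\w-]+\.)+(?:com|fr|org)\b")

# Stricter variant, also an assumption: refuse to start anywhere inside a
# token, i.e. right after a word character, "@", "." or "-".
STRICT_DOMAIN = re.compile(r"(?<![\w@.-])(?:[\w-]+\.)+(?:com|fr|org)\b")


def find_domains(pattern, text):
    """Return every substring of `text` that `pattern` treats as a domain."""
    return [match.group(0) for match in pattern.finditer(text)]


text = "écrivez à : rte-inspire-infos@rte-france.com"

# The boundary between "rte" and "-" is a valid start for the naive pattern,
# so the tail of the email's domain is picked up on its own.
print(find_domains(NAIVE_DOMAIN, text))   # ['-france.com']

# The stricter lookbehind rejects every start position inside the address.
print(find_domains(STRICT_DOMAIN, text))  # []

# A standalone domain is still detected.
print(find_domains(STRICT_DOMAIN, "see rte-france.com for details"))  # ['rte-france.com']
```

The stricter lookbehind is only one possible guard; whether it suits a real linkifier depends on what else that linkifier has to match.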

quaxsze avatar Jan 22 '20 12:01 quaxsze

So the first thing I've noticed is that the version of Bleach we currently use is 3.1.0; we could try upgrading to the current 3.1.5, I guess.

JulienParis avatar May 29 '20 13:05 JulienParis

I think they tried to fix this behaviour in Bleach here: https://github.com/sedrubal/bleach/commit/b6537008a61bee98a03eda309e6d26f77af34f9b

JulienParis avatar May 29 '20 13:05 JulienParis

Some issues for later reading:

  • https://github.com/mozilla/bleach/issues/60
  • https://github.com/mozilla/bleach/issues/300

JulienParis avatar May 29 '20 13:05 JulienParis

Seems relevant indeed :) . Can you try upgrading it locally?

quaxsze avatar May 29 '20 13:05 quaxsze

> Seems relevant indeed :) . Can you try upgrading it locally?

I'm testing it locally as we speak... Pedagogically speaking it sounds fun, and it could help me understand the Docker process a bit better.

JulienParis avatar May 29 '20 13:05 JulienParis

I made a test page to check udata's behaviour with various ways of writing email addresses...

I also referenced some new issues I discovered while debugging this topic, all of which seem related to the way udata bleaches Markdown content (md -> html): #2496 #2497
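
A check of this kind could also be scripted. The sketch below is an assumption about how one might do it, not the actual test page: `SPLIT_EMAIL`, `is_split_email` and `check_samples` are hypothetical names, and `render` stands for whatever function turns Markdown into cleaned HTML.

```python
import re

# Heuristic: an "@" followed, inside the same whitespace-free token, by an
# opening <a> tag suggests the linkifier cut an email address in two.
SPLIT_EMAIL = re.compile(r"@[^\s<]*<a\s", re.IGNORECASE)


def is_split_email(rendered_html):
    """Return True if the rendered HTML looks like it contains a split email."""
    return SPLIT_EMAIL.search(rendered_html) is not None


def check_samples(render, samples):
    """Return the samples whose rendered output contains a split email.

    `render` is any callable mapping a source string to rendered HTML.
    """
    return [sample for sample in samples if is_split_email(render(sample))]


# Output reported at the top of this issue: flagged as split.
reported = (
    'rte-inspire-infos@rte<a href="http://-france.com">-france.com</a>'
)
print(is_split_email(reported))  # True

# A hypothetical, correctly linked address is not flagged.
print(is_split_email('<a href="mailto:a@b.com">a@b.com</a>'))  # False
```

The heuristic only looks for this one symptom, so it would need extending to catch other ways an address can be mangled.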

JulienParis avatar Jun 15 '20 11:06 JulienParis

Linked to https://github.com/opendatateam/udata/issues/2498

ThibaudDauce avatar Apr 09 '24 08:04 ThibaudDauce