sanitize
URL fragment identifiers containing colons are stripped even when relative URLs are allowed
I'm cleaning some HTML with the Sanitize gem. I want the href attribute of anchor tags like the following to be preserved:
<a href="#fn:1">1</a>
This is required for implementing footnotes using the Kramdown gem.
However, Sanitize doesn't appear to like the colon inside the href attribute. It simply outputs <a>1</a> instead, dropping the href attribute altogether.
My sanitize code looks like this:
# Set up the whitelist of HTML elements, attributes, and protocols that are allowed.
allowed_elements = ['h2', 'a', 'img', 'p', 'ul', 'ol', 'li', 'strong', 'em', 'cite',
                    'blockquote', 'code', 'pre', 'dl', 'dt', 'dd', 'br', 'hr', 'sup', 'div']
allowed_attributes = {'a' => ['href', 'rel', 'rev'], 'img' => ['src', 'alt'],
                      'sup' => ['id'], 'div' => ['class'], 'li' => ['id']}
allowed_protocols = {'a' => {'href' => ['http', 'https', 'mailto', :relative]}}
# Clean the text of any unwanted HTML tags.
html = Sanitize.clean(html, :elements => allowed_elements, :attributes => allowed_attributes,
                      :protocols => allowed_protocols)
Is there a way to get Sanitize to accept a colon in the href attribute?
This issue is a duplicate of this Stack Overflow question.
Answered on Stack Overflow. Repeating here for posterity.
This is Sanitize doing the safest thing by default. It assumes that the portion of the URL before the : is a protocol (or a scheme in the terminology of RFC 1738), and since #fn isn't in the protocol whitelist, the entire href attribute is removed.
You can allow URLs like this by adding #fn to the protocol whitelist:
allowed_protocols = {'a' => {'href' => ['#fn', 'http', 'https', 'mailto', :relative]}}
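To make the reasoning above concrete, here is a minimal sketch of the kind of check being described. It is a simplified illustration in plain Ruby, not Sanitize's actual implementation; the helper name href_allowed? is hypothetical and is not part of the gem.

```ruby
# Simplified illustration only; this is NOT Sanitize's real code.
# href_allowed? is a hypothetical helper, not part of the gem's API.
def href_allowed?(href, protocols)
  # Treat everything before the first colon as the protocol (scheme).
  prefix, separator, _rest = href.partition(':')

  # No colon at all: the URL is relative, so it passes only if :relative is listed.
  return protocols.include?(:relative) if separator.empty?

  # Otherwise the prefix must appear in the protocol whitelist.
  protocols.include?(prefix.downcase)
end

allowed = ['http', 'https', 'mailto', :relative]

# "#fn" is read as the protocol and is not in the list, so the href is rejected.
puts href_allowed?('#fn:1', allowed)

# With "#fn" added to the list, the same href is accepted.
puts href_allowed?('#fn:1', ['#fn'] + allowed)
```

Under this simplified model, a fragment such as #fn:1 is indistinguishable from a URL with an unknown scheme, which is why whitelisting #fn lets it through.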
I found this while troubleshooting an issue where Sanitize strips out the : character in the href attribute. I have a document with a bookmark whose id contains the : character and an href that points to it (e.g. href="#my:id"). Seeing as : is a valid character for id in HTML5, would it be safe for Sanitize to leave the : in place for links that begin with a # character?