loofah
allow custom scrubbers to leverage the HTML5lib scrubbing already written
A couple of commonly requested features:
- add or remove attributes from the whitelists
- turn off CSS scrubbing
+1 on this ticket / request. I wanted more custom control over the elements and attributes in the whitelist set, and I had to achieve it like so:
http://gist.github.com/289027
I'm trying to find a good way to add to the whitelist attributes right now and am coming up empty on a straightforward way to monkeypatch. I just want to add a single element, but it seems excessively hard: whitelist.rb declares the constants and then digests them permanently via a method in the same file, so I can't even seem to monkeypatch it.
I hear you! I'll be working on Loofah a bit over the next couple of weeks, and this will be one of the things I'll work on.
fwiw, I did figure out how to monkeypatch it. Just add a new key/value to the HashedWhitelist. But of course it's always a tad nicer when one doesn't need to monkeypatch.
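For readers following along, here is a minimal sketch of why editing the source list has no effect once it has been digested, while adding a key to the derived hash does. `DemoWhitelist`, `ATTRIBUTE_NAMES`, `HASHED_ATTRIBUTES` and `allowed?` are hypothetical stand-ins invented for this illustration; they are not Loofah's actual internals.

```ruby
# Hypothetical stand-in for a whitelist module. All names are illustrative
# assumptions and do not mirror Loofah's real constants.
module DemoWhitelist
  # Source list, declared once.
  ATTRIBUTE_NAMES = %w[href title class]

  # "Digested" copy built at load time so lookups are a single hash access.
  HASHED_ATTRIBUTES = ATTRIBUTE_NAMES.each_with_object({}) do |name, hash|
    hash[name] = true
  end

  # The lookup consults only the derived hash, never the source list.
  def self.allowed?(name)
    HASHED_ATTRIBUTES.key?(name)
  end
end

# Appending to the source list after load does not help: the hash was
# already built from the old contents.
DemoWhitelist::ATTRIBUTE_NAMES << "data-id"
DemoWhitelist.allowed?("data-id") # => false

# Adding a key/value to the derived hash changes what the lookup sees.
DemoWhitelist::HASHED_ATTRIBUTES["data-id"] = true
DemoWhitelist.allowed?("data-id") # => true
```

The same reasoning suggests why a public API for extending the list would be nicer: callers would not need to know which of the two structures the lookup reads.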
Any thoughts or progress on this? I need to add and remove some whitelist attributes.
Just released 1.0.0; this is probably my next priority.
Any thoughts on what you think the API should look like to control whitelists?
Just FYI, I have some almost-complete work on a whitelist for elements and attributes (the use case of a valid element with a nested invalid element that in turn contains a nested valid element is still broken): https://github.com/bf4/Notes/blob/master/code/ruby/html_processing.rb When it's ready, I'll open a pull request.
It's worth noting that I've got a branch somewhere that I started, which implements a Rails-internals-compatible implementation of whitelists. This is so that, at some point, Loofah may be a pluggable sanitizer for any Rails app.
I should probably finish that up. ;)
I still need to write a pull request, but the WhiteListTagScrubber really does work: https://github.com/bf4/Notes/blob/loofah-testing/code/ruby/html_processing.rb
# usage
#   all_attributes = ['id','class']
#   tags_we_want =
#     {
#       'br' => [],
#       'ol' => all_attributes,
#       'ul' => all_attributes,
#       'li' => all_attributes,
#       'strong' => all_attributes,
#       'p' => all_attributes,
#       'i' => all_attributes,
#       'em' => all_attributes,
#       'a' => ['href','rel'].concat(all_attributes)
#     }
#   updater = CustomScrubber.new
#   updater.clean_html(message_dirty, tags_we_want.keys, tags_we_want) do |html|
#     updater.line_breaks_to_br(html)
#   end
class WhiteListTagScrubber < Loofah::Scrubber
  attr_reader :tags, :attributes

  def initialize(options = {}, &block)
    @tags = Array(options.delete(:tags))
    @attributes = options.delete(:attributes) || {}
    super(options, &block)
  end

  def debug(type, &block)
    if ENV['DEBUG'] =~ /true/i
      puts "**** #{type}, #{block.call.inspect}"
    end
  end

  def scrub(node)
    debug("processing") { "#{node.type}: #{node.name}, namespaces #{node.namespaces.inspect}" }
    case node.type
    when Nokogiri::XML::Node::ELEMENT_NODE
      # see strip: return CONTINUE if html5lib_sanitize(node) == CONTINUE
      if tags.include? node.name
        # remove all attributes except the ones we whitelisted per tag
        clean_with_attributes(node, true)
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      else
        # remove all attributes
        clean_with_attributes(node, false)
        # remove the node and its contents entirely.
        # there's nothing good in these
        if %w{script style meta link}.include?(node.name)
          node.remove
        else
          # remove this undesired node and scrub each child node
          remove_node_and_add_children(node)
        end
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      end
    when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
      return Loofah::Scrubber::CONTINUE
    end
    node.remove
    Loofah::Scrubber::STOP
  end

  def remove_node_and_add_children(node)
    # alternatively see :strip
    # node.before node.children
    current_node = node
    node.children.each do |kid|
      previous_node = current_node
      current_node = current_node.add_next_sibling(kid)
      scrub(previous_node) unless previous_node == node
    end
    scrub(current_node) unless current_node == node
    node.remove
  end

  def clean_with_attributes(node, use_attributes = true)
    attr_array = use_attributes ? attributes[node.name] : nil
    node.attributes.each { |attr| node.remove_attribute(attr.first) unless Array(attr_array).include?(attr.first) }
  end
end
class CustomScrubber
  # uses Loofah
  def clean_html(html, tags = [], attributes = {})
    yield Loofah.fragment(html).scrub!(scrub_tags_except(tags, attributes)).to_s
  end

  # perhaps also see the scrubber
  # :newline_block_elements
  def line_breaks_to_br(html)
    html.gsub(/\r?\n/, '<br>')
  end

  # tags is an array of tags
  # attributes is a hash of the previous tags with an array of their whitelisted attributes
  # needs to be DRYed
  def scrub_tags_except(tags, attributes)
    options = { :tags => tags, :attributes => attributes }
    WhiteListTagScrubber.new(options)
  end
end
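The `:tags` and `:attributes` options in the scrubber above boil down to a per-tag lookup. As a rough sketch that runs without Nokogiri or Loofah, the hypothetical helper `filter_attributes` below applies the same rule as `clean_with_attributes` to a plain hash of attribute names and values: a whitelisted tag keeps only its listed attributes, and any other tag keeps none. The helper name and the sample data are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of Loofah: keeps only the attributes
# whitelisted for the given tag. Tags missing from the whitelist keep nothing,
# because Array(nil) is an empty array.
def filter_attributes(tag, attrs, whitelist)
  allowed = Array(whitelist[tag])
  attrs.select { |name, _value| allowed.include?(name) }
end

all_attributes = %w[id class]
whitelist = {
  'br' => [],
  'p'  => all_attributes,
  'a'  => %w[href rel] + all_attributes
}

# 'onclick' is not listed for <a>, so only 'href' survives.
filter_attributes('a', { 'href' => '/home', 'onclick' => 'evil()' }, whitelist)

# <div> is not in the whitelist at all, so the result is an empty hash.
filter_attributes('div', { 'class' => 'x' }, whitelist)
```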
Curious, anything new on this issue? What's the current way of handling custom scrubbers? The solutions here seem a bit laborious (relative to how Sanitize handles custom configs).
:+1: Completely agree with @abitdodgy
Just take a look at how simple and straightforward this DSL is: https://github.com/rgrove/sanitize/blob/master/lib/sanitize/config/relaxed.rb
Being able to process something like that, and perhaps even applying additional regexes to attribute values (such as a background image src), would be a big win. I would just use Sanitize, but seeing as this is getting merged into Rails 4.2, I thought it would be a useful addition.
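To make the request concrete, here is one possible shape for such a declarative config with regex checks on attribute values. This is a sketch under stated assumptions: `DEMO_CONFIG` and `attribute_allowed?` are hypothetical names, and the structure is neither Sanitize's nor Loofah's actual API.

```ruby
# Hypothetical declarative config: tag => { attribute => rule }.
# A rule is either `true` (any value is fine) or a Regexp the value must match.
DEMO_CONFIG = {
  'a'   => { 'href' => %r{\A(?:https?://|/)}i, 'rel' => true },
  'img' => { 'src'  => %r{\Ahttps://}i, 'alt' => true }
}.freeze

# Returns true when the tag/attribute pair is listed and, if the rule is a
# Regexp, the value matches it. Unlisted tags and attributes are rejected.
def attribute_allowed?(tag, name, value, config = DEMO_CONFIG)
  rule = config.fetch(tag, {})[name]
  case rule
  when Regexp then rule.match?(value.to_s)
  when true   then true
  else false
  end
end

attribute_allowed?('a', 'href', 'https://example.com')  # => true
attribute_allowed?('a', 'href', 'javascript:alert(1)')  # => false
attribute_allowed?('img', 'onerror', 'x')               # => false
```

Keeping the rules as plain data means a caller could merge or override entries per application without subclassing a scrubber.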
+1, would really like this feature.
+1 too, 12 years later 🙁
@jemminger Please consider using https://github.com/rgrove/sanitize for a customizable sanitizer