loofah allow custom scrubbers to leverage the HTML5lib scrubbing already written

A couple of commonly requested features:

- add or remove attributes from the whitelists
- turn off CSS scrubbing

Jan 28 '10 07:01 flavorjones

+1 on this ticket / request. I wanted more custom control of my elements/attributes from the whitelist set and I had to achieve it like so:

http://gist.github.com/289027

Jan 28 '10 19:01 ruckus

I'm trying to find a good way to add to the whitelist attributes right now and am coming up empty on a straightforward way to monkeypatch. I just want to add a single element, but it seems excessively hard given the way that whitelist.rb declares the constants and then digests them permanently via the method in whitelist.rb, such that I can't even seem to monkeypatch it.

May 27 '10 06:05 wbharding

I hear you! I'll be working on Loofah a bit over the next couple of weeks, and this will be one of the things I'll work on.

May 28 '10 13:05 flavorjones

fwiw, I did figure out how to monkeypatch it. Just add a new key/value to the HashedWhitelist. But of course it's always a tad nicer when one doesn't need to monkeypatch.

May 28 '10 16:05 wbharding

Any thoughts or progress on this? I need to add and remove some whitelist attributes.

Oct 25 '10 22:10 electrum

Just released 1.0.0; this is probably my next priority.

Any thoughts on what you think the API should look like to control whitelists?

Oct 26 '10 05:10 flavorjones

I have some almost complete work I've been doing on a whitelist for elements and attributes, just FYI (the use case of valid with nested invalid with nested valid is still broken): https://github.com/bf4/Notes/blob/master/code/ruby/html_processing.rb When it's ready for a pull request, I'll do that.

Mar 19 '12 00:03 bf4

It's worth noting that I've got a branch somewhere that I started, which implements a Rails-internals-compatible implementation of whitelists. This is so that, at some point, Loofah may be a pluggable sanitizer for any Rails app.

I should probably finish that up. ;)

Mar 20 '12 21:03 flavorjones

I still need to write a pull request, but the WhiteListTagScrubber really does work: https://github.com/bf4/Notes/blob/loofah-testing/code/ruby/html_processing.rb

# usage
# all_attributes = ['id','class']
# tags_we_want =
#   {
#   'br' => [],
#   'ol' => all_attributes,
#   'ul' => all_attributes,
#   'li' => all_attributes,
#   'strong' => all_attributes,
#   'p' => all_attributes,
#   'i' => all_attributes,
#   'em' => all_attributes,
#   'a' => ['href','rel'].concat(all_attributes)
# }
# updater = CustomScrubber.new
# updater.clean_html(message_dirty, tags_we_want.keys, tags_we_want) do |html|
#      updater.line_breaks_to_br(html)
# end


class WhiteListTagScrubber < Loofah::Scrubber
  attr_reader :tags, :attributes
  def initialize(options = {}, &block)
    @tags = Array(options.delete(:tags))
    @attributes = options.delete(:attributes) || {}
    super(options, &block)
  end
  def debug(type,&block)
    if ENV['DEBUG'] =~ /true/i
      puts "**** #{type}, #{block.call.inspect}"
    end
  end
  def scrub(node)
    debug("processing") {  "#{node.type}: #{node.name}, namespaces #{node.namespaces.inspect}" }
    case node.type
    when Nokogiri::XML::Node::ELEMENT_NODE

      # see strip: return CONTINUE if html5lib_sanitize(node) == CONTINUE
      if tags.include? node.name
        # remove all attributes except the ones we whitelisted per tag
        clean_with_attributes(node,true)
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      else
        # remove all attributes
        clean_with_attributes(node,false)
        # remove the node and its contents entirely.
        # there's nothing good in these
        if %w{script style meta link}.include?(node.name)
          node.remove
        else
          # remove this undesired node and scrub each child node
          remove_node_and_add_children(node)
        end
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      end
    when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
      return Loofah::Scrubber::CONTINUE
    end
    node.remove
    Loofah::Scrubber::STOP
  end
  def remove_node_and_add_children(node)
    # alternatively see :strip
    # node.before node.children
    current_node = node
    node.children.each do |kid|
      previous_node = current_node
      current_node = current_node.add_next_sibling(kid)
      scrub(previous_node) unless previous_node == node
    end
    scrub(current_node) unless current_node == node
    node.remove
  end
  def clean_with_attributes(node,use_attributes=true)
    attr_array = use_attributes ? attributes[node.name] : nil
    node.attributes.each { |attr| node.remove_attribute(attr.first) unless Array(attr_array).include?(attr.first)}
  end
end

class CustomScrubber
  # uses Loofah
  def clean_html(html, tags=[],attributes={})

    yield Loofah.fragment(html).scrub!(scrub_tags_except(tags,attributes)).to_s

  end
  # perhaps also see the scrubber
  # :newline_block_elements
  def line_breaks_to_br(html)
    html.gsub(/\r?\n/,'<br>')
  end
  # tags in an array of tags
  # attributes is a hash of the previous tags with an array of their whitelisted attributes
  # needs to be DRYed
  def scrub_tags_except(tags,attributes)
    options = {:tags => tags, :attributes => attributes }
    WhiteListTagScrubber.new(options)
  end
end
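
For anyone skimming, the per-tag lookup that clean_with_attributes relies on can be tried without Loofah or Nokogiri. This is only a rough sketch: filter_attributes and TAG_WHITELIST are hypothetical names made up for illustration (they are not part of Loofah or of the code above), and a plain Hash stands in for a node's attributes.

```ruby
# Hypothetical stand-in for the tags_we_want hash from the usage notes above.
ALL_ATTRIBUTES = %w[id class].freeze
TAG_WHITELIST = {
  "br" => [],
  "p"  => ALL_ATTRIBUTES,
  "a"  => %w[href rel] + ALL_ATTRIBUTES
}.freeze

# Hypothetical helper (not a Loofah API): keep only the attributes that are
# whitelisted for this tag. Tags missing from the whitelist keep nothing,
# because Array(nil) is an empty array.
def filter_attributes(tag_name, attrs, whitelist)
  allowed = Array(whitelist[tag_name])
  attrs.select { |name, _value| allowed.include?(name) }
end

link_attrs = { "href" => "/x", "onclick" => "evil()", "class" => "link" }
p filter_attributes("a", link_attrs, TAG_WHITELIST)
p filter_attributes("script", { "src" => "evil.js" }, TAG_WHITELIST)
```

The first call should keep only href and class, and the second should return an empty hash, which matches the "remove all attributes" branch the scrubber takes for tags that are not whitelisted.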

Apr 09 '13 19:04 bf4

Curious, anything new on this issue? What's the current way of handling custom scrubbers? The solutions here seem a bit laborious (relative to how Sanitize handles custom configs).

Nov 25 '13 14:11 abitdodgy

:+1: Completely agree with @abitdodgy

Just take a look at how simple and straightforward this DSL is: https://github.com/rgrove/sanitize/blob/master/lib/sanitize/config/relaxed.rb

Having a means to process something like that, and perhaps even additional regex checks on attribute values such as a background image src, would be a big win. I would just use Sanitize, but seeing as this is getting merged into Rails 4.2, I thought it would be a useful addition.

Oct 08 '14 18:10 saneshark

+1, would really like this feature.

Mar 19 '18 19:03 DanDevine

+1 too, 12 years later 🙁

Jan 05 '22 22:01 jemminger

@jemminger Please consider using https://github.com/rgrove/sanitize for a customizable sanitizer

Jan 06 '22 14:01 flavorjones
