feat: encapsulate some whitespace-handling into a scrubber (or scrubbers)

Open flavorjones opened this issue 1 year ago • 3 comments

From a slack thread at https://rubyonrails-link.slack.com/archives/C05054QPL/p1700056469860939

Has anyone found a way to use Nokogiri (or Loofah) to replace double-break tags with closing/opening paragraph tags? I have a lot of code in the database with this madness, and I would like to scrub it back out:

<p>Some text here in a logical paragraph.
  <br>
  <br>
  Some more text, apparently a second paragraph.
  <br>
  <br>
  Et cetera...
</p>

and I replied with:

#!/usr/bin/env ruby

require "nokogiri"

html = <<~HTML
<p>Some text here in a logical paragraph.
  <br>
  <br>
  Some more text, apparently a second paragraph.
  <br>
  <br>
  Et cetera...
</p>
<p>foo
  <br id=1>
  <br id=2>
  bar
  <br id=11>
  <br id=12>
  bar
</p>
<p>baz
  <br id=3>
</p>
<notp>foo
  <br id=4>
  <br id=5>
</notp>
HTML

doc = Nokogiri::HTML5::Document.parse(html)
puts doc.to_html

p_with_brs = doc.xpath(%q{//p[br[following-sibling::br]]})

p_with_brs.each do |p|
  new_p = p.add_previous_sibling("<p>").first

  # remove blank text nodes
  p.children.each do |c|
    c.unlink if c.text? && c.blank?
  end

  p.children.each do |c|
    next if c.parent.nil? # already unlinked
    if c.name == "br" && c.next_sibling&.name == "br" # &. guards against a trailing <br> with no next sibling
      new_p = p.add_previous_sibling("<p>").first
      c.next_sibling.unlink
      c.unlink
    else
      c.parent = new_p
    end
  end

  p.unlink
end

puts doc.to_html

which outputs:

<html><head></head><body><p>Some text here in a logical paragraph.
  </p><p>
  Some more text, apparently a second paragraph.
  </p><p>
  Et cetera...
</p>
<p>foo
  </p><p>
  bar
  </p><p>
  bar
</p>
<p>baz
  <br id="3">
</p>
<notp>foo
  <br id="4">
  <br id="5">
</notp>
</body></html>

I think this could be useful in a scrubber if it's something people commonly do.

cc @walterdavis

flavorjones • Dec 04 '23 22:12

@torihuang and @josecolella are working on this

josecolella • May 08 '24 15:05

Our initial thoughts are that we should implement a new scrubber, something like doc.scrub!(:breakpoint), which would remove all instances of <br>.

torihuang • May 08 '24 15:05

This is the PR that should get us almost all the way there: https://github.com/flavorjones/loofah/pull/284

josecolella • May 08 '24 19:05