loofah
feat: encapsulate some whitespace-handling into a scrubber (or scrubbers)
From a slack thread at https://rubyonrails-link.slack.com/archives/C05054QPL/p1700056469860939
Has anyone found a way to use Nokogiri (or Loofah) to replace double-break tags with closing/opening paragraph tags? I have a lot of code in the database with this madness, and I would like to scrub it back out:
<p>Some text here in a logical paragraph. <br> <br> Some more text, apparently a second paragraph. <br> <br> Et cetera... </p>
and I replied with:
#!/usr/bin/env ruby
require "nokogiri"
html = <<~HTML
<p>Some text here in a logical paragraph.
<br>
<br>
Some more text, apparently a second paragraph.
<br>
<br>
Et cetera...
</p>
<p>foo
<br id=1>
<br id=2>
bar
<br id=11>
<br id=12>
bar
</p>
<p>baz
<br id=3>
</p>
<notp>foo
<br id=4>
<br id=5>
</notp>
HTML
doc = Nokogiri::HTML5::Document.parse(html)
puts doc.to_html
# Select every <p> that has at least two <br> children.
p_with_brs = doc.xpath(%q{//p[br[following-sibling::br]]})
p_with_brs.each do |p|
  new_p = p.add_previous_sibling("<p>").first
  # remove blank text nodes
  p.children.each do |c|
    c.unlink if c.text? && c.blank?
  end
  p.children.each do |c|
    next if c.parent.nil? # already unlinked
    # &. guards against a trailing <br> that has no next sibling
    if c.name == "br" && c.next_sibling&.name == "br"
      # a double break starts a new paragraph
      new_p = p.add_previous_sibling("<p>").first
      c.next_sibling.unlink
      c.unlink
    else
      c.parent = new_p
    end
  end
  p.unlink
end
puts doc.to_html
which outputs:
<html><head></head><body><p>Some text here in a logical paragraph.
</p><p>
Some more text, apparently a second paragraph.
</p><p>
Et cetera...
</p>
<p>foo
</p><p>
bar
</p><p>
bar
</p>
<p>baz
<br id="3">
</p>
<notp>foo
<br id="4">
<br id="5">
</notp>
</body></html>
I think this could be useful in a scrubber if it's something people commonly do.
cc @walterdavis
@torihuang and @josecolella are working on this
Our initial thought is that we should implement a new scrubber, invoked like doc.scrub!(:breakpoint), which would remove all instances of <br>.
This is the PR that should get us almost all the way there: https://github.com/flavorjones/loofah/pull/284