nokogiri
nokogiri copied to clipboard
Added new method process_text for extract text
Why was the new method implemented
In some cases, the HTML markup structure does not allow the text method to get data, similar to displaying them on the site.
Ex1:
doc = Nokogiri::HTML(<<EOT)
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry"> Best Sellers Rank</th>
<td><span> <span>#133 in Video Games</span> <br> <span>#2 in <a
href="/gp/bestsellers/videogames/402051011/ref=pd_zg_hrsr_videogames">PC Gaming<br>Keyboards</a></span> <br> </span>
</td>
</tr>
EOT
method text:
doc.xpath("//tr/td").text.strip
output:
"#133 in Video Games #2 in PC GamingKeyboards"
method process_text:
doc.xpath("//tr/td").process_text
output:
"#133 in Video Games #2 in PC Gaming Keyboards"
Ex2:
doc = Nokogiri::HTML(<<EOT)
<div class="b-comment-reply__main">
<div class="b-comment-reply__row">
<div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
</div>
</div>
</div>
EOT
method text:
doc.xpath("//div[@class='b-comment-reply__body']/bbb").text.strip
output:
"1 2 3 4"
method process_text:
doc.xpath("//div[@class='b-comment-reply__body']/bbb").process_text
output:
"1234"
For this reason, the method process_text was implemented
How the method works
The work of the method is to recursively pass from node_set to text.
In the nodeset and node classes, join by space is implemented:
result.join(' ')
In the text class, the removal of double spaces and spaces from thel eading and trailing whitespace:
self.text.gsub(/[[:space:]]+/, ' ').strip
Method process_text return nil if text not found.
The process_text! method was also implemented which returns raise if the text was not found
Error handler class:
class NoFoundText < RuntimeError
def message
"Text wasn't found in the node"
end
end
Examples of using:
doc = Nokogiri::HTML(<<EOT)
<div class="b-comment-reply__main">
<div class="b-comment-reply__row">
<div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
</div>
</div>
</div>
EOT
# '0 node'
doc.xpath("//div[@class='b-comment-reply__body']/fff").process_text
output: nil
doc.at("//div[@class='b-comment-reply__body']/fff")&.process_text
output: nil
# '0 node raise'
doc.xpath("//div[@class='b-comment-reply__body']/fff").process_text!
output: Text wasn't found in the node (Nokogiri::XML::NodeSet::NoFoundText)
# '1 node'
doc.xpath("//div[@class='b-comment-reply__body']").process_text
output: "1 2 3 4"
doc.at("//div[@class='b-comment-reply__body']")&.process_text
output: "1 2 3 4"
# '2 nodes'
doc.xpath("//div[@class='b-comment-reply__body']/bbb").process_text
output: "1 2 3 4"
doc.at("//div[@class='b-comment-reply__body']/bbb")&.process_text
output: "1 2 3"
Hello! Thanks for submitting this idea for a new feature to Nokogiri.
I'm going to take some time to reflect on whether this functionality should be part of Nokogiri, or whether it better belongs in something like Loofah (which has a very similar method #to_text
).
Hi, I apologize for the long delay in replying.
Although this feature does seem useful for a specific use case, it doesn't seem like it's common enough to include into Nokogiri itself. I can easily imagine people wanting variations on this, like newlines instead of spaces, etc. In fact the Loofah gem was originally invented to allow people to customize string operations like this!
Perhaps you would consider packaging this as a gem that can do a variety of string operations similar to this on Nokogiri nodesets? Or submitting it to Loofah?
Thanks for suggesting this, and I do hope you consider contributing again in the future!
Whitespace inside HTML tags is completely arbitrary, so cleanup should be default for text()
, to mimic browsers. Only text inside <pre>
tags, and possibly XML, should be returned verbatim, and this could have been auto-detected. For tags CSS-ed as preformatted (not detectable) one would have to pass eg. whitespace: true
.
Yes, Loofah has santizing galore, but you want to keep the number of dependencies down, especially for things that you take for granted for a majority of use cases.
The suggested merge is small and makes sense. Vote to reconsider including this.
@forthrin I tried your second example
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="b-comment-reply__main">
<div class="b-comment-reply__row">
<div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
</div>
</div>
</div>
EOT
puts(doc.xpath("//div[@class='b-comment-reply__body']/bbb").text.strip)
and it outputs 1234
without any spaces, as I'd expect. And it does so even if you remove the .strip
.
I think in the first example, you're mistaken about what the browser is doing.
Here's the HTML from your example in a.html
<!DOCTYPE html>
<table>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry"> Best Sellers Rank</th>
<td><span> <span>#133 in Video Games</span> <br> <span>#2 in <a
href="/gp/bestsellers/videogames/402051011/ref=pd_zg_hrsr_videogames">PC Gaming<br>Keyboards</a></span> <br> </span>
</td>
</tr>
</table>
Here's some Ruby code that gets the text from //tr/td
.
require 'nokogiri'
doc = Nokogiri::HTML(File.read('a.html'))
puts("\"#{doc.xpath("//tr/td").text}\"")
Note that I removed strip
and added quotation marks around the string. When I run the code, I get the following output.
" #133 in Video Games #2 in PC GamingKeyboards
"
If I open a.html
in Firefox and ask for the textContent
of the matching node, I get precisely the same output. Here's the Javascript console.
» document.evaluate('//tr/td', document, null, XPathResult.ANY_UNORDERED_NODE_TYPE, null).singleNodeValue.textContent
← " #133 in Video Games #2 in PC GamingKeyboards
"
In other words, I think Nokogiri's current behavior is identical to the browser's behavior.
Thanks for your interest. I should point out I'm not the OP. I just started using Nokogiri and find myself constantly adding .gsub(/[[:space:]]+/, ' ').strip
to every .text
.
(Whenever cleaning text appears "unnecessary", it's temporal "luck" subject to change whenever the web server feels like spouting HTML differently.) Most would expect ...
doc = "<p>\n\x20Hello\n\x20World\n</p>"
printf "(%s)\n", Nokogiri.HTML5(doc).at_css('p').text`
... to return ...
(Hello World)
... and not ...
(
Hello
World
)
@forthrin Sorry about that. I completely missed that you weren't the OP.
I tried out your example in my browser
<!DOCTYPE html>
<p>
Hello
World
</p>
And asked for the textContent of the p
. Here's the JavaScript console
» document.querySelector('p').textContent
← "
Hello
World
"
(JavaScript also has an innerText
property that does what you want here, but innerText
is dependent on the browser's rendering of the nodes in question which is something that Nokogiri really can't do. As a result, Nokogiri's #inner_text
method is the same as its #text
method and produce what JavaScript calls textContent
.)
To expand on that last point a bit, since I think something like innerText
is what people are looking for here. The reason that this is impossible in Nokogiri is that it depends whether the node is going to be rendered at all (let's ignore that here) as well as how the node has been styled.
For example, if I change the p
element to have the attribute style='white-space: pre'
and then ask for the innerText
, I get this.
» document.querySelector('p').innerText
← "
Hello
World
"
Nokogiri simply has no idea how the DOM tree it creates from the input HTML will be rendered and thus cannot support such transformations itself.
Using the Nokogiri API, you can perform the space compression/stripping. Here, for example, is the #compress_spaces
method from the Squoosh gem.
def compress_spaces(node)
if node.text?
if text_node_removable? node
node.unlink
else
content = node.content
content.gsub!(/[ \t\n\r\f]+/, ' ')
content.lstrip! if trim_left? node
content.rstrip! if trim_right? node
node.content = content
end
elsif node.element? &&
(node.name == 'pre' || node.name == 'textarea')
# Leave the contents of these nodes alone.
elsif normal_element?(node) || node.name == 'title'
# Compress spaces in normal elements and title.
node.children.each { |c| compress_spaces c }
end
nil
end
I guess I could refactor that gem to expose this function, but using Squoosh will add dependencies on sassc, execjs, uglifier, and ffi, none of which are necessary for compressing spaces. Also, note that this code assumes that only pre
and textarea
elements have the white-space: pre
style.
I can't remember a single time I wanted to persist excess whitespace when scraping HTML. If cleaning whitespace has major implications like lots of added dependencies or piling on lots of other complex sanitizing to cover all bases, the pragmatic solution for me would be to simply override .text()
to do what I expect.
Just to be clear, the compress_spaces
function (and the other functions it calls) don't require any extra dependencies. It's just that the Squoosh gem itself depends on other things for doing JavaScript and CSS minification which isn't relevant to this task. You could pretty easily extract the bit of code that compresses spaces if you wanted.
the pragmatic solution for me would be to simply override
.text()
to do what I expect
I don't have an opinion on overriding the method vs. adding a new one that does what you want, but I think doing it outside Nokogiri is the right thing to do since it'd be painful to try to support all possible ways of extracting text from the DOM.
For example, what do you expect to get as the text from<span>foo<br>bar</span>
? Here are three possible answers:
-
"foo\nbar"
. This is what JavaScript'sinnerText
will give you (at least when thespan
is rendered with the default styling) -
"foo bar"
. I think the OP's code will give this -
"foobar"
. This is what JavaScript'stextContent
property and Nokogiri's#text
method return.
One more example (and I'll stop here): <p>foo<span> X</span>bar</p>
(note the space before the X). This will be rendered as foo Xbar
(and is what both .innerText
and .textContent
give). The OP's code would give foo X bar
. Stripping spaces of children before concatenating would give fooXbar
. So in any event, you'll need to be careful with modifications you perform around spacing since there are lots of different ways you might want something to appear and unless you're very careful about all the corner cases (which Squoosh tries to be), it's easy to get it wrong.
I appreciate you're giving this proper thought and considering edge cases. My default approach would be:
- Replace all tags except
<br/>
with whitespace - Replace all Unicode whitespace with a single plain space
- Replace each
<br/>
with a newline - Trim all left and right whitespace except newlines
So for the examples given, I would expect foo\nbar
and foo X bar
.
Now, it's impossible to determine if <br/>
tags should be converted to newlines or spaces, as this depends on the content, layout, the author's intentions (and ultimately the scraper's intentions).<br/>
is eval()
's even more evil cousin and an anomaly that should have been deprecated from the HTML standard along with <blink>
. Whatever the author wishes to achieve with <br/>
should be solved with good semantics and CSS.
Have a look at this HTML scraper. It's met expectations for years, though I'm sure you can break it if you set out to do so. It also acknowledges <br/>
rogue nature by caving in with a DEFAULT_BR_TEXT
setting.
https://sourceforge.net/p/simplehtmldom/repository/ci/master/tree/HtmlNode.php#l328
I am not concerned with what JavaScript does. My use case, and most people's I assume, is scraping web content, and you want that content to be clean, regardless of what sloppy street HTML appears in the wild.
If you're still conflicted, feel free to throw more dubious and ambigious examples at me. I used to work for a browser company where endless debates on subjects like this were as commonplace as drawing breath.
We are not going to change the behavior of Node#text
or any existing methods. Nokogiri is deployed far too widely to entertain a breaking change like this.
I'm open to adding a new method if it's clear what the behavior should be. I'm not at all sure that this is true in this case, people are going to have different opinions, and if that's not obvious just from the four of us involved in this issue, please search through Loofah's closed issues for to_text
and see the variety of opinions there.
In the absence of a canonical implementation, extensibility is the right choice. And you have two choices!
- Nokogiri offers up
Document#decorators
which are modules that every node in that document isextend
ed with. So you are empowered to implement (or re-implement) your own methods however you wish. - Loofah is an extensible framework to manipulate Nokogiri documents.
Backwards compatibility is paramount. Comments on the propositions in order of preference.
- Configuration: Little code added. Easy to use for everyone.
- Loofah: Fine if you don't care about dependencies, but sort of like shooting sparrows with a canon.
- Decorators: A generous, but counterproductive offer. Every user must re-invent the wheel.
Keep me posted on any development. I'll be happy to test experimental "commit(ment)s from Sparkle Motion" ;)