nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

Added new method process_text for extract text

Open Igor-Ivankovich opened this issue 2 years ago • 1 comments

Why was the new method implemented

In some cases, the HTML markup structure does not allow the text method to get data, similar to displaying them on the site.

Ex1:

doc = Nokogiri::HTML(<<EOT)
	<tr>
	    <th class="a-color-secondary a-size-base prodDetSectionEntry"> Best Sellers Rank</th>
	    <td><span>  <span>#133 in Video Games</span> <br>  <span>#2 in <a
	            href="/gp/bestsellers/videogames/402051011/ref=pd_zg_hrsr_videogames">PC Gaming<br>Keyboards</a></span> <br>  </span>
	    </td>
	</tr>
EOT

method text: doc.xpath("//tr/td").text.strip

output: "#133 in Video Games #2 in PC GamingKeyboards"

method process_text: doc.xpath("//tr/td").process_text

output: "#133 in Video Games #2 in PC Gaming Keyboards"

Ex2:

doc = Nokogiri::HTML(<<EOT)
    <div class="b-comment-reply__main">
        <div class="b-comment-reply__row">
            <div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
            </div>
        </div>
    </div>
EOT

method text: doc.xpath("//div[@class='b-comment-reply__body']/bbb").text.strip

output: "1 2 3 4"

method process_text: doc.xpath("//div[@class='b-comment-reply__body']/bbb").process_text

output: "1234"

For this reason, the method process_text was implemented

How the method works

The work of the method is to recursively pass from node_set to text.

In the nodeset and node classes, join by space is implemented:

result.join(' ')

In the text class, the removal of double spaces and spaces from thel eading and trailing whitespace:

self.text.gsub(/[[:space:]]+/, ' ').strip

Method process_text return nil if text not found.

The process_text! method was also implemented which returns raise if the text was not found

Error handler class:

  class NoFoundText < RuntimeError
    def message
      "Text wasn't found in the node"
    end
  end

Examples of using:

doc = Nokogiri::HTML(<<EOT)
    <div class="b-comment-reply__main">
        <div class="b-comment-reply__row">
            <div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
            </div>
        </div>
    </div>
EOT


# '0 node'
doc.xpath("//div[@class='b-comment-reply__body']/fff").process_text 
output: nil

doc.at("//div[@class='b-comment-reply__body']/fff")&.process_text
output: nil

# '0 node raise'
doc.xpath("//div[@class='b-comment-reply__body']/fff").process_text!
output: Text wasn't found in the node (Nokogiri::XML::NodeSet::NoFoundText)

# '1 node'
doc.xpath("//div[@class='b-comment-reply__body']").process_text
output: "1 2 3 4"
doc.at("//div[@class='b-comment-reply__body']")&.process_text
output: "1 2 3 4"

# '2 nodes'
doc.xpath("//div[@class='b-comment-reply__body']/bbb").process_text
output: "1 2 3 4"
doc.at("//div[@class='b-comment-reply__body']/bbb")&.process_text
output: "1 2 3"

Igor-Ivankovich avatar Jun 10 '22 11:06 Igor-Ivankovich

Hello! Thanks for submitting this idea for a new feature to Nokogiri.

I'm going to take some time to reflect on whether this functionality should be part of Nokogiri, or whether it better belongs in something like Loofah (which has a very similar method #to_text).

flavorjones avatar Jun 10 '22 12:06 flavorjones

Hi, I apologize for the long delay in replying.

Although this feature does seem useful for a specific use case, it doesn't seem like it's common enough to include into Nokogiri itself. I can easily imagine people wanting variations on this, like newlines instead of spaces, etc. In fact the Loofah gem was originally invented to allow people to customize string operations like this!

Perhaps you would consider packaging this as a gem that can do a variety of string operations similar to this on Nokogiri nodesets? Or submitting it to Loofah?

Thanks for suggesting this, and I do hope you consider contributing again in the future!

flavorjones avatar Aug 27 '22 01:08 flavorjones

Whitespace inside HTML tags is completely arbitrary, so cleanup should be default for text(), to mimic browsers. Only text inside <pre> tags, and possibly XML, should be returned verbatim, and this could have been auto-detected. For tags CSS-ed as preformatted (not detectable) one would have to pass eg. whitespace: true.

Yes, Loofah has santizing galore, but you want to keep the number of dependencies down, especially for things that you take for granted for a majority of use cases.

The suggested merge is small and makes sense. Vote to reconsider including this.

forthrin avatar Feb 24 '23 15:02 forthrin

@forthrin I tried your second example

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
    <div class="b-comment-reply__main">
        <div class="b-comment-reply__row">
            <div class="b-comment-reply__body"><bbb>1<bbb>2</bbb>3</bbb><bbb>4</bbb>
            </div>
        </div>
    </div>
EOT

puts(doc.xpath("//div[@class='b-comment-reply__body']/bbb").text.strip)

and it outputs 1234 without any spaces, as I'd expect. And it does so even if you remove the .strip.

I think in the first example, you're mistaken about what the browser is doing.

Here's the HTML from your example in a.html

<!DOCTYPE html>
<table>
	<tr>
	    <th class="a-color-secondary a-size-base prodDetSectionEntry"> Best Sellers Rank</th>
	    <td><span>  <span>#133 in Video Games</span> <br>  <span>#2 in <a
	            href="/gp/bestsellers/videogames/402051011/ref=pd_zg_hrsr_videogames">PC Gaming<br>Keyboards</a></span> <br>  </span>
	    </td>
	</tr>
</table>

Here's some Ruby code that gets the text from //tr/td.

require 'nokogiri'

doc = Nokogiri::HTML(File.read('a.html'))
puts("\"#{doc.xpath("//tr/td").text}\"")

Note that I removed strip and added quotation marks around the string. When I run the code, I get the following output.

"  #133 in Video Games   #2 in PC GamingKeyboards
	    "

If I open a.html in Firefox and ask for the textContent of the matching node, I get precisely the same output. Here's the Javascript console.

» document.evaluate('//tr/td', document, null, XPathResult.ANY_UNORDERED_NODE_TYPE, null).singleNodeValue.textContent
← "  #133 in Video Games   #2 in PC GamingKeyboards   
	      " 

In other words, I think Nokogiri's current behavior is identical to the browser's behavior.

stevecheckoway avatar Feb 24 '23 15:02 stevecheckoway

Thanks for your interest. I should point out I'm not the OP. I just started using Nokogiri and find myself constantly adding .gsub(/[[:space:]]+/, ' ').strip to every .text.

(Whenever cleaning text appears "unnecessary", it's temporal "luck" subject to change whenever the web server feels like spouting HTML differently.) Most would expect ...

doc = "<p>\n\x20Hello\n\x20World\n</p>"
printf "(%s)\n", Nokogiri.HTML5(doc).at_css('p').text`

... to return ...

(Hello World)

... and not ...

(
 Hello

 World
)

forthrin avatar Feb 24 '23 16:02 forthrin

@forthrin Sorry about that. I completely missed that you weren't the OP.

I tried out your example in my browser

<!DOCTYPE html>
<p>
 Hello
 World
</p>

And asked for the textContent of the p. Here's the JavaScript console

» document.querySelector('p').textContent
← "
   Hello
   World
  "

(JavaScript also has an innerText property that does what you want here, but innerText is dependent on the browser's rendering of the nodes in question which is something that Nokogiri really can't do. As a result, Nokogiri's #inner_text method is the same as its #text method and produce what JavaScript calls textContent.)

stevecheckoway avatar Feb 24 '23 16:02 stevecheckoway

To expand on that last point a bit, since I think something like innerText is what people are looking for here. The reason that this is impossible in Nokogiri is that it depends whether the node is going to be rendered at all (let's ignore that here) as well as how the node has been styled.

For example, if I change the p element to have the attribute style='white-space: pre' and then ask for the innerText, I get this.

» document.querySelector('p').innerText 
← "
   Hello
   World
  "

Nokogiri simply has no idea how the DOM tree it creates from the input HTML will be rendered and thus cannot support such transformations itself.

Using the Nokogiri API, you can perform the space compression/stripping. Here, for example, is the #compress_spaces method from the Squoosh gem.

    def compress_spaces(node)
      if node.text?
        if text_node_removable? node
          node.unlink
        else
          content = node.content
          content.gsub!(/[ \t\n\r\f]+/, ' ')
          content.lstrip! if trim_left? node
          content.rstrip! if trim_right? node
          node.content = content
        end
      elsif node.element? &&
            (node.name == 'pre' || node.name == 'textarea')
        # Leave the contents of these nodes alone.
      elsif normal_element?(node) || node.name == 'title'
        # Compress spaces in normal elements and title.
        node.children.each { |c| compress_spaces c }
      end
      nil
    end

I guess I could refactor that gem to expose this function, but using Squoosh will add dependencies on sassc, execjs, uglifier, and ffi, none of which are necessary for compressing spaces. Also, note that this code assumes that only pre and textarea elements have the white-space: pre style.

stevecheckoway avatar Feb 24 '23 16:02 stevecheckoway

I can't remember a single time I wanted to persist excess whitespace when scraping HTML. If cleaning whitespace has major implications like lots of added dependencies or piling on lots of other complex sanitizing to cover all bases, the pragmatic solution for me would be to simply override .text() to do what I expect.

forthrin avatar Feb 24 '23 18:02 forthrin

Just to be clear, the compress_spaces function (and the other functions it calls) don't require any extra dependencies. It's just that the Squoosh gem itself depends on other things for doing JavaScript and CSS minification which isn't relevant to this task. You could pretty easily extract the bit of code that compresses spaces if you wanted.

the pragmatic solution for me would be to simply override .text() to do what I expect

I don't have an opinion on overriding the method vs. adding a new one that does what you want, but I think doing it outside Nokogiri is the right thing to do since it'd be painful to try to support all possible ways of extracting text from the DOM.

For example, what do you expect to get as the text from<span>foo<br>bar</span>? Here are three possible answers:

  1. "foo\nbar". This is what JavaScript's innerText will give you (at least when the span is rendered with the default styling)
  2. "foo bar". I think the OP's code will give this
  3. "foobar". This is what JavaScript's textContent property and Nokogiri's #text method return.

One more example (and I'll stop here): <p>foo<span> X</span>bar</p> (note the space before the X). This will be rendered as foo Xbar (and is what both .innerText and .textContent give). The OP's code would give foo X bar. Stripping spaces of children before concatenating would give fooXbar. So in any event, you'll need to be careful with modifications you perform around spacing since there are lots of different ways you might want something to appear and unless you're very careful about all the corner cases (which Squoosh tries to be), it's easy to get it wrong.

stevecheckoway avatar Feb 24 '23 19:02 stevecheckoway

I appreciate you're giving this proper thought and considering edge cases. My default approach would be:

  1. Replace all tags except <br/> with whitespace
  2. Replace all Unicode whitespace with a single plain space
  3. Replace each <br/> with a newline
  4. Trim all left and right whitespace except newlines

So for the examples given, I would expect foo\nbar and foo X bar.

Now, it's impossible to determine if <br/> tags should be converted to newlines or spaces, as this depends on the content, layout, the author's intentions (and ultimately the scraper's intentions).<br/> is eval()'s even more evil cousin and an anomaly that should have been deprecated from the HTML standard along with <blink>. Whatever the author wishes to achieve with <br/> should be solved with good semantics and CSS.

Have a look at this HTML scraper. It's met expectations for years, though I'm sure you can break it if you set out to do so. It also acknowledges <br/> rogue nature by caving in with a DEFAULT_BR_TEXT setting.

https://sourceforge.net/p/simplehtmldom/repository/ci/master/tree/HtmlNode.php#l328

I am not concerned with what JavaScript does. My use case, and most people's I assume, is scraping web content, and you want that content to be clean, regardless of what sloppy street HTML appears in the wild.

If you're still conflicted, feel free to throw more dubious and ambigious examples at me. I used to work for a browser company where endless debates on subjects like this were as commonplace as drawing breath.

forthrin avatar Feb 25 '23 07:02 forthrin

We are not going to change the behavior of Node#text or any existing methods. Nokogiri is deployed far too widely to entertain a breaking change like this.

I'm open to adding a new method if it's clear what the behavior should be. I'm not at all sure that this is true in this case, people are going to have different opinions, and if that's not obvious just from the four of us involved in this issue, please search through Loofah's closed issues for to_text and see the variety of opinions there.

In the absence of a canonical implementation, extensibility is the right choice. And you have two choices!

  • Nokogiri offers up Document#decorators which are modules that every node in that document is extended with. So you are empowered to implement (or re-implement) your own methods however you wish.
  • Loofah is an extensible framework to manipulate Nokogiri documents.

flavorjones avatar Feb 27 '23 02:02 flavorjones

Backwards compatibility is paramount. Comments on the propositions in order of preference.

  • Configuration: Little code added. Easy to use for everyone.
  • Loofah: Fine if you don't care about dependencies, but sort of like shooting sparrows with a canon.
  • Decorators: A generous, but counterproductive offer. Every user must re-invent the wheel.

Keep me posted on any development. I'll be happy to test experimental "commit(ment)s from Sparkle Motion" ;)

forthrin avatar Feb 27 '23 05:02 forthrin