nikkou
Extract useful data from HTML and XML with ease!
Description
Nikkou adds methods to Nokogiri that make it easier to extract commonly used data from HTML and XML. It lets you turn HTML into structured data quickly, and it integrates with Mechanize.
Installation
Add Nikkou to your Gemfile:
gem 'nikkou'
Method Overview
Here's a summary of the methods Nikkou provides (see "Methods" for details):
Formatting
parse_text - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet
time(options={}) - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a time_zone option
url(attribute='href') - Converts the href (or other specified attribute) into an absolute URL using the document's URI; <a href="/p/1">Link</a> yields http://mysite.com/p/1
Searching
attr_equals(attribute, string) - Finds nodes where the attribute equals the string
attr_includes(attribute, string) - Finds nodes where the attribute includes the string
attr_matches(attribute, pattern) - Finds nodes where the attribute matches the pattern
drill(*methods) - Nil-safe method chaining
find(path) - Same as search, but returns the first matched node
text_equals(string) - Finds nodes where the text equals the string
text_includes(string) - Finds nodes where the text includes the string
text_matches(pattern) - Finds nodes where the text matches the pattern
Methods
Formatting
time(options={})
Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.
# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time
Options
attribute
The attribute to parse:
# <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
doc.search('a').first.time(attribute: 'data-published-at')
time_zone
The document's time zone (the time will be converted from that to UTC):
# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time(time_zone: 'America/New_York')
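Nikkou's own parsing logic is not shown in this README. As a rough, standard-library-only sketch of what "relative or absolute" parsing can involve, the hypothetical helper below handles strings such as "3 hours ago" and falls back to Time.parse for absolute timestamps. The names parse_relative_or_absolute and UNIT_SECONDS and the now: parameter are assumptions made for illustration, not part of Nikkou, and the sketch does not cover the time_zone option:

```ruby
require 'time'

# Hypothetical lookup table (not part of Nikkou): seconds per unit.
UNIT_SECONDS = {
  'second' => 1,
  'minute' => 60,
  'hour'   => 3600,
  'day'    => 86_400
}.freeze

# Hypothetical helper (not Nikkou's API): returns a Time in UTC.
# `now` is injectable so relative strings can be resolved deterministically.
def parse_relative_or_absolute(text, now: Time.now.utc)
  relative = text.match(/\A\s*(\d+)\s+(second|minute|hour|day)s?\s+ago\s*\z/i)
  if relative
    # "3 hours ago" -> now minus 3 * 3600 seconds
    now - relative[1].to_i * UNIT_SECONDS.fetch(relative[2].downcase)
  else
    # Absolute timestamps; strings without a zone are read as local time
    # by Time.parse and then converted to UTC.
    Time.parse(text).utc
  end
end

reference = Time.utc(2013, 5, 22, 12, 0, 0)
three_hours_earlier = parse_relative_or_absolute('3 hours ago', now: reference)
published_at = parse_relative_or_absolute('2013-05-22 02:42:34 UTC')
```

Passing the reference time explicitly keeps the relative case testable; in real use the default of the current UTC time would apply.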
url(attribute='href')
Returns an absolute URL; useful for parsing relative hrefs. The document's uri needs to be set for Nikkou to know what domain to add to relative paths.
# <a href="/p/1">My link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"
If Mechanize is being used, the uri doesn't need to be manually set.
Options
attribute
The attribute to parse:
# <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"
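How Nikkou resolves relative paths internally is not documented here, but the same kind of resolution can be sketched with Ruby's standard URI.join. The absolutize helper below is a hypothetical name used only for illustration, not a Nikkou method:

```ruby
require 'uri'

# Hypothetical helper (not Nikkou's API): resolves an href against
# the URI of the page it was found on.
def absolutize(document_uri, href)
  URI.join(document_uri, href).to_s
end

page = 'http://mysite.com/mypage'

absolutize(page, '/p/1')                   # => "http://mysite.com/p/1"
absolutize(page, '/p/1#comments')          # => "http://mysite.com/p/1#comments"
absolutize(page, 'http://other.example/x') # => "http://other.example/x" (already absolute)
```

This is why the document's uri matters: without a base, a path such as /p/1 has no scheme or host to attach to.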
Searching
attr_equals(attribute, string)
Selects nodes where the specified attribute equals the string.
# <div data-type="news">My Text</div>
doc.attr_equals('data-type', 'news').first.text # "My Text"
attr_includes(attribute, string)
Selects nodes where the specified attribute includes the string.
# <div data-type="major-news">My Text</div>
doc.attr_includes('data-type', 'news').first.text # "My Text"
attr_matches(attribute, pattern)
Selects nodes with an attribute matching a pattern. The pattern's matches are available in Node#matches.
# <span data-tooltip="3 Comments">My Text</span>
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]
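The array shown for matches has the same shape as what Ruby's MatchData#to_a returns: the full match first, then each capture group. Assuming Nikkou follows that convention, the plain-Ruby sketch below shows how a captured count could be pulled out of such an array. The variable names are illustrative only:

```ruby
# Plain Ruby, no Nikkou involved: the attribute value from the example above.
tooltip = '3 Comments'

match = tooltip.match(/(\d+) comments/i)

# Full match followed by the capture groups.
matches = match.to_a             # => ["3 Comments", "3"]

# The first capture group is the number as a string; convert it.
comment_count = matches[1].to_i  # => 3
```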
drill(*methods)
Nil-safe method chaining. Replaces this:
node = doc.find('.count')
if node
  attribute = node.attr('data-count')
  if attribute
    return attribute.to_i
  end
end
With this:
return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)
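To make the idea of nil-safe chaining concrete, here is a minimal plain-Ruby sketch of how such a method could work. It is not Nikkou's implementation: drill_sketch is a hypothetical stand-alone function, and it takes the receiver as its first argument instead of being defined on a node. Each step is either a method name or an array of a method name plus arguments, mirroring the call above:

```ruby
# Hypothetical sketch (not Nikkou's implementation) of nil-safe chaining.
# Each step is a Symbol, or an Array of [method_name, *arguments].
def drill_sketch(receiver, *steps)
  steps.reduce(receiver) do |current, step|
    # Stop as soon as any link in the chain is nil.
    return nil if current.nil?

    name, *args = Array(step)
    current.public_send(name, *args)
  end
end

# A Hash stands in for a parsed node here.
present = { 'count' => '42' }
missing = {}

drill_sketch(present, [:[], 'count'], :to_i) # => 42
drill_sketch(missing, [:[], 'count'], :to_i) # => nil
```

The second call shows the point of the early return: without it, nil.to_i would quietly yield 0 instead of signalling that the value was absent.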
find(path)
Same as search, but returns the first matched node. Replaces this:
nodes = node.search('h4')
if nodes
  return nodes.first
end
With this:
return node.find('h4')
text_includes(string)
Selects nodes where the text includes the string.
# <div data-type="news">My Text</div>
doc.text_includes('Text').first.text # "My Text"
text_matches(pattern)
Selects nodes with text matching a pattern. The pattern's matches are available in Node#matches.
# <a href="/p/1">3 Comments</a>
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]
License
Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.