Default reader gets wrong content
Ran into an issue with Pismo's default reader returning the wrong section of an HTML document for its body/html_body fields. It does work, however, with the cluster reader. This might be a good addition to the test corpus for the default reader.
http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story
The default reader seems to pull content from <div id="navbar"> rather than <div id="content">.
>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story")
>> doc.body
=> "* The T\n* Casinos\n* News by neighborhood\n* Crime\n* Fires\n* Boston Store\n* Photos\n* Boston English\n* Restrooms\n* Blogs"
>> doc = Pismo::Document.new("http://www.universalhub.com/2013/touchy-tabloid-tries-wreck-globe-story", :reader => :cluster)
>> doc.body
=> "Sour grapes at the Herald? With bonus gratuitous quote from some lawyer making accusations with no apparent facts behind them:\nIf he was a reporter on deadline and he's distracted and making phone calls and texting, then that's something that adds to his fault. You're not supposed to be distracted in a cab, you're supposed to focus fully on your job,\" said Douglas Sheff, a Boston personal injury lawyer and president-elect of the Massachusetts Bar Association.\nDoes the esquire have any proof the reporter was on deadline and making phone calls and texting right before the crash? If so, he and the Herald failed to produce it."
(Originally reported in feedbin/support#35)
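
If this page does go into the test corpus, a rough check along these lines could flag the failure. This is only a sketch: looks_like_nav_list? is a hypothetical helper, not part of Pismo, and the 0.8 threshold is an arbitrary assumption. It tests whether an extracted body is mostly "* item" lines, which is what the navbar output in the session above looks like.

```ruby
# Hypothetical helper (not part of Pismo): returns true when most
# non-blank lines of an extracted body are "* item" bullets, which
# suggests a navigation list was picked up instead of article text.
# The 0.8 threshold is an assumption chosen for illustration.
def looks_like_nav_list?(body, threshold: 0.8)
  lines = body.to_s.lines.map(&:strip).reject(&:empty?)
  return false if lines.empty?

  bullets = lines.count { |line| line.start_with?("* ") }
  bullets.to_f / lines.size >= threshold
end

# Default reader output, copied from the session above.
default_body = "* The T\n* Casinos\n* News by neighborhood\n* Crime\n* Fires\n" \
               "* Boston Store\n* Photos\n* Boston English\n* Restrooms\n* Blogs"

# First line of the cluster reader output, copied from the session above.
cluster_excerpt = "Sour grapes at the Herald? With bonus gratuitous quote from " \
                  "some lawyer making accusations with no apparent facts behind them:"

puts looks_like_nav_list?(default_body)    # => true
puts looks_like_nav_list?(cluster_excerpt) # => false
```

In an actual spec the string would come from doc.body for the default reader, and the expectation would be that the helper returns false.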