NReadability

Only getting half the content

Open harvest316 opened this issue 11 years ago • 1 comment

Trying NReadability on http://www.propertysearch4u.com.au/buyers-agent-sydney gives me just the following InnerHtml: http://pastebin.com/fuA2QJsH

The first line of content on the actual webpage is "The Buyers Agent Sydney Services that can help Sydneysiders, Interstate investors and Australian Expatriates cost effectively acquire their Sydney property." but the TranscodingResult.ExtractedContent I'm getting starts with "Bidding at auction can be intimidating." which is actually halfway down the page.

Here's how I'm calling it:

var transcodingInput = new WebTranscodingInput(strURL);
var transcoder = new NReadability.NReadabilityWebTranscoder();
var transcodingResult = transcoder.Transcode(transcodingInput);
if (!transcodingResult.ContentExtracted)
    throw new ArgumentNullException();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(transcodingResult.ExtractedContent);
var bodyNode = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");

A copy of the original page, in case it changes: http://pastebin.com/20nZ2qRa

harvest316, Apr 15 '13 12:04

The extraction algorithm in NReadability is by no means perfect. There will always be pages that get transcoded incorrectly.

Best Regards, Marek Stój

http://www.marekstoj.com http://www.devmedia.pl

marek-stoj, Apr 16 '13 04:04