NReadability
Only getting half the content
Trying NReadability on http://www.propertysearch4u.com.au/buyers-agent-sydney gives me just the following InnerHtml: http://pastebin.com/fuA2QJsH
The first line of content on the actual webpage is "The Buyers Agent Sydney Services that can help Sydneysiders, Interstate investors and Australian Expatriates cost effectively acquire their Sydney property." but the TranscodingResult.ExtractedContent I'm getting starts with "Bidding at auction can be intimidating." which is actually halfway down the page.
Here's how I'm calling it:
var transcodingInput = new WebTranscodingInput(strURL);
var transcoder = new NReadability.NReadabilityWebTranscoder();
var transcodingResult = transcoder.Transcode(transcodingInput);
if (!transcodingResult.ContentExtracted)
    throw new ArgumentNullException();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(transcodingResult.ExtractedContent);
var bodyNode = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
A copy of the original page, in case it changes: http://pastebin.com/20nZ2qRa
The extraction algorithm in NReadability is by no means perfect. There will always be pages that get transcoded incorrectly.
Best Regards, Marek Stój
http://www.marekstoj.com http://www.devmedia.pl
In reply to harvest316's message of 15 April 2013, 14:48.
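Since some pages will be transcoded incorrectly, one possible guard is to check that the extracted content still contains a sentence you know is on the page. The sketch below is only an illustration: `ExtractionCheck` and `ContainsExpectedText` are hypothetical names, not part of NReadability, and the comparison is a plain whitespace-normalized substring match, so it will miss sentences that are split by inline tags or HTML entities.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractionCheck
{
    // Collapse runs of whitespace so line breaks in the HTML do not defeat the comparison.
    private static string Normalize(string text)
    {
        return Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
    }

    // Hypothetical helper (an assumption, not an NReadability API): returns true
    // when the extracted HTML contains the sentence the caller expects to see.
    public static bool ContainsExpectedText(string extractedHtml, string expectedSentence)
    {
        string haystack = Normalize(extractedHtml);
        string needle = Normalize(expectedSentence);
        return needle.Length > 0
            && haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With the calling code from the original report, this could be used as `ExtractionCheck.ContainsExpectedText(transcodingResult.ExtractedContent, "The Buyers Agent Sydney Services")`, falling back to the raw page when it returns false.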