Google Code Exporter
``` Any change on this issue? I am seeing the same thing with parsing NYT pages for my application. I think this might be related to the fact that NYT...
``` ..and I'm using boilerpipe 1.2.0 ``` Original comment by `[email protected]` on 31 Jul 2012 at 3:39
``` Hello, did you manage to solve it on your own? ``` Original comment by `[email protected]` on 10 Sep 2012 at 4:08
``` Hello, not really. I use PHP to analyze the output of boilerpipe and estimate the charset, but ideally I wouldn't have to do that....
``` Found the solution. Here is the Java code needed to fix the special characters issue: public class ExtractMe { public static void main(final String[] args) throws Exception { BufferedReader...
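The code in the comment above is cut off, so the following is only a minimal sketch of one common approach, assuming the garbled special characters come from decoding the page bytes with the wrong charset. The class `CharsetDecodeSketch` and the method `readAll` are hypothetical names used for illustration and are not part of boilerpipe. The idea is to decode the stream with an explicit charset and only then hand the resulting string to the extractor.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of boilerpipe.
final class CharsetDecodeSketch {

    // Reads the whole stream, decoding bytes with the charset given by the caller
    // instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for bytes fetched from a page that is served as UTF-8.
        byte[] page = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset preserves the accented character.
        String right = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        // Decoding the same bytes as ISO-8859-1 yields two wrong characters instead.
        String wrong = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("<p>caf\u00e9</p>"));
        System.out.println(wrong.equals(right));
        // The correctly decoded string is what would then be passed to the extractor.
    }
}
```

In a real application the charset would come from the HTTP `Content-Type` header or the page's meta tag rather than being hard-coded.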
``` Thanks for reporting. This seems to be caused by a bug in NekoHTML 1.9.13. The corresponding stack trace points at "org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)". The problem seems to go away after an update...
``` Thanks for the quick response. As you stated, the problem has gone away with NekoHTML 1.9.15. Below is the list of changes in NekoHTML since version 1.9.13 (which was released on...
``` It looks like the issue is the KeepLargestBlockFilter which rejects every block except the largest. While taking out this filter in the library should return results closer to http://boilerpipe-web.appspot.com/,...
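To illustrate the behaviour described in the comment above, here is a toy model of a "keep only the largest block" step. It is a sketch under stated assumptions: `LargestBlockSketch`, `Block`, and `keepLargestOnly` are hypothetical names, and the code does not use or reproduce boilerpipe's own classes. It only shows why every block except the largest ends up rejected.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these types are hypothetical and are not boilerpipe's classes.
final class LargestBlockSketch {

    static final class Block {
        final String text;
        final int numWords;
        boolean isContent;

        Block(String text, int numWords, boolean isContent) {
            this.text = text;
            this.numWords = numWords;
            this.isContent = isContent;
        }
    }

    // Marks the content block with the most words as content and rejects all others.
    static void keepLargestOnly(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            if (b.isContent && (largest == null || b.numWords > largest.numWords)) {
                largest = b;
            }
        }
        for (Block b : blocks) {
            b.isContent = (b == largest);
        }
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("intro paragraph", 40, true));
        blocks.add(new Block("main body", 300, true));
        blocks.add(new Block("second body section", 120, true));

        keepLargestOnly(blocks);

        // Only "main body" is still marked as content; the other two are dropped,
        // even though they were classified as content beforehand.
        for (Block b : blocks) {
            System.out.println(b.text + " -> " + b.isContent);
        }
    }
}
```

In this model, an article split across several sizeable blocks loses everything but one of them, which matches the symptom of the library returning less text than expected.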
``` I'm also unable to get the same results using the HTMLHighlighter in extraction mode. The web API (http://boilerpipe-web.appspot.com) clearly states that: "This Web Application probably uses a more recent...
``` I have a similar issue. When trying the URL "http://www.hokkaido-np.co.jp/news/donai/424760.html" with ArticleExtractor and "Plain Text" output, the library code did not produce the same results as http://boilerpipe-web.appspot.com/ ``` Original comment by...