boilerpipe
Incorrect characters in Extractor output
I have a one-liner that tries to extract a Hungarian site containing special
characters like "ő" and "ű".
Command line query is this:
# java de/l3s/boilerpipe/demo/ExtractMe http://sportgeza.hu/2012/london/cikkek/nem_schmitt_pal_hagyta_jova_a_rossz_himnuszt
And here's my code:
# cat de/l3s/boilerpipe/demo/ExtractMe.java
package de.l3s.boilerpipe.demo;

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ExtractMe {
    public static void main(final String[] args) throws Exception {
        final URL url = new URL(args[0]);
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}
(Partial) extracted content:
... megfelel? himnuszt játszák a magyar gy?ztesek tiszteletére, akikb?l
remélik, hogy minél több lesz...
In the extracted text the "?" characters should be "ő", but in the final
output all I get is the byte 3F (hex), which is the question mark.
I'm under
#uname FreeBSD pdfgen 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49
UTC 2010 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
# java -version
java version "1.6.0_07"
Diablo Java(TM) SE Runtime Environment (build 1.6.0_07-b02)
Diablo Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
I've been working on a solution for days, but I can't find the reason why it
doesn't work :/
BTW, curl outputs the characters correctly when called on a UTF-8 terminal,
but boilerpipe fails to display even those special characters, which were fine
in the curl output.
I'd appreciate any help/ideas, best
M
Original issue reported on code.google.com by [email protected]
on 31 Jul 2012 at 3:38
Hello, did you manage to solve it on your own?
Original comment by [email protected]
on 10 Sep 2012 at 4:08
Hello, not really. I use PHP to analyze the output of boilerpipe and guess
the charset, but ideally I wouldn't have to do that.
I found a shell wrapper for boilerpipe though which seemed to work:
https://github.com/theneubeck/boilerpipe-server
It didn't fit my needs, so I decided to use a PHP middle layer, but some might
find it helpful.
Original comment by [email protected]
on 10 Sep 2012 at 6:36
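Found the solution:
Here is the Java code needed to fix the special characters issue. It reads the
HTML from standard input as UTF-8 and prints the result as UTF-8:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ExtractMe {
    public static void main(final String[] args) throws Exception {
        // Decode stdin explicitly as UTF-8 instead of the platform default.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        // Encode stdout explicitly as UTF-8 as well.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(ArticleExtractor.INSTANCE.getText(in));
    }
}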
Original comment by [email protected]
on 18 Sep 2013 at 1:20
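For anyone wondering where the 3F bytes come from, here is a minimal sketch. It assumes the JVM's default charset on the reporter's machine was an ASCII-like one that cannot represent "ő"; the thread does not say what it actually was. A PrintStream replaces any character its charset cannot encode with "?", so "ő" comes out as 3F, while a PrintStream created with an explicit "UTF-8" charset (as in the fix above) writes the two bytes C5 91. The class EncodingDemo and its helper methods are made up for this illustration and are not part of boilerpipe.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    // Returns the bytes a PrintStream emits for `text` under the given charset.
    static byte[] printedBytes(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    // Formats bytes as upper-case hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // "ő" written as a Unicode escape so the source file encoding does not matter.
        String text = "\u0151";
        // US-ASCII stands in for the assumed default charset; it cannot encode "ő".
        System.out.println("US-ASCII: " + toHex(printedBytes(text, "US-ASCII"))); // 3F
        System.out.println("UTF-8:    " + toHex(printedBytes(text, "UTF-8")));    // C591
    }
}
```

If the default charset is the cause, starting the JVM with -Dfile.encoding=UTF-8 is another commonly used workaround, but setting the charset explicitly in code, as the fix above does, does not depend on how the JVM is launched.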