cogcomp-nlp
Pipeline (Tokenizer) has issues with non-UTF-8 characters
I have seen the tokenizer fail on text containing non-UTF-8 characters (e.g. the "�" below):
val text = "Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode \"replacement character\" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. The Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and the SIL International's Unicode Fallback font will display a box showing the hexadecimal scalar value of the character."
// AnnotationUtils.pipelineServerPOSTagger.annotate(text) // <----- this doesn't work.
val text2 = new String(text.getBytes("Windows-1252"), "UTF-8")
println(text2)
AnnotationUtils.pipelineServerPOSTagger.annotate(text2) // <----- this does work.
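For anyone reading along, here is a rough sketch of what that round trip seems to be doing. Windows-1252 has no mapping for U+FFFD, so on the JVMs I'm aware of `getBytes` substitutes `?` for it (the Javadoc calls this case unspecified for the charset-name overload, so treat that as an assumption). The flip side is that characters Windows-1252 can encode, such as "é", come out as single bytes that are not valid UTF-8 and are likely to be decoded as "�" themselves. `ReplacementCharDemo`, `roundTrip` and `dropReplacementChar` are names made up for this illustration; they are not part of cogcomp-nlp.

```scala
object ReplacementCharDemo {
  // The workaround from the snippet above: encode as Windows-1252,
  // then decode the resulting bytes as UTF-8. Pure ASCII text is unchanged.
  def roundTrip(s: String): String =
    new String(s.getBytes("Windows-1252"), "UTF-8")

  // A more explicit alternative (hypothetical helper): replace only U+FFFD
  // and leave every other character, including accented ones, untouched.
  def dropReplacementChar(s: String, substitute: String = "?"): String =
    s.replace("\uFFFD", substitute)
}
```

Replacing U+FFFD directly keeps the rest of the text intact and, with a one-character substitute, keeps character offsets the same, which may matter for downstream annotations.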
We have some cleanup code for this kind of problem:
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/TextCleanerStringTransformation.java
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringTransformationCleanup.java
If these don't cover such cases, this is where the fixes should be added. We could, by default, run some cleanup as part of the pipeline main(), but I'm open to suggestions.
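To make the "run cleanup by default" idea concrete, one possible shape is a wrapper that applies a list of cleanup steps before handing the text to the annotator. This is only a sketch: `CleanThenAnnotate`, `Cleaner`, `withCleanup` and the two example cleaners are hypothetical and do not use the actual cleanup classes linked above, whose methods would be plugged in as additional steps.

```scala
object CleanThenAnnotate {
  // Hypothetical type for a single cleanup step.
  type Cleaner = String => String

  // Replace the Unicode replacement character (U+FFFD) with "?".
  val replaceUnknownChar: Cleaner = _.replace("\uFFFD", "?")

  // Drop control characters except tab, newline and carriage return.
  // Note: removing characters shifts the offsets of everything after them.
  val stripControlChars: Cleaner =
    _.filter(c => !c.isControl || c == '\n' || c == '\t' || c == '\r')

  // Wrap any annotate function so the cleaners run, in order, first.
  def withCleanup[A](cleaners: Seq[Cleaner])(annotate: String => A): String => A =
    text => annotate(cleaners.foldLeft(text)((t, clean) => clean(t)))
}

// Possible usage with the annotator from the report (not compiled here):
// val safeAnnotate = CleanThenAnnotate.withCleanup(
//   Seq(CleanThenAnnotate.replaceUnknownChar)
// )(t => AnnotationUtils.pipelineServerPOSTagger.annotate(t))
```

Keeping the cleaners as a plain list would let the default be overridden by callers who need the raw text or exact offsets.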