
Pipeline (Tokenizer) has issues with non-UTF-8 characters

Open • danyaljj opened this issue 7 years ago • 1 comment

I have seen the tokenizer fail on text containing non-UTF-8 characters (e.g. "�" below):

val text = "Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode \"replacement character\" (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. The Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and the SIL International's Unicode Fallback font will display a box showing the hexadecimal scalar value of the character."

// Fails: annotating the raw text
// AnnotationUtils.pipelineServerPOSTagger.annotate(text)

// Workaround: re-encode the string through Windows-1252 before annotating
val text2 = new String(text.getBytes("Windows-1252"), "UTF-8")
println(text2)
AnnotationUtils.pipelineServerPOSTagger.annotate(text2) // works
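
To make it clearer what the workaround above does to the input, here is a minimal sketch of the same round trip with explicit `Charset` objects. The object and method names (`RoundTripDemo`, `roundTrip`) are made up for illustration and are not part of cogcomp-nlp. It assumes a JVM where the `windows-1252` charset is available.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical demo object; not part of cogcomp-nlp.
object RoundTripDemo {
  private val cp1252: Charset = Charset.forName("windows-1252")

  // Encode to Windows-1252 bytes, then decode those bytes as UTF-8.
  // Characters that Windows-1252 cannot represent are written as the
  // encoder's replacement byte ('?') during the encode step.
  def roundTrip(s: String): String =
    new String(s.getBytes(cp1252), StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    println(roundTrip("replacement character (U+FFFD, \uFFFD)"))
    println(roundTrip("caf\u00E9"))
  }
}
```

With this sketch, `roundTrip("a\uFFFDb")` gives `"a?b"`, which is presumably why the tokenizer accepts the result. The round trip is lossy, though: a string such as `"caf\u00E9"` does not come back unchanged, because the single Windows-1252 byte for the accented letter is not valid UTF-8 on its own. So the workaround looks safe only for text that is otherwise plain ASCII.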

danyaljj, Dec 18 '17 01:12

We have some cleanup code for this kind of problem:

https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/TextCleanerStringTransformation.java
https://github.com/CogComp/cogcomp-nlp/blob/master/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringTransformationCleanup.java

If these don't cover such cases, this is where the fixes should be added. We could, by default, run some cleanup as part of the pipeline main(), but I'm open to suggestions.
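
As a rough illustration of what a default cleanup step might look like, here is a small self-contained sketch. It does not use or reproduce the two classes linked above; `CleanupSketch` and `replaceUndecodable` are hypothetical names. It assumes that substituting one character for another, so that the string length and character offsets stay the same, is acceptable for the pipeline.

```scala
// Hypothetical helper; not part of cogcomp-nlp.
object CleanupSketch {
  // Replace U+FFFD and non-whitespace control characters with a
  // substitute character. The mapping is one-to-one, so the output
  // has the same length as the input.
  def replaceUndecodable(text: String, substitute: Char = ' '): String =
    text.map { c =>
      val isBadControl = Character.isISOControl(c) && !Character.isWhitespace(c)
      if (c == '\uFFFD' || isBadControl) substitute else c
    }
}
```

A caller could then run `CleanupSketch.replaceUndecodable(text)` before handing the text to the annotator. Unlike a round trip through another encoding, this leaves accented and other non-ASCII characters untouched.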

mssammon, Apr 16 '18 18:04