cogcomp-nlp icon indicating copy to clipboard operation
cogcomp-nlp copied to clipboard

Tokenizer breaks on empty strings, carriage returns ...

Open nitishgupta opened this issue 9 years ago • 13 comments

I experienced that our latest tokenizer breaks on empty string "" and "\n" at the least. It produces an 'ArrayOutOfBound Exception'

nitishgupta avatar Nov 21 '16 04:11 nitishgupta

So as you can see in #282, I added tests which contain empty string and new line, and the tests pass. So unless you give an explicit way to reproduce the exception, we won't be able to solver this.

danyaljj avatar Nov 23 '16 00:11 danyaljj

From #282 I learned that creating TextAnnotation using TextAnnotationBuilder fails on string that are empty, just carriage return, space etc. Code snippet that fails and gives 'java.lang.ArrayIndexOutOfBoundsException: -1' :

String text = " "; // or "\n" or "" 
TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new IllinoisTokenizer()); 
TextAnnotation ta = tab.createTextAnnotation(text);

nitishgupta avatar Nov 23 '16 06:11 nitishgupta

Thanks Daniel and Mark!

nitishgupta avatar Nov 25 '16 07:11 nitishgupta

Re-opening the issue since apparently it still fails when receiving carriage returns (according to @nitishgupta )

danyaljj avatar Jul 10 '18 16:07 danyaljj

This fails when I do this in Python. I haven't checked in Java recently. Could ccg_nlpy be using an old version?

nitishgupta avatar Jul 10 '18 17:07 nitishgupta

It is using a slightly older version; but very recent: https://github.com/CogComp/cogcomp-nlpy/blob/9e733c0d936cbd923a9e6702e36bc6d64ab887ad/ccg_nlpy/download.py#L14

Is there a way to verify that it is a purely a Java issue (and Python side is not to blame)?

danyaljj avatar Jul 10 '18 18:07 danyaljj

I can do that and update here.

nitishgupta avatar Jul 10 '18 18:07 nitishgupta

Space and new-line character both break the TextAnnotationBuilder.

TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer());
tab.createTextAnnotation("", "", "\n");
tab.createTextAnnotation("", "", " ");

Error code:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:154)
	at edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent.<init>(Constituent.java:108)
	at edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView.addSpanLabel(SpanLabelView.java:95)
	at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation.<init>(TextAnnotation.java:88)
	at edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder.createTextAnnotation(TokenizerTextAnnotationBuilder.java:139)

nitishgupta avatar Aug 06 '18 20:08 nitishgupta

I don't think a TextAnnotation object should be created for such cases, because no annotations can be generated for them (at least, not with our current tooling). However, the failure should result in an explicit exception to this effect rather than resulting in a bounds exception.

mssammon avatar Aug 06 '18 20:08 mssammon

@nitishgupta what is the use case? would a TextAnnotation with no non-whitespace text be useful in some way?

mssammon avatar Aug 06 '18 20:08 mssammon

Having a TextAnnotation for these edge cases reduces the burden on the user to write checking-code. IMO, a TextAnnotation should be able to be created for any String object. Is that not the general consensus?

On Mon, Aug 6, 2018 at 4:39 PM Mark Sammons [email protected] wrote:

@nitishgupta https://github.com/nitishgupta what is the use case? would a TextAnnotation with no non-whitespace text be useful in some way?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CogComp/cogcomp-nlp/issues/278#issuecomment-410845845, or mute the thread https://github.com/notifications/unsubscribe-auth/AF71bUKYHfQFz1q8gr075pUqXcmIXetwks5uOKmQgaJpZM4K3z_v .

nitishgupta avatar Aug 06 '18 20:08 nitishgupta

I disagree. User of the resulting TextAnnotation will likely have to check for empty views -- or, the unlucky client who is using their code as an intermediary will. How is this better?

mssammon avatar Aug 06 '18 20:08 mssammon

Okay. I get your point. Thanks for the time.

On Mon, Aug 6, 2018 at 4:49 PM Mark Sammons [email protected] wrote:

I disagree. User of the resulting TextAnnotation will likely have to check for empty views -- or, the unlucky client who is using their code as an intermediary will. How is this better?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CogComp/cogcomp-nlp/issues/278#issuecomment-410848394, or mute the thread https://github.com/notifications/unsubscribe-auth/AF71bdE6YperycHTeLa6ac2dO6xvWAVgks5uOKvFgaJpZM4K3z_v .

nitishgupta avatar Aug 06 '18 20:08 nitishgupta