cogcomp-nlp

Evaluating tokenizer?

Open danyaljj opened this issue 7 years ago • 5 comments

@mssammon We briefly discussed having tests or evaluations for the tokenizer. Any thoughts on how hard or easy that might be? If we have the data, I can take a look.

danyaljj avatar Jun 19 '17 19:06 danyaljj

The reason that I'm asking for this is that @Slash0BZ is trying to apply some fixes and I want to make sure we're not breaking anything.

danyaljj avatar Jun 26 '17 18:06 danyaljj

The last version of the tokenizer that had tests (for which we used the MASC corpus) is here: https://gitlab-beta.engr.illinois.edu/cogcomp/illinois-tokenizer/tree/master
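For reference, one common way to score a tokenizer against gold data is to compare token character spans and report precision, recall, and F1. The sketch below is only an illustration of that idea, written in Python for brevity; it is not the test code from the repository linked above, and the function names `token_spans` and `span_prf` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def token_spans(text: str, tokens: List[str]) -> List[Span]:
    """Align tokens to (start, end) character offsets, scanning left to right."""
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans


def span_prf(gold: List[Span], predicted: List[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over exactly matching token spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the predicted tokenization wrongly splits "Dr."
text = "Dr. Smith arrived."
gold = token_spans(text, ["Dr.", "Smith", "arrived", "."])
pred = token_spans(text, ["Dr", ".", "Smith", "arrived", "."])
precision, recall, f1 = span_prf(gold, pred)
```

Scoring on character spans rather than token strings means a regression in one place (such as splitting an abbreviation) is penalized locally and does not shift the alignment of every later token.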

mssammon avatar Jun 26 '17 19:06 mssammon

@mssammon Is this data public, or proprietary?

@Slash0BZ could you monitor the progress on this data, while you're fixing the tokenizer issues you had?

danyaljj avatar Jun 26 '17 19:06 danyaljj

My current approach is adding more exceptions in TokenizerStateMachine at the part where it checks if a "." character means the end of a sentence. I will monitor this progress as mentioned above.
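To make the idea concrete, here is a rough sketch of an exception-based check for whether a "." ends a sentence. It is written in Python for brevity and is not the actual TokenizerStateMachine logic; the abbreviation list and the function name `is_sentence_end` are assumptions made up for illustration.

```python
# Illustrative exception list (an assumption, not the list used in the project).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc"}


def is_sentence_end(text: str, i: int) -> bool:
    """Return True if the '.' at index i plausibly ends a sentence."""
    if text[i] != ".":
        raise ValueError("index i must point at a '.' character")

    # Find the whitespace-delimited word immediately before the period.
    j = i
    while j > 0 and not text[j - 1].isspace():
        j -= 1
    word = text[j:i].lower()

    # Exception 1: known abbreviations such as "Dr." or "e.g."
    if word in ABBREVIATIONS:
        return False

    # Exception 2: a period between two digits, as in "3.50"
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False

    # Otherwise treat the period as a boundary only at end of text
    # or when followed by whitespace.
    return i + 1 == len(text) or text[i + 1].isspace()


sample = "Dr. Smith paid 3.50 dollars. He left."
```

A pure exception list has known limits: an abbreviation that really does end a sentence (for example a trailing "etc.") is never treated as a boundary here, which is one reason an evaluation set is useful for checking that new exceptions do not break other cases.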

Slash0BZ avatar Jun 26 '17 22:06 Slash0BZ

On a related note: you could check whether using the AnnotatorFixer for ACE helps in correcting sentence boundaries. You can use the Entity view to fix sentence boundaries. I just realized that @mssammon added this as part of the XMLTextAnnotation changes. Might help.

bhargav avatar Jun 28 '17 04:06 bhargav