clojure-opennlp

The chunker needs punctuation to work properly

Open alexkalderimis opened this issue 10 years ago • 4 comments

Using the definitions of tokenize, pos-tag, and chunker from the readme, and 1.5.1 versions of the model files, the following behaviour is observed:

 (-> "I am looking for a good way to annotate this English text."
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English" "text"])

;; cf. the same operation, when the text is not full-stop terminated:
 (-> "I am looking for a good way to annotate this English text"
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English"])

The pos-tag output seems correct, however.

alexkalderimis, Sep 13 '13 15:09

Yea, this is a known issue, documented here: https://github.com/dakrone/clojure-opennlp#known-issues. It's something that the OpenNLP library does, not clojure-opennlp.

dakrone, Sep 13 '13 15:09

Seems fair - thanks for the reply. And sorry for not spotting that disclaimer.

alexkalderimis, Sep 13 '13 15:09

Would be nice if you could report this to OpenNLP, so it can be fixed in the next version.

kottmann, Sep 18 '13 13:09

I think OpenNLP 1.7.2, the version this project is using right now, has fixed the punctuation problem. So maybe we can include the end punctuation?

Also, I notice that OpenNLP produces the phrase tag "O", whereas in clojure-opennlp "O" is not incorporated.

wenxijuji, Jun 27 '17 21:06