clojure-opennlp
The chunker needs punctuation to work properly
Using the definitions of `tokenize`, `pos-tag`, and `chunker` from the readme, and the 1.5.1 versions of the model files, the following behaviour is observed:
(-> "I am looking for a good way to annotate this English text."
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English" "text"])
;; cf. the same operation, when the text is not full-stop terminated:
(-> "I am looking for a good way to annotate this English text"
tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English"])
The `pos-tag` output seems correct, however.
Yeah, this is a known issue, documented at https://github.com/dakrone/clojure-opennlp#known-issues. It's something that the OpenNLP library does, not clojure-opennlp.
Seems fair - thanks for the reply. And sorry for not spotting that disclaimer.
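For anyone hitting the same thing, one possible workaround is to make sure the text ends in terminal punctuation before it reaches the tokenizer. The sketch below is only an illustration: `ensure-terminated` is a hypothetical helper, not part of clojure-opennlp, and it assumes that `.`, `!` and `?` are the only sentence terminators that matter.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of clojure-opennlp): append a full stop
;; unless the text, ignoring trailing whitespace, already ends in . ! or ?
(defn ensure-terminated
  [text]
  (let [trimmed (str/trimr text)]
    (if (re-find #"[.!?]$" trimmed)
      trimmed
      (str trimmed "."))))

;; Intended use, with tokenize, pos-tag and chunker defined as in the readme:
;; (-> "I am looking for a good way to annotate this English text"
;;     ensure-terminated tokenize pos-tag chunker phrases)
```

This only patches the input; it does not change how the chunker itself treats the last token.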
Would be nice if you could report this to OpenNLP, so it can be fixed in the next version.
I think OpenNLP 1.7.2, the version this project is using right now, has fixed the punctuation problem. So maybe we can include the end punctuation?
Also, I notice that OpenNLP produces the phrase tag "O", whereas in clojure-opennlp "O" is not incorporated.
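To make the "O" point concrete, here is a minimal sketch of grouping tokens by their chunk tags while keeping "O" tokens as their own entries. It assumes the chunker emits BIO-style tags such as "B-NP", "I-NP" and "O"; `group-chunks` is a hypothetical name, and this is not how clojure-opennlp builds its phrases.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch, not the clojure-opennlp implementation.
;; Groups tokens into phrases from BIO chunk tags ("B-NP", "I-NP", "O").
;; "O" tokens are kept as single-token phrases instead of being dropped.
(defn group-chunks
  [tokens tags]
  (reduce (fn [acc [token tag]]
            (if (and (str/starts-with? tag "I-")
                     (seq acc)
                     (not= "O" (:tag (peek acc))))
              ;; "I-" continues the phrase that is currently open
              (update-in acc [(dec (count acc)) :phrase] conj token)
              ;; "B-", "O", or an "I-" with nothing to continue starts a new entry
              (conj acc {:phrase [token]
                         :tag    (if (= tag "O") "O" (subs tag 2))})))
          []
          (map vector tokens tags)))

(group-chunks ["I" "am" "looking" "."]
              ["B-NP" "B-VP" "I-VP" "O"])
;; => [{:phrase ["I"], :tag "NP"}
;;     {:phrase ["am" "looking"], :tag "VP"}
;;     {:phrase ["."], :tag "O"}]
```

With a grouping like this the final full stop shows up as an "O" entry, which callers can filter out if they only want tagged phrases.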