clojure-opennlp
How to deal with indeterminacy?
Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:
(WHNP (WP What))
(VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
(. .)))
Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?
OpenNLP allows that information to be retrieved. I'd be happy to try to add it to the library, but I'm a Clojure newbie and can't be sure how long it will take me :)
@jacopofar that would be great if it could be added, perhaps on metadata? (or wherever it fits)
I think it should return a vector of parse trees along with their probabilities (currently it forces the value to 1).
I made a first attempt at allowing this to be specified here, but have yet to write a test for it.
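For reference, a minimal sketch of what asking for several parses could look like through plain Java interop. This is not the attempt mentioned above and not part of clojure-opennlp: the namespace and the name make-top-k-parser are made up for illustration, and it assumes the OpenNLP jar is on the classpath and that ParserTool/parseLine accepts the number of parses as its last argument. Parse.getProb is returned as-is; for the chunking parser it appears to be a log-probability rather than a value between 0 and 1.

```clojure
(ns example.top-k-parses ; hypothetical namespace, for illustration only
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.cmdline.parser ParserTool)
           (opennlp.tools.parser Parse ParserFactory ParserModel)))

(defn make-top-k-parser
  "Hypothetical helper. Returns a function of a whitespace-tokenized
  sentence and k that yields up to k parses as maps with :parse (the
  bracketed tree as a string) and :prob (whatever Parse.getProb reports)."
  [modelfile]
  (let [model (with-open [in (io/input-stream modelfile)]
                (ParserModel. in))
        parser (ParserFactory/create model)]
    (fn [sentence k]
      (mapv (fn [^Parse p]
              (let [sb (StringBuffer.)]
                (.show p sb) ; writes the bracketed tree into sb
                {:parse (str sb) :prob (.getProb p)}))
            (ParserTool/parseLine ^String sentence parser (int k))))))

(comment
  ;; Needs a parser model on disk; the path is an assumption.
  (def top-k-parse (make-top-k-parser "models/en-parser-chunking.bin"))
  (top-k-parse "What can happen in a second ." 5))
```

If the NN reading of "second" is among the parser's candidates, it should show up in one of the lower-ranked trees.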
Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing.
(ns com.owoga.prhyme.util.nlp
  (:require [opennlp.nlp :as nlp]
            [opennlp.treebank :as tb]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
  (:import (opennlp.tools.postag POSModel POSTaggerME)))
(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
;;;; The tagger that opennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
  (tags [this sent])
  (probs [this])
  (top-k-sequences [this sent]))
(defn make-pos-tagger
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (POSModel. model-stream))
        tagger (POSTaggerME. model)]
    (reify Tagger
      ;; Pair each token with its most likely tag.
      (tags [_ tokens]
        (let [token-array (into-array String tokens)]
          (map vector tokens (.tag tagger ^"[Ljava.lang.String;" token-array))))
      ;; Probabilities for the most recently tagged sentence.
      (probs [_] (seq (.probs tagger)))
      ;; Candidate tag sequences, each carrying per-token probabilities.
      (top-k-sequences [_ tokens]
        (let [token-array (into-array String tokens)]
          (.topKSequences tagger ^"[Ljava.lang.String;" token-array))))))
(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
(let [phrase "The feeling hurts."]
  (map (juxt #(.getOutcomes %)
             #(map float (.getProbs %)))
       (top-k-sequences prhyme-pos-tagger (tokenize phrase))))
;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
;;     [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
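To get from that output to "which tokens have a known alternative", the candidate sequences can be folded into a per-token map of tags. A rough sketch in plain Clojure follows; tag-alternatives, ambiguous-tokens and the example-* vars are made-up names, and the token vector is assumed to be what the tokenizer returns for the phrase. Each probability in a sequence depends on the tags chosen earlier in that same sequence, so keeping the highest value seen per tag is a simplification, not a true marginal.

```clojure
;; Data copied from the output above; the token vector is an assumption.
(def example-tokens ["The" "feeling" "hurts" "."])

(def example-sequences
  [[["DT" "NN" "VBZ" "."] [0.9758878 0.93964833 0.7375927 0.95285994]]
   [["DT" "VBG" "VBZ" "."] [0.9758878 0.03690145 0.27251 0.9286113]]])

(defn tag-alternatives
  "Hypothetical helper. For each token, returns [token {tag prob}] where
  prob is the highest probability seen for that tag at that position
  across all candidate sequences."
  [tokens sequences]
  (map-indexed
   (fn [i token]
     [token
      (reduce (fn [acc [tags probs]]
                (update acc (nth tags i) (fnil max 0.0) (nth probs i)))
              {}
              sequences)])
   tokens))

(defn ambiguous-tokens
  "Keeps only the tokens for which more than one tag was proposed."
  [tokens sequences]
  (filter (fn [[_ alternatives]] (> (count alternatives) 1))
          (tag-alternatives tokens sequences)))

(ambiguous-tokens example-tokens example-sequences)
;; => (["feeling" {"NN" 0.93964833, "VBG" 0.03690145}])
```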