clojure-opennlp
How to deal with indeterminacy?
Evaluating (treebank-parser ["What can happen in a second ."])
using the set-up in the README file here, I get the following parse:
(TOP
  (SBARQ
    (WHNP (WP What))
    (SQ
      (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
    (. .)))
Actually, I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report the known alternative(s)?
OpenNLP does allow that information to be retrieved. I'd be happy to try adding it to the library, but I'm a Clojure newbie and can't be sure how long it will take me :)
@jacopofar it would be great if that could be added, perhaps as metadata (or wherever it fits)?
I think it should return a vector of parse trees along with their probabilities (currently it forces the value to 1).
I made a first attempt at allowing this to be specified here, but I have yet to write a test for it.
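For reference, the underlying Java API can already return more than one parse per sentence, so one way to get at the alternatives is plain interop. This is only a rough sketch and not how clojure-opennlp does it: the namespace `example.top-k-parses` and the function `make-top-k-parser` are hypothetical names, the model path in the usage note is a placeholder, and the input is assumed to be whitespace-tokenized as in the README examples.

```clojure
(ns example.top-k-parses
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.parser Parse ParserModel ParserFactory)
           (opennlp.tools.cmdline.parser ParserTool)))

(defn make-top-k-parser
  "Hypothetical helper: returns a fn of [sentence k] that yields up to k
  parses, each as a map holding the bracketed tree string and the
  log-probability OpenNLP assigned to that parse."
  [modelfile]
  (let [model  (with-open [in (io/input-stream modelfile)]
                 (ParserModel. in))
        parser (ParserFactory/create model)]
    (fn [sentence k]
      ;; ParserTool/parseLine splits the sentence on whitespace and asks
      ;; the parser for up to k candidate parses.
      (mapv (fn [^Parse p]
              (let [sb (StringBuffer.)]
                (.show p sb) ; writes the bracketed tree into sb
                {:tree (str sb) :log-prob (.getProb p)}))
            (ParserTool/parseLine sentence parser (int k))))))
```

Something like `((make-top-k-parser (io/resource "models/en-parser-chunking.bin")) "What can happen in a second ." 3)` should then return up to three candidate trees to compare. Whether the NN reading shows up among them depends on the model and the beam size.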
Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing.
(ns com.owoga.prhyme.util.nlp
  (:require [opennlp.nlp :as nlp]
            [opennlp.treebank :as tb]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
  (:import (opennlp.tools.postag POSModel POSTaggerME)))
(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
;;;; The tagger that opennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
  (tags [this sent])
  (probs [this])
  (top-k-sequences [this sent]))
(defn make-pos-tagger
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (POSModel. model-stream))
        tagger (POSTaggerME. model)]
    (reify Tagger
      (tags [_ tokens]
        (let [token-array (into-array String tokens)]
          (map vector tokens (.tag tagger ^"[Ljava.lang.String;" token-array))))
      (probs [_] (seq (.probs tagger)))
      (top-k-sequences [_ tokens]
        (let [token-array (into-array String tokens)]
          (.topKSequences tagger ^"[Ljava.lang.String;" token-array))))))
(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))
(comment
  (let [phrase "The feeling hurts."]
    (map (juxt #(.getOutcomes %)
               #(map float (.getProbs %)))
         (top-k-sequences prhyme-pos-tagger (tokenize phrase))))
  ;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
  ;;     [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
  )
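If you want a single score per alternative rather than per-token numbers, one option is to combine them yourself. The sketch below is plain Clojure with no OpenNLP dependency; it assumes the per-token probabilities can simply be multiplied together (done here as a sum of logs), and `example-sequences`, `joint-log-prob`, and `rank-sequences` are names made up for illustration. The data is copied from the output above, deliberately listed in the "wrong" order.

```clojure
;; [tags per-token-probabilities] pairs, copied from the REPL output above.
(def example-sequences
  [[["DT" "VBG" "VBZ" "."] [0.9758878 0.03690145 0.27251 0.9286113]]
   [["DT" "NN" "VBZ" "."] [0.9758878 0.93964833 0.7375927 0.95285994]]])

(defn joint-log-prob
  "Sum of log-probabilities, i.e. the log of the product of the per-token
  probabilities. Summing logs avoids underflow on long sentences."
  [probs]
  (reduce + (map #(Math/log (double %)) probs)))

(defn rank-sequences
  "Sorts [tags probs] pairs from most to least likely overall."
  [sequences]
  (sort-by (fn [[_ probs]] (- (joint-log-prob probs))) sequences))

;; Tag sequences only, best first.
(map first (rank-sequences example-sequences))
```

With this data the DT NN VBZ . reading comes out ahead of the DT VBG VBZ . one, since its probabilities are at least as high at every position.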