clojure-opennlp

How to deal with indeterminacy?

Open holtzermann17 opened this issue 8 years ago • 4 comments

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

holtzermann17 Jan 14 '16 15:01

OpenNLP lets you retrieve that information. I'd be happy to try adding it to the library, but I'm a Clojure newbie and can't be sure how long it will take me :)

jacopofar May 19 '16 15:05

@jacopofar it would be great if that could be added, perhaps as metadata (or wherever it fits)?

dakrone May 19 '16 15:05

I think it should return a vector of parse trees along with their probabilities (currently it forces the value to 1).

I made a first attempt at making this configurable here, but I have yet to write a test for it.

jacopofar May 19 '16 16:05
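
For readers who want to try the idea from the comment above, here is a minimal sketch that bypasses clojure-opennlp and calls OpenNLP's Java parser API directly to get the k best parses with their scores. The namespace `example.top-k-parses` and the function `make-top-k-parser` are hypothetical names, not part of clojure-opennlp, and the model path in the usage note is an assumption. `ParserTool/parseLine` splits its input on whitespace, so the sentence has to be pre-tokenized as in the original question.

```clojure
(ns example.top-k-parses          ; hypothetical namespace, for illustration only
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.cmdline.parser ParserTool)
           (opennlp.tools.parser Parse Parser ParserFactory ParserModel)))

(defn make-top-k-parser
  "Loads a parser model and returns a function of [line k] that yields a
  vector of {:parse ... :prob ...} maps, one per alternative parse."
  [modelfile]
  (let [model (with-open [in (io/input-stream modelfile)]
                (ParserModel. in))
        ^Parser parser (ParserFactory/create model)]
    (fn [^String line k]
      (mapv (fn [^Parse p]
              (let [sb (StringBuffer.)]
                ;; .show writes the bracketed tree into the buffer
                (.show p sb)
                {:parse (str sb)
                 ;; getProb is the parser's score for the whole tree; as far
                 ;; as I understand the Javadoc it is a log-probability
                 :prob  (.getProb p)}))
            (ParserTool/parseLine line parser (int k))))))
```

Assuming a parser model is available at `models/en-parser-chunking.bin`, usage would look like `((make-top-k-parser "models/en-parser-chunking.bin") "What can happen in a second ." 3)`. Whether the NN reading of "second" shows up among the alternatives depends on the model and on k. OpenNLP's parser instances are not thread-safe, so the returned function should not be shared across threads.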

Here's some code I added to one of my projects that gives access to the probabilities for part-of-speech tagging. I imagine something similar could be done for parsing.

(ns com.owoga.prhyme.util.nlp
  (:require [opennlp.nlp :as nlp]
            [opennlp.treebank :as tb]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [clojure.zip :as zip]
            [com.owoga.prhyme.nlp.tag-sets.treebank-ii :as tb2])
  (:import (opennlp.tools.postag POSModel POSTaggerME)))

(def tokenize (nlp/make-tokenizer (io/resource "models/en-token.bin")))
(def get-sentences (nlp/make-sentence-detector (io/resource "models/en-sent.bin")))
(def parse (tb/make-treebank-parser (io/resource "models/en-parser-chunking.bin")))
(def pos-tagger (nlp/make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

;;;; The tagger that opennlp.nlp gives us doesn't provide access
;;;; to the probabilities of all tags. It gives us the probability of the
;;;; top tag through some metadata. But to get probs for all tags, we
;;;; need to do something like implement our own tagger.
(defprotocol Tagger
  (tags [this sent])
  (probs [this])
  (top-k-sequences [this sent]))

(defn make-pos-tagger
  [modelfile]
  (let [model (with-open [model-stream (io/input-stream modelfile)]
                (POSModel. model-stream))
        tagger (POSTaggerME. model)]
    (reify Tagger
      (tags [_ tokens]
        (let [token-array (into-array String tokens)]
          (map vector tokens (.tag tagger ^"[Ljava.lang.String;" token-array))))
      (probs [_] (seq (.probs tagger)))
      (top-k-sequences [_ tokens]
        (let [token-array (into-array String tokens)]
          (.topKSequences tagger ^"[Ljava.lang.String;" token-array))))))

(def prhyme-pos-tagger (make-pos-tagger (io/resource "models/en-pos-maxent.bin")))

(comment
  (let [phrase "The feeling hurts."]
    (map (juxt #(.getOutcomes %)
               #(map float (.getProbs %)))
         (top-k-sequences prhyme-pos-tagger (tokenize phrase))))
  ;; => ([["DT" "NN" "VBZ" "."] (0.9758878 0.93964833 0.7375927 0.95285994)]
  ;;     [["DT" "VBG" "VBZ" "."] (0.9758878 0.03690145 0.27251 0.9286113)])
  )

eihli Nov 04 '20 19:11
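
Given output shaped like the commented result in the snippet above, the positions where the tagger is undecided can be picked out with plain Clojure. The sketch below is self-contained: `top-k` is copied from that commented result, and `ambiguous-positions` is a hypothetical helper name, not part of any library.

```clojure
;; Data copied from the commented result above: one [tags probs] pair per sequence.
(def top-k
  [[["DT" "NN" "VBZ" "."] [0.9758878 0.93964833 0.7375927 0.95285994]]
   [["DT" "VBG" "VBZ" "."] [0.9758878 0.03690145 0.27251 0.9286113]]])

(defn ambiguous-positions
  "Returns [index alternatives] pairs for every token position at which the
  sequences disagree; each alternative is a [tag prob] pair, in sequence order."
  [sequences]
  (let [;; Pair each tag with its probability, then transpose so that each
        ;; column holds all candidate [tag prob] pairs for one token position.
        columns (apply map vector
                       (for [[tags probs] sequences]
                         (map vector tags probs)))]
    (keep-indexed
     (fn [i alternatives]
       (when (> (count (set (map first alternatives))) 1)
         [i (vec alternatives)]))
     columns)))

(ambiguous-positions top-k)
;; => ([1 [["NN" 0.93964833] ["VBG" 0.03690145]]])
```

Here only position 1 ("feeling") is reported, because both sequences agree on every other tag even though their probabilities differ.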