
Question: Best practices for converting OntoNotes to UD

Open boyboytemp opened this issue 4 years ago • 3 comments

What are the current best practices for converting OntoNotes 5.0 to UD format? I didn't find any documentation or issues about this; sorry if it has already been asked. I used this description of the EWT conversion as basic guidance.

There are multiple preprocessors:

  • edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater: it seems to filter out many broken sentences (around 17k).
  • Also, I found a general tool for correcting Penn Treebank trees, edu.stanford.nlp.trees.Treebanks. Does it make sense to invoke it after OntoNotesUDUpdater?
  • Anything else?

After that I apply:

  • edu.stanford.nlp.trees.ud.UniversalDependenciesConverter
  • edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator

The following fields are filled after that: FORM, LEMMA, UPOSTAG, FEATS, HEAD, DEPREL. I didn't find a tool to add the original sentence text to the final CoNLL-U file, or to add token-spacing information. Any clues for these? I found scripts that were used to add SpaceAfter to EWT, but it seems they cannot be applied to OntoNotes.

Postprocessing:

  • There is UniversalEnhancer, which can be used for any language. Can I use pretrained fastText embeddings with this tool, or do I need some special embeddings?
  • Anything else?
Example script:
#!/usr/bin/env bash

convert (){
  local fname="$1"
  local part=${fname#onto.}
  for f in $(<"$fname"); do
      rm -f onto_fixed temp_tree temp_ud

      if [ -n "$MK_CRCT" ]; then
          java -cp "$CORENLP_HOME/*"  -mx5g edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater \
               $f > onto_fixed 2>> "$OUT_DIR"/fixer.log
          f=onto_fixed
      fi

      java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.Treebanks \
           -correct -pennPrint $f \
           > temp_tree 2>> "$OUT_DIR"/correct.log
      java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
           -outputRepresentation enhanced++ -treeFile temp_tree \
           > temp_ud 2>> "$OUT_DIR"/convert-1.log

      java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator \
           temp_ud temp_tree \
           >> "$OUT_DIR"/$part.conllu 2>> "$OUT_DIR"/convert-2.log

  done


  # see https://github.com/stanfordnlp/CoreNLP/issues/1132
  java -cp "$CORENLP_HOME/*"  -mx5g edu.stanford.nlp.trees.ud.UniversalEnhancer \
       -conlluFile "$OUT_DIR"/$part.conllu \
       -relativePronouns "that|which|who|whom|whose|where|That|Which|Who|Whom|Whose|Where" \
       > "$OUT_DIR"/$part.conllu.enhanced 2> "$OUT_DIR"/enhance.log
  rm "$OUT_DIR"/$part.conllu && mv "$OUT_DIR"/$part.conllu.enhanced "$OUT_DIR"/$part.conllu

}

[ -z "$ONTO_DIR" ] && ONTO_DIR="/path/to/onto"
[ -z "$CORENLP_HOME" ] && CORENLP_HOME="/path/to/corenlp"

OUT_DIR="$1"
if [ -z "$OUT_DIR" ]; then
  echo "Pass out_dir as first argument"
  exit 3
fi

mkdir -p "$OUT_DIR"
# create absolute path
OUT_DIR=$(cd "$1"; pwd)
rm -f "$OUT_DIR"/*.conllu

MK_CRCT="$2"

echo "Convert to $OUT_DIR with MK_CRCT=$MK_CRCT"

pushd "$ONTO_DIR"/data/files/data/english/annotations

find . -name "*.parse" > onto
java -cp "$CORENLP_HOME/*"  -mx5g edu.stanford.nlp.parser.tools.OntoNotesFilePreparation onto

convert onto.train
convert onto.dev
convert onto.test


popd

boyboytemp avatar Aug 30 '21 14:08 boyboytemp

The PTB corrector was only intended for the PTB, not OntoNotes. You could always try diffing the corrected and uncorrected trees to see if there is any difference, and if so, whether it's a beneficial difference. In some cases the errors corrected may have been universal, and in others they were very specific to mislabeled PTB trees.
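For example, that comparison could be sketched like this (a hypothetical helper, not part of CoreNLP; it assumes you have already dumped the same file through Treebanks -pennPrint twice, once with -correct and once without, and loaded the resulting lines):

```python
import difflib

def tree_diff(raw_lines, corrected_lines):
    """Unified diff between two lists of pennPrint tree lines.
    An empty result means the corrector left the file untouched."""
    return list(difflib.unified_diff(raw_lines, corrected_lines,
                                     fromfile="raw", tofile="corrected",
                                     lineterm=""))

# Toy example: suppose the corrector retagged one leaf.
raw = ["(NP (DT the) (NNS dog))"]
corrected = ["(NP (DT the) (NN dog))"]
for line in tree_diff(raw, corrected):
    print(line)
```

Running this over each OntoNotes file and keeping only the non-empty diffs would show exactly which corrections fire outside the PTB.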

I don't believe there's a way to include any of the useful metadata, such as sentence number or original text. I don't envision being able to extract SpaceAfter in a way that is guaranteed to be correct, since the spacing information was lost when the text was tokenized and turned into trees, but you may be able to get most of the way there with some general heuristics. Without that, of course, the original text annotation would not be correct either.
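To illustrate the kind of heuristic meant here, below is a rough, non-authoritative sketch. The annotate_sentence helper and the punctuation sets are illustrative assumptions, not CoreNLP API, and the rules would need tuning against real OntoNotes text:

```python
# Heuristic reconstruction of SpaceAfter=No and sentence text for CoNLL-U rows.
# The true spacing was lost at tokenization time, so these are guesses.
NO_SPACE_BEFORE = {".", ",", ";", ":", "?", "!", ")", "]", "}", "''",
                   "n't", "'s", "'re", "'ve", "'ll", "'d", "'m", "%"}
NO_SPACE_AFTER = {"(", "[", "{", "``", "$"}

def annotate_sentence(rows):
    """rows: list of 10-field CoNLL-U token rows (lists of str).
    Sets SpaceAfter=No in MISC (column 10) where the heuristic fires
    and returns the reconstructed sentence text for a '# text =' line."""
    forms = [r[1] for r in rows]
    pieces = []
    for i, row in enumerate(rows):
        no_space = (i + 1 < len(rows) and
                    (forms[i + 1] in NO_SPACE_BEFORE or forms[i] in NO_SPACE_AFTER))
        if no_space:
            row[9] = "SpaceAfter=No" if row[9] == "_" else row[9] + "|SpaceAfter=No"
        pieces.append(forms[i] + ("" if no_space or i == len(rows) - 1 else " "))
    return "".join(pieces)

rows = [
    ["1", "Hello", "hello", "INTJ", "UH", "_", "0", "root", "_", "_"],
    ["2", ",", ",", "PUNCT", ",", "_", "1", "punct", "_", "_"],
    ["3", "world", "world", "NOUN", "NN", "_", "1", "vocative", "_", "_"],
    ["4", "!", "!", "PUNCT", ".", "_", "1", "punct", "_", "_"],
]
print(annotate_sentence(rows))  # -> Hello, world!
```

Quotes are the main trap for any rule set like this: OntoNotes uses `` and '' tokens, and whether a quote attaches left or right depends on pairing state, which this sketch ignores.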

If you don't provide any embeddings, it should work fine. It should also work fine with any embeddings you provide.

One thing to note is that there have been a ton of updates to the lemmas in the UD EWT dataset. With that in mind, you may want to review some of the lemmas produced by this process before assuming they are correct. Ideally the lemmatizer would have had some of these lemma fixes included, but that hasn't happened yet.
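One way to do that review at OntoNotes scale (a sketch; lemma_pairs is a hypothetical helper, not a CoreNLP tool) is to tally FORM/LEMMA pairs from the generated CoNLL-U and eyeball the rare ones, since one-off pairs are often lemmatizer errors:

```python
from collections import Counter

def lemma_pairs(conllu_lines):
    """Count (FORM, LEMMA) pairs in CoNLL-U lines so that rare or
    odd-looking lemmas can be pulled out for manual review."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip MWT ranges and empty nodes
            continue
        counts[(cols[1], cols[2])] += 1  # (FORM, LEMMA)
    return counts

# Usage idea: review the rarest pairs first.
# counts = lemma_pairs(open("train.conllu"))
# for (form, lemma), n in sorted(counts.items(), key=lambda kv: kv[1])[:100]:
#     print(n, form, lemma)
```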

AngledLuffa avatar Aug 31 '21 06:08 AngledLuffa

Thank you for your response! I saw those great changes in UD EWT. I guess they were done with some bash scripting and manual checking. We could try to replicate these corrections, but given the size of OntoNotes it could be a bit difficult.

Ideally the lemmatizer would have had some of these lemma fixes included

It would be awesome!

boyboytemp avatar Sep 01 '21 12:09 boyboytemp

I have since updated the lemmatizer to incorporate many of the fixes in EWT, although it is still not 100% the same.

AngledLuffa avatar Jul 08 '22 06:07 AngledLuffa