CoreNLP
Question: Best practices for converting OntoNotes to UD
What are the current best practices for converting OntoNotes 5.0 to UD format? I didn't find any documentation or issues about this, sorry if it was already asked. I used this description of the EWT conversion as basic guidance.
There are multiple preprocessors:

- `edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater`. It seems to filter out many broken sentences (around 17k).
- Also, I found a general tool for correcting Penn Treebanks in `edu.stanford.nlp.trees.Treebanks`. Does it make sense to invoke it after `OntoNotesUDUpdater`?
- Anything else?
After that I apply:

- `edu.stanford.nlp.trees.ud.UniversalDependenciesConverter`
- `edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator`
The following fields are filled after that: FORM, LEMMA, UPOSTAG, FEATS, HEAD, DEPREL. I didn't find a tool that adds the original sentence text to the final CoNLL-U file, or information about token spacing. Any clues for these? I found the scripts that were used to add SpaceAfter to EWT, but it seems they cannot be applied to OntoNotes.
Postprocessing:

- There is `UniversalEnhancer`, which can be used for any language. Can I use pretrained fastText embeddings with this tool? Or do I need some special embeddings?
- Anything else?
An example script:
```bash
#!/usr/bin/env bash

convert () {
    local fname="$1"
    local part=${fname#onto.}
    for f in $(<"$fname") ; do
        rm -f onto_fixed temp_tree temp_ud
        # optionally run the OntoNotes fixer first
        if [ -n "$MK_CRCT" ]; then
            java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater \
                "$f" > onto_fixed 2>> "$OUT_DIR"/fixer.log
            f=onto_fixed
        fi
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.Treebanks \
            -correct -pennPrint "$f" \
            > temp_tree 2>> "$OUT_DIR"/correct.log
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
            -outputRepresentation enhanced++ -treeFile temp_tree \
            > temp_ud 2>> "$OUT_DIR"/convert-1.log
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator \
            temp_ud temp_tree \
            >> "$OUT_DIR"/"$part".conllu 2>> "$OUT_DIR"/convert-2.log
    done
    # see https://github.com/stanfordnlp/CoreNLP/issues/1132
    java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalEnhancer \
        -conlluFile "$OUT_DIR"/"$part".conllu \
        -relativePronouns "that|which|who|whom|whose|where|That|Which|Who|Whom|Whose|Where" \
        > "$OUT_DIR"/"$part".conllu.enhanced 2> "$OUT_DIR"/enhance.log
    rm "$OUT_DIR"/"$part".conllu && mv "$OUT_DIR"/"$part".conllu.enhanced "$OUT_DIR"/"$part".conllu
}

[ -z "$ONTO_DIR" ] && ONTO_DIR="/path/to/onto"
[ -z "$CORENLP_HOME" ] && CORENLP_HOME="/path/to/corenlp"

OUT_DIR="$1"
if [ -z "$OUT_DIR" ]; then
    echo "Pass out_dir as first argument"
    exit 3
fi
mkdir -p "$OUT_DIR"
# create an absolute path
OUT_DIR=$(cd "$OUT_DIR"; pwd)
rm -f "$OUT_DIR"/*.conllu
MK_CRCT="$2"
echo "Convert to $OUT_DIR with MK_CRCT=$MK_CRCT"

pushd "$ONTO_DIR"/data/files/data/english/annotations || exit 1
# quote the glob so the shell doesn't expand it before find sees it
find . -name "*.parse" > onto
# produces the onto.train / onto.dev / onto.test file lists used below
java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.parser.tools.OntoNotesFilePreparation onto
convert onto.train
convert onto.dev
convert onto.test
popd
```
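For reference, a hypothetical invocation (the script name `convert_onto.sh` and both paths are placeholders of mine, not anything shipped with CoreNLP; `MK_CRCT` is enabled by any non-empty second argument):

```bash
# hypothetical paths; writes the CoNLL-U files to ./ud-out with the fixer pass enabled
ONTO_DIR=/data/ontonotes-release-5.0 CORENLP_HOME=/opt/corenlp \
    bash convert_onto.sh ./ud-out yes
```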
The PTB corrector was only intended for the PTB, not OntoNotes. You could always try diffing the two outputs to see if there is any difference, and if so, whether it's a beneficial difference. In some cases, the errors corrected may have been universal, and in others they were very specific to mislabeled PTB trees.
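For example, one way to do that diff (a sketch; the file path is a placeholder, and it assumes `-pennPrint` can be used without `-correct`):

```bash
# compare one OntoNotes tree file with and without the PTB corrector
f="bc/cctv/00/cctv_0000.parse"   # placeholder path
java -cp "$CORENLP_HOME/*" edu.stanford.nlp.trees.Treebanks -pennPrint "$f" > plain.tree
java -cp "$CORENLP_HOME/*" edu.stanford.nlp.trees.Treebanks -correct -pennPrint "$f" > corrected.tree
diff plain.tree corrected.tree
```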
I don't believe there's a way to include any of the useful metadata, such as sentence number, original text, etc. I don't envision being able to extract SpaceAfter in a way that is guaranteed to be correct, since the spacing information was lost when the text was tokenized and turned into trees, but you may be able to get most of the way there with some general heuristics. Without that, of course, the original text annotation would not be correct either.
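As an illustration only, such a heuristic could be a post-pass over the CoNLL-U output that marks `SpaceAfter=No` before closing punctuation and after opening brackets. This sketch will still get quotes, contractions, and many other cases wrong:

```bash
# heuristic sketch: fill MISC (column 10) with SpaceAfter=No for the obvious cases;
# NOT guaranteed correct, per the caveats above
awk -F'\t' -v OFS='\t' '
    function emit(no_space,    i, line) {
        if (!have) return
        if (no_space) pf[10] = "SpaceAfter=No"
        line = pf[1]
        for (i = 2; i <= 10; i++) line = line OFS pf[i]
        print line
        have = 0
    }
    # comment lines and sentence breaks pass through untouched
    /^#/ || NF == 0 { emit(0); print; next }
    {
        # closing punctuation attaches to the PREVIOUS token
        emit($2 ~ /^[.,;:!?%)\]}]$/)
        for (i = 1; i <= 10; i++) pf[i] = $i
        # opening brackets attach to the FOLLOWING token
        if ($2 ~ /^[(\[{]$/) pf[10] = "SpaceAfter=No"
        have = 1
    }
    END { emit(0) }
' input.conllu > spaced.conllu
```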
If you don't provide any embeddings, it should work fine. It should also work fine with any embeddings you provide.
One thing to note is that there have been a ton of updates to the lemmas in the UD EWT dataset. With that in mind, you may want to review some of the lemmas produced by this process before assuming they are correct. Ideally the lemmatizer would have had some of these lemma fixes included, but that hasn't happened yet.
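If it helps with that review, one cheap spot-check is to rank the FORM/LEMMA pairs by frequency and eyeball the top of the list (the file name here is just whatever your script produced):

```bash
# most frequent FORM -> LEMMA pairs, for manual review
grep -v '^#' "$OUT_DIR"/train.conllu | grep -v '^$' \
    | cut -f2,3 | sort | uniq -c | sort -rn | head -50
```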
Thank you for your response! I saw those great changes in the UD EWT. I guess they were done with some bash scripting and manual checking. We can try to replicate these corrections, but given the size of OntoNotes it may be a bit difficult.
> Ideally the lemmatizer would have had some of these lemma fixes included

It would be awesome!
I have since updated the lemmatizer to incorporate many of the fixes in EWT, although it is still not 100% the same.