Tests for other Languages

Open GeorgeS2019 opened this issue 3 years ago • 3 comments

I am currently going through the Parser tests using the version 4.5.1 model provided for German.

//ParserTests.cs
[Test]
public void ParseEasySentence()
{
     //All steps prior to this work!

     var gs = gsf.newGrammaticalStructure(parse);
}

This call fails with:

java.lang.IllegalArgumentException: 'No head rule defined for NUR using class edu.stanford.nlp.trees.UniversalSemanticHeadFinder in (NUR
  (S (PROPN Christian) (AUX ist)
    (NP (PRON mein) (NOUN Freund)))
  (PUNCT .))

Potentially relevant issue: No head rule defined for IP using class edu.stanford.nlp.trees.SemanticHeadFinder
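
The message reads as if the head finder looks up a rule by node label and finds none for the German root label NUR. Below is a toy sketch of that failure mode, assuming a plain table lookup. The class `ToyHeadRules` and its rule table are invented for illustration and are not CoreNLP's implementation.

```java
import java.util.Map;

// Toy model only: NOT CoreNLP's head finder, just the lookup-and-fail pattern.
final class ToyHeadRules {

    // Hypothetical rule table: parent category -> category of its head child.
    static final Map<String, String> RULES = Map.of(
            "S", "NP",
            "NP", "NOUN");

    // Returns the head category, or fails when the label has no rule.
    static String headFor(String category) {
        String head = RULES.get(category);
        if (head == null) {
            throw new IllegalArgumentException("No head rule defined for " + category);
        }
        return head;
    }

    public static void main(String[] args) {
        System.out.println(headFor("NP"));
        try {
            headFor("NUR"); // root label taken from the German tree above
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that reading is right, a rule table written for one tag set would reject any label it has never seen, which would match both this report and the linked one about IP.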

GeorgeS2019 avatar Sep 29 '22 10:09 GeorgeS2019

https://github.com/stanfordnlp/CoreNLP/issues/1227

I know this is outside the project's scope, but it would be great if you could get German working, e.g. with the following example.

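import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

public class TestSatzErkennung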
{

	public static String text = "Marie was born in Paris. Marie wurde in Paris geboren.";

	public static void main(String[] args) 
	{
		// set up pipeline properties
		Properties props = new Properties();
		// set the list of annotators to run
//		props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");//"tokenize,ssplit,pos,lemma");
//		props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");
//		props.setProperty("tokenize.language", "German");
//		props.setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");
		
		props.setProperty("annotators", "tokenize,ssplit,mwt,pos,ner,depparse");
		props.setProperty("tokenize.language" , "de");
		props.setProperty("tokenize.postProcessor" , "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor");

		props.setProperty("mwt.mappingFile" , "edu/stanford/nlp/models/mwt/german/german-mwt.tsv");

		props.setProperty("pos.model" , "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");

		props.setProperty("ner.model" , "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");
		props.setProperty("ner.applyNumericClassifiers" , "false");
		props.setProperty("ner.applyFineGrained" , "false");
		props.setProperty("ner.useSUTime" , "false");

		props.setProperty("parse.model" , "edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz");
		props.setProperty("depparse.model" , "edu/stanford/nlp/models/parser/nndep/UD_German.gz");
		// build pipeline
		StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
		// create a document object
		CoreDocument document = pipeline.processToCoreDocument(text);
		
		for(CoreSentence sentence : document.sentences())
		{
			System.out.println(sentence);
			
			// display tokens
			for (CoreLabel tok : sentence.tokens()) 
			{
				// tok.lemma() prints null here because no lemma annotator is in the annotators list
				System.out.println(String.format("%s\t%s\t%s\t%s\t%b", tok.word(), tok.lemma(), tok.tag(), tok.ner(), tok.isMWT()));
			}
			
			for(SemanticGraphEdge s : sentence.dependencyParse().edgeIterable())
			{
				System.out.println(s);
			}
		}
	}
}
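
For a test, it may help to keep the German settings in one place so they can be checked before any models are loaded. Below is a minimal sketch using only `java.util`. The class `GermanPipelineProps` and the method `germanProps` are hypothetical names, and the keys and values are copied from the example above.

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper: not part of CoreNLP or Stanford.NLP.NET.
final class GermanPipelineProps {

    // Collects the German settings from the example above in one Properties object.
    static Properties germanProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,mwt,pos,ner,depparse");
        props.setProperty("tokenize.language", "de");
        props.setProperty("tokenize.postProcessor",
                "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor");
        props.setProperty("mwt.mappingFile", "edu/stanford/nlp/models/mwt/german/german-mwt.tsv");
        props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/german-ud.tagger");
        props.setProperty("ner.model", "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz");
        props.setProperty("ner.applyNumericClassifiers", "false");
        props.setProperty("ner.applyFineGrained", "false");
        props.setProperty("ner.useSUTime", "false");
        props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz");
        props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_German.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = germanProps();
        // Print the settings in sorted key order for easy comparison.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

The returned object could then be passed to `new StanfordCoreNLP(props)` exactly as in the example, with the German models on the classpath.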

GeorgeS2019 avatar Sep 29 '22 18:09 GeorgeS2019

I am happy to merge tests that check that German works as expected, especially if you have a working sample.

sergey-tihon avatar Sep 29 '22 20:09 sergey-tihon

@sergey-tihon Good to hear that. Searching the internet, most users complain about German (especially the dependency parsing, which is the most critical part, as OpenNLP has no such feature). It is most likely the least tested language, so it is good to have you take a second look :-)

GeorgeS2019 avatar Sep 29 '22 20:09 GeorgeS2019

This is solved now

GeorgeS2019 avatar Feb 29 '24 16:02 GeorgeS2019