dkpro-wsd
MASCReader returns empty CASes
Originally reported on Google Code with ID 65
MASCReader generates an empty CAS for some files in the corpus
Reported by MedKhemakhemFSEGS on 2014-12-03 18:54:40
- _Attachment: [mascModified.groovy](https://storage.googleapis.com/google-code-attachments/dkpro-wsd/issue-65/comment-0/mascModified.groovy)_
I reproduced this problem.
For the MASC sentence corpus, the MASCReader returns 1865 empty CASes and 13754 normal
CASes.
I copied a version of the MASCReader into a local project and ran the following pipeline:
    String patterns = "round*/*-v/*-wn.xml";
    SimplePipeline.runPipeline(
        createReaderDescription(
            MascReader.class,
            MascReader.PARAM_IGNORE_TIES, true,
            MascReader.PARAM_SOURCE_LOCATION, MASCDirectory,
            MascReader.PARAM_PATTERNS, new String[] {
                ResourceCollectionReaderBase.INCLUDE_PREFIX + patterns
            }),
        createEngineDescription(LanguageToolSegmenter.class),
        createEngineDescription(MascProblemFinder.class)
        //createEngineDescription(CasDumpWriter.class)
    );
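The MascProblemFinder used in this pipeline is only available in the attachment and is not reproduced here. As a rough illustration of the kind of tally that yields counts of empty versus normal CASes, here is a minimal plain-Java sketch. It does not use UIMA: each String stands in for the document text of one CAS, and the class name EmptyCasTally and the rule "null or blank text means empty" are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of dkpro-wsd: each String models the
// document text of one CAS; null or blank text is counted as "empty".
final class EmptyCasTally {
    final int empty;
    final int nonEmpty;

    private EmptyCasTally(int empty, int nonEmpty) {
        this.empty = empty;
        this.nonEmpty = nonEmpty;
    }

    // Counts how many document texts are empty and how many are not.
    static EmptyCasTally of(List<String> documentTexts) {
        int empty = 0;
        int nonEmpty = 0;
        for (String text : documentTexts) {
            if (text == null || text.trim().isEmpty()) {
                empty++;
            } else {
                nonEmpty++;
            }
        }
        return new EmptyCasTally(empty, nonEmpty);
    }

    public static void main(String[] args) {
        // Arrays.asList is used because it permits null elements.
        EmptyCasTally tally = EmptyCasTally.of(
                Arrays.asList("He rounded the corner.", null, ""));
        System.out.println(tally.empty + " empty, " + tally.nonEmpty + " non-empty");
    }
}
```

In a real UIMA pipeline the same check would presumably sit in an annotator's process method and inspect the document text of each incoming CAS.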
I modified the MASCReader to return a placeholder sentence instead of an empty CAS. This is where
the problem is introduced:
    // if no tie between annotators is discovered
    if (documentText != null) {
        setDocumentMetadata(jCas, node);
        jCas.setDocumentText(documentText);
    }
    else {
        setDocumentMetadata(jCas, node);
        jCas.setDocumentText("This is an empty Cas.");
        //jCas.reset(); // TODO here the CAS is emptied
    }
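A possible alternative to emitting either an empty CAS or a placeholder sentence is to skip tied instances during look-ahead, so that the reader only reports another document when it has text to put into it. A UIMA collection reader must fill a CAS whenever hasNext() has returned true, so the skipping has to happen before that point. The following is a minimal plain-Java sketch of that look-ahead and not the actual MascReader code: TieSkippingIterator is a hypothetical name, and a null text is assumed to model an instance dropped because of an annotator tie.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the reader's look-ahead logic. A null element
// in the source models an instance whose text was dropped due to a tie.
final class TieSkippingIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String next; // next non-null text, or null when exhausted

    TieSkippingIterator(Iterator<String> source) {
        this.source = source;
        advance();
    }

    // Moves to the next non-null text, silently skipping tied instances.
    private void advance() {
        next = null;
        while (next == null && source.hasNext()) {
            next = source.next();
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("First sentence.", null, "Second sentence.");
        Iterator<String> it = new TieSkippingIterator(texts.iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

With this approach downstream components never see a document without text. The trade-off is that the number of documents emitted no longer matches the number of instances in the corpus files.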
Reported by eckle.kohler on 2014-12-08 20:45:13
I don't recall much about the MASC corpus format, so I don't have much context to help
me interpret this problem report. I take it from reading the code that the empty CAS
was returned only in those cases where there was a tie between the annotators. Is
this perhaps the intended behaviour? If not, is your modified code above intended
to fix the problem?
Reported by [email protected] on 2014-12-11 14:48:43