
MASCReader returns empty CASes

Open logological opened this issue 10 years ago • 2 comments

Originally reported on Google Code with ID 65

MASCReader generates an empty CAS for some files in the corpus

Reported by MedKhemakhemFSEGS on 2014-12-03 18:54:40


- _Attachment: [mascModified.groovy](https://storage.googleapis.com/google-code-attachments/dkpro-wsd/issue-65/comment-0/mascModified.groovy)_

logological avatar Jun 24 '15 15:06 logological

I reproduced this problem.

For the MASC sentence corpus, the MASCReader returns 1865 empty CASes and 13754 normal
CASes.

I copied a version of the MASCReader into a local project and ran the following pipeline:

        String patterns = "round*/*-v/*-wn.xml";
        SimplePipeline.runPipeline(
                createReaderDescription(
                        MascReader.class,
                        MascReader.PARAM_IGNORE_TIES, true,
                        MascReader.PARAM_SOURCE_LOCATION, MASCDirectory,
                        MascReader.PARAM_PATTERNS, new String[] {
                                ResourceCollectionReaderBase.INCLUDE_PREFIX + patterns
                        }),
                createEngineDescription(LanguageToolSegmenter.class),
                createEngineDescription(MascProblemFinder.class)
                //createEngineDescription(CasDumpWriter.class)
                );
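
For readers who want to reproduce the counts, the check itself can be sketched in plain Java. This is only an illustration: `EmptyDocumentCounter` is a hypothetical stand-in, not the `MascProblemFinder` used above (whose source is not shown here), and it assumes that an "empty CAS" is one whose document text is null or empty. In a real UIMA annotator the same test would presumably be applied to `jCas.getDocumentText()` inside `process()`.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper: tallies empty vs. non-empty document texts.
 * Assumption: a CAS counts as "empty" when its text is null or "".
 */
public class EmptyDocumentCounter {
    private int empty;
    private int nonEmpty;

    /** Classify one document text and update the tallies. */
    public void process(String documentText) {
        if (documentText == null || documentText.isEmpty()) {
            empty++;
        } else {
            nonEmpty++;
        }
    }

    public int getEmpty() {
        return empty;
    }

    public int getNonEmpty() {
        return nonEmpty;
    }

    public static void main(String[] args) {
        // Made-up sample texts, not taken from the MASC corpus.
        List<String> texts = Arrays.asList("He ran home.", null, "", "They round the corner.");
        EmptyDocumentCounter counter = new EmptyDocumentCounter();
        for (String text : texts) {
            counter.process(text);
        }
        System.out.println("empty=" + counter.getEmpty() + " nonEmpty=" + counter.getNonEmpty());
    }
}
```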

I modified the MASCReader to return a placeholder sentence instead of an empty CAS. The
problem is introduced here, in the else branch:

        // if no tie between annotators is discovered
        if (documentText != null) {
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText(documentText);
        }
        else {
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText("This is an empty Cas.");

            //jCas.reset(); // TODO here the CAS is emptied
        }
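
One possible alternative to emitting either an empty CAS or a placeholder sentence would be to skip tied items before they reach the pipeline. The sketch below shows that look-ahead idea in plain Java, under the assumption that a null text marks a tie (as the `documentText != null` check above suggests). `TieSkippingIterator` is a hypothetical name and not part of MASCReader; in a collection reader the equivalent logic would presumably sit in `hasNext()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical wrapper: yields only non-null texts from the source,
 * so that tied items (assumed to be null) never produce a document.
 */
public class TieSkippingIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String lookahead;

    public TieSkippingIterator(Iterator<String> source) {
        this.source = source;
        advance();
    }

    /** Move the look-ahead to the next non-null text, or null if exhausted. */
    private void advance() {
        lookahead = null;
        while (source.hasNext()) {
            String candidate = source.next();
            if (candidate != null) {
                lookahead = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != null;
    }

    @Override
    public String next() {
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        String result = lookahead;
        advance();
        return result;
    }

    /** Collect everything the iterator yields into a list. */
    public static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample: null entries stand in for tied items.
        List<String> raw = Arrays.asList("first", null, "second", null);
        System.out.println(drain(new TieSkippingIterator(raw.iterator())));
    }
}
```

Whether skipping is appropriate depends on whether downstream components expect one CAS per corpus item; that is not established in this thread.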


Reported by eckle.kohler on 2014-12-08 20:45:13

logological avatar Jun 24 '15 15:06 logological

I don't recall much about the MASC corpus format, so I don't have much context to help
me interpret this problem report.  I take it from reading the code that the empty CAS
was returned only in those cases where there was a tie between the annotators.  Is
this perhaps the intended behaviour?  If not, is your modified code above intended
to fix the problem?

Reported by [email protected] on 2014-12-11 14:48:43

logological avatar Jun 24 '15 15:06 logological