cogcomp-nlp

Corpusreader for TAC dataset - need usage instructions

Open lashmore opened this issue 2 years ago • 0 comments

It is hard to work out how the TACReader class is meant to be used. What path should I pass as "corpusRoot"? Here is the file hierarchy of the raw TAC 2014-2015 data; the 2015 folder has a similar structure to the 2014 one.

From what I can tell, TACReader parses XML documents. The only folder containing XML data is source_documents: its .txt files contain XML markup. Does TACReader parse ONLY the files in source_documents, or does it also read from other folders in the hierarchy?

[Image: Screen Shot 2021-08-30 at 2.50.10 PM]

Here is how I'm trying to use TACReader, followed by the error message I get. I've tried several different paths for corpusRoot, and they all produce the same error, so I'm working blind here. Any help would be much appreciated!

import edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader;

public class PreprocessTAC {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/tac_kbp_eng_event_arg_comp_train_eval_2014-2015/data/";
        TACReader reader_tac = new TACReader(path, false);
    }
}

Error message:

Exception in thread "main" java.lang.NullPointerException: Cannot read the array length because "<local4>" is null
	at edu.illinois.cs.cogcomp.core.io.IOUtils.lsFilesRecursive(IOUtils.java:145)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.getFileListing(TACReader.java:239)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.initializeReader(XmlDocumentReader.java:107)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader.<init>(AnnotationReader.java:47)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader.<init>(AbstractIncrementalCorpusReader.java:61)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.<init>(XmlDocumentReader.java:89)
	at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.<init>(TACReader.java:113)
	at PreprocessTAC.main(PreprocessTAC.java:7)
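
One check that might help narrow this down: the message above is what Java reports when code reads the length of a null array, and java.io.File.listFiles() returns null when the path is not a directory or cannot be read. Assuming (I have not confirmed this from the cogcomp-nlp source) that IOUtils.lsFilesRecursive calls listFiles() on the directory it is handed, a small standalone check like the sketch below would show whether a candidate path can be listed at all. CorpusRootCheck and diagnose are hypothetical names used for illustration, not part of cogcomp-nlp.

```java
import java.io.File;

// Hypothetical helper, not part of cogcomp-nlp: reports whether a path can be
// listed as a directory using only java.io.File.
final class CorpusRootCheck {

    /** Returns a short diagnosis of whether a directory listing would succeed for this path. */
    static String diagnose(File root) {
        if (!root.exists()) {
            return "missing: " + root;
        }
        if (!root.isDirectory()) {
            return "not a directory: " + root;
        }
        // listFiles() returns null if the directory cannot be read (for example
        // a permission problem or an I/O error), which would cause an NPE in
        // code that uses the result without a null check.
        File[] children = root.listFiles();
        if (children == null) {
            return "unreadable: " + root;
        }
        return "ok: " + children.length + " entries in " + root;
    }

    public static void main(String[] args) {
        // Pass the candidate corpusRoot as the first argument; defaults to the
        // current working directory.
        String path = args.length > 0 ? args[0] : ".";
        System.out.println(diagnose(new File(path)));
    }
}
```

Since the directory TACReader actually lists may be a subfolder of corpusRoot rather than corpusRoot itself (again an assumption), it may be worth running the check on each candidate folder in the hierarchy, not just the top-level path.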

lashmore • Aug 30 '21 18:08