UniversalPetrarch
Error when running preprocess_doc.py
I'm getting an error when I try to run the built-in English demo. I've downloaded CoreNLP, UDPipe, and the models, but I'm hitting an error in the Python code that runs right after CoreNLP.
Does the demo not work with the built-in GigaWord.sample.PETR.xml file?
Here's the error:
ahalterman:preprocessing$ bash run_document.sh GigaWord.sample.PETR.xml
Call Stanford CoreNLP to do sentence splitting...
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
Processing file /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml ... writing to /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.out
Annotating file /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml ... done [0.3 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.2 sec.
CleanXmlAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.3 sec. for 12272 tokens at 38470.2 tokens/sec.
Pipeline setup: 0.1 sec.
Total time for StanfordCoreNLP pipeline: 0.8 sec.
Generate sentence xml file...
Traceback (most recent call last):
File "preprocess_doc.py", line 161, in <module>
read_doc_input(inputxml,inputparsed,outputfile)
File "preprocess_doc.py", line 96, in read_doc_input
doc = doctexts[0]
IndexError: list index out of range
Hi Andy,
The error is caused by Stanford CoreNLP processing both the "text" and "parse" elements in GigaWord.sample.PETR.xml. I have fixed the errors. Please try run_sentence.sh GigaWord.sample.PETR.xml again.
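For anyone else hitting this IndexError: the crash happens because `doctexts[0]` indexes an empty list when no "text" element survives CoreNLP's preprocessing. A minimal defensive sketch of that extraction step (the element names and the `extract_doc_text` helper are assumptions based on the traceback, not the actual preprocess_doc.py code):

```python
import xml.etree.ElementTree as ET

def extract_doc_text(xml_string):
    """Return the content of the first <text> element, or None if absent.

    Guards against the IndexError raised by doctexts[0] when the parsed
    document contains no <text> element (e.g. because CoreNLP consumed
    both the <text> and <parse> elements).
    """
    root = ET.fromstring(xml_string)
    doctexts = root.findall(".//text")
    if not doctexts:  # an empty list would make doctexts[0] raise IndexError
        return None
    return doctexts[0].text

# A document with a <text> element returns its content:
print(extract_doc_text("<doc><text>Some story.</text></doc>"))  # Some story.
# A document without one returns None instead of crashing:
print(extract_doc_text("<doc><parse>...</parse></doc>"))  # None
```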
Thanks, @JingL1014! That fixed that problem, but now I'm hitting another one. The initial Python step and CoreNLP both run, but I get this error when I try to load the UDPipe model:
Call udpipe to do pos tagging and dependency parsing...
Loading UDPipe model: Cannot load UDPipe model '/Users/ahalterman/MIT/NSF_RIDIR/udpipe-1.0.0-bin/models/english-ud-2.0-170801.udpipe'!
I've double checked the path so I think it's a problem elsewhere.
I also ran into another issue that was fixed by specifying Python 2 in run_sentence.sh:
ahalterman:preprocessing$ bash run_sentence.sh GigaWord.sample.PETR.xml
Prepare file for stanford CoreNLP
Traceback (most recent call last):
File "preprocess_sent.py", line 67, in <module>
main()
File "preprocess_sent.py", line 63, in main
read_sentence_input(inputxml,outputfile)
File "preprocess_sent.py", line 56, in read_sentence_input
ofile.write(line.encode('utf-8')+"\n")
TypeError: can't concat bytes to str
When I change the script to use python2, this error goes away.
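For reference, the underlying issue is that `line.encode('utf-8')` returns `bytes` in Python 3, which can't be concatenated with the `str` `"\n"`. A version-independent sketch of that write (the variable names follow the traceback; the actual preprocess_sent.py code may differ):

```python
import io

def write_line(path, line):
    # Open in text mode with an explicit encoding. io.open behaves the
    # same on Python 2 and 3, so no manual .encode() is needed and the
    # bytes-vs-str concatenation error cannot occur.
    with io.open(path, "w", encoding="utf-8") as ofile:
        ofile.write(line + u"\n")  # unicode + unicode works on both versions

write_line("out.txt", u"café")
```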
Hi Andy,
I think it is because of a mismatch between the udpipe version and the language-model version. I am using udpipe-1.0.0 with the UD 1.2 language models ( http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models ), and that works properly. To use a UD 2.0 language model, you have to use udpipe-1.2.0.
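A quick way to narrow down "Cannot load UDPipe model": first confirm the model file actually exists at the path you pass, then make sure the binary and model come from compatible releases. A sketch (the model path is a hypothetical example; adjust it to your install):

```shell
# Hypothetical model location; point this at your own download.
MODEL="$HOME/udpipe-models/english-ud-1.2-160523.udpipe"

# "Cannot load UDPipe model" is raised both for a missing file and for a
# model built for an incompatible UDPipe release, so rule out the easy
# case first:
if [ -f "$MODEL" ]; then
  echo "model file found"
else
  echo "model file missing: $MODEL"
fi

# Then run the parser with a matching binary/model pair, e.g.:
# ./udpipe --tokenize --tag --parse "$MODEL" input.txt
```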
With the correct model, it ran just fine. Thanks!
Hey,
I am having a similar issue when I try to run ./run_document.sh Sample_english_doc.xml. The error output is similar to the one Andy showed; here it is:
Call Stanford CoreNLP to do sentence splitting...
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
Processing file /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml ... writing to /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml.out
Annotating file /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml ... done [0.1 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
CleanXmlAnnotator: 0.0 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 219 tokens at 2517.2 tokens/sec.
Pipeline setup: 0.1 sec.
Total time for StanfordCoreNLP pipeline: 0.3 sec.
Generate sentence xml file...
Traceback (most recent call last):
File "preprocess_doc.py", line 161, in <module>
read_doc_input(inputxml,inputparsed,outputfile)
File "preprocess_doc.py", line 113, in read_doc_input
doc = doctexts[0]
IndexError: list index out of range
Call udpipe to do pos tagging and dependency parsing...
readline() on closed filehandle DOC at /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/scripts/create_conll_corpus_from_text.pl line 6.
Loading UDPipe model: done.
Ouput parsed xml file...
Traceback (most recent call last):
File "generateParsedFile.py", line 47, in <module>
update_xml_input(inputFile,parsedFile,outputFile)
File "generateParsedFile.py", line 15, in update_xml_input
xml_file = io.open(inputfile,'rb')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml-sent.xml'
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml-sent.txt: No such file or directory
I have also tried ./run_sentence.sh GigaWord.sample.PETR.xml, and it gives me a different error:
Prepare file for stanford CoreNLP
Call Stanford CoreNLP to do tokenization...
property file path:
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: argsToProperties could not read properties file: true
at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1011)
at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:927)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1416)
Caused by: java.io.IOException: Unable to open "true" as class path, filename or URL
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
at edu.stanford.nlp.io.IOUtils.readerFromString(IOUtils.java:618)
at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1002)
... 2 more
Generate sentence xml file...
Traceback (most recent call last):
File "preprocess.py", line 140, in <module>
read_doc_input(inputxml,inputparsed,outputfile)
File "preprocess.py", line 61, in read_doc_input
parsed = io.open(inputparsed,'r',encoding='utf-8')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.raw.txt.out'
readline() on closed filehandle DOC at /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/scripts/create_conll_corpus_from_text.pl line 6.
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.raw.txt.out: No such file or directory
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.txt: No such file or directory
Call udpipe to do pos tagging and dependency parsing...
Udpipe model path:
Loading UDPipe model: Cannot load UDPipe model '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll'!
Ouput parsed xml file...
Traceback (most recent call last):
File "generateParsedFile.py", line 47, in <module>
update_xml_input(inputFile,parsedFile,outputFile)
File "generateParsedFile.py", line 9, in update_xml_input
pfile = io.open(parsedfile,'r',encoding='utf-8')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll.predpos.pred'
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll.predpos.pred: No such file or directory
Any ideas on what could be the problem?
Thanks
@JingL1014 This looks like the error I got before you updated the code. Any idea what's going on?
@khaledJabr For the second problem, you missed an argument. You have to specify the language of the input file; the value is EN, ES, or AR. Please run ./run_sentence.sh GigaWord.sample.PETR.xml EN
@khaledJabr For the first problem, I can run the code on my machine without any error. May I know which version of Stanford CoreNLP you are using?
@JingL1014 I need to clarify one thing. There are two text folders in the repo: one in /UniversalPetrarch/preprocessing and one in /UniversalPetrarch/data/text. When I configured run_sentence.sh and run_document.sh, I set FILEPATH and FILE to /UniversalPetrarch/data/text, and that's where Sample_english_doc.xml exists.
I am still getting the same error when I run ./run_document.sh Sample_english_doc.xml.
However, after running ./run_sentence.sh GigaWord.sample.PETR.xml EN, I found out that I didn't have the UDPipe models installed correctly. I fixed that, and it works correctly now.
I am using the latest version of CoreNLP, 3.9.0.
@khaledJabr I updated run_document.sh; the script now also requires an argument specifying the language. Please try ./run_document.sh Sample_english_doc.xml EN again. I was not able to download CoreNLP 3.9.0, but I tested with CoreNLP 3.8.0, and it runs correctly. If it is still not working with CoreNLP 3.9.0, could you comment out lines 77-80 and send me the files generated in the intermediate steps?