cnn-dailymail

error while running make_datafiles.py

Open 97yogitha opened this issue 7 years ago • 11 comments

@abisee this is the error I get when I run the command `python make_datafiles.py cnn/stories dailymail/stories`

Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
	at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
	at java.io.BufferedWriter.write(BufferedWriter.java:221)
	at java.io.Writer.write(Writer.java:157)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
  File "make_datafiles.py", line 235, in <module>
    tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
  File "make_datafiles.py", line 86, in tokenize_stories
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?

97yogitha avatar Oct 28 '17 03:10 97yogitha

Please let me know: are you using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar, or the 2017 one? This error mostly occurs when you are not using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar. Please check.
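A quick way to verify which jar the tokenizer is picking up (a hedged sketch, not from this thread; it only assumes the standard PTBTokenizer entry point and the CLASSPATH environment variable):

```python
# Sanity check before running make_datafiles.py: print the CLASSPATH and make
# sure PTBTokenizer actually runs; if this fails, the Python script fails too.
import os
import subprocess

print(os.environ.get("CLASSPATH"))  # should point at stanford-corenlp-3.7.0.jar

p = subprocess.Popen(
    ["java", "edu.stanford.nlp.process.PTBTokenizer"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(b"Please tokenize this text.\n")
print(out)  # expect one token per line if the tokenizer is set up correctly
```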

JafferWilson avatar Oct 30 '17 03:10 JafferWilson

I had a similar issue, though not sure if it's the same cause. See: https://github.com/abisee/cnn-dailymail/issues/12

ibarrien avatar Oct 30 '17 03:10 ibarrien

I have already created the processed files; you can try those without any issue. Here is the link: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail Use Python 2.7.

JafferWilson avatar Oct 30 '17 04:10 JafferWilson

@JafferWilson Yes, I am using stanford-corenlp-full-2017-09-0/stanford-corenlp-3.8.0.jar. I will use the processed file.

97yogitha avatar Oct 30 '17 05:10 97yogitha

@97yogitha No, do not use the 2017 one; use the 2016 version, which is mentioned in the README file of the repository.

JafferWilson avatar Oct 30 '17 05:10 JafferWilson

@JafferWilson Thanks for the help. I used 3.7.0 from https://stanfordnlp.github.io/CoreNLP/history.html and it worked.

IreneZihuiLi avatar Nov 23 '17 20:11 IreneZihuiLi

Thanks very much. Today I encountered this problem with the newest version, 3.8.0; I switched to 3.7.0 and it worked.

Neuqmiao avatar Dec 07 '17 13:12 Neuqmiao

Could someone please close this issue?

JafferWilson avatar Dec 08 '17 05:12 JafferWilson

@JafferWilson Could you help with running the neural network on our own data? How do we generate .bin files for our own articles?

I have a clear idea about tokenization, but what about the URL mapping? How is that done?

Sharathnasa avatar Jan 14 '18 13:01 Sharathnasa

Hi @Sharathnasa, you can clone the repository below: https://github.com/dondon2475848/make_datafiles_for_pgn and run

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format.
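If you want to confirm what ended up in the generated files, here is a minimal reader sketch. It assumes the output keeps the layout the original make_datafiles.py writes (an 8-byte length prefix followed by a serialized tf.Example with 'article' and 'abstract' features); the file name below is hypothetical, and the feature names may differ in the repository above.

```python
# Minimal .bin reader sketch: each record is an 8-byte length followed by a
# serialized tf.train.Example holding 'article' and 'abstract' byte features.
import struct
import tensorflow as tf

def read_bin(path):
    with open(path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example_str = struct.unpack("%ds" % str_len, reader.read(str_len))[0]
            yield tf.train.Example.FromString(example_str)

for ex in read_bin("./output/test.bin"):  # hypothetical output path
    article = ex.features.feature["article"].bytes_list.value[0]
    abstract = ex.features.feature["abstract"].bytes_list.value[0]
    print(article[:80])
    break
```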

dondon2475848 avatar Mar 07 '18 00:03 dondon2475848

Check the subprocess.call(command) step: set the classpath using os.environ["CLASSPATH"] = 'stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar', then run it.
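As a concrete illustration of that suggestion (a hedged sketch, not a verbatim patch; the PTBTokenizer arguments mirror the file-list call in make_datafiles.py, with mapping.txt standing in for the file list the script writes):

```python
# Export the classpath before the script shells out to the tokenizer, so
# subprocess.call launches Java against the 3.7.0 jar.
import os
import subprocess

os.environ["CLASSPATH"] = "stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar"

command = ["java", "edu.stanford.nlp.process.PTBTokenizer",
           "-ioFileList", "-preserveLines", "mapping.txt"]
subprocess.call(command)  # inherits os.environ, including CLASSPATH
```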

ARNABKUMARPAN avatar Nov 11 '19 09:11 ARNABKUMARPAN