
Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer

TianlinZhang668 opened this issue 5 years ago • 8 comments

I ran make_datafiles.py, but it fails with this error:

Preparing to tokenize /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in /home/ztl/Downloads/cnn_stories/cnn/stories and saving in cnn_stories_tokenized...
Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.process.PTBTokenizer
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):

However, I can run echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer in the root directory. I don't know how to deal with this. Thanks a lot.

TianlinZhang668 avatar Apr 09 '19 02:04 TianlinZhang668

I am running corenlp-3.9.2.jar.

TianlinZhang668 avatar Apr 09 '19 02:04 TianlinZhang668

You need stanford-corenlp-3.7.0.jar. See this: https://github.com/abisee/cnn-dailymail#2-download-stanford-corenlp and please read the README.md file.

ubaidsworld avatar Apr 09 '19 05:04 ubaidsworld
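For anyone hitting the ClassNotFoundException: make_datafiles.py shells out to java, so the Stanford jar has to be on the CLASSPATH of the environment the script actually runs in, not just the shell where the echo test worked. A minimal pre-flight sketch (the jar path below is an example placeholder; adjust it to wherever you unpacked CoreNLP 3.7.0):

```python
import os
import subprocess

# Example path only; point this at your own stanford-corenlp-3.7.0.jar.
JAR = "/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar"

# make_datafiles.py inherits this environment, so the jar must be visible here.
os.environ["CLASSPATH"] = JAR + os.pathsep + os.environ.get("CLASSPATH", "")

# Same smoke test as the README: pipe a sentence through PTBTokenizer.
proc = subprocess.run(
    ["java", "edu.stanford.nlp.process.PTBTokenizer"],
    input=b"Please tokenize this text.",
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
if proc.returncode != 0:
    raise SystemExit("PTBTokenizer not found on CLASSPATH:\n" + proc.stderr.decode())
print(proc.stdout.decode())  # should print the tokenized sentence
```

If this check fails but the manual echo test works, the CLASSPATH export is most likely missing from the shell or environment you launch the script from.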

Successfully finished tokenizing /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized.

Making bin file for URLs listed in url_lists/all_test.txt...
Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 154, in write_to_bin
    url_hashes = get_url_hashes(url_list)
  File "make_datafiles.py", line 106, in get_url_hashes
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 106, in <listcomp>
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 101, in hashhex
    h.update(s)
TypeError: Unicode-objects must be encoded before hashing

I have got the files tokenized, but the next step fails as above.

TianlinZhang668 avatar Apr 09 '19 06:04 TianlinZhang668
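The TypeError above is the usual Python 3 str-versus-bytes mismatch: hashlib only accepts bytes, and the script was written for Python 2, where h.update(s) worked on a plain string. A minimal sketch of the fix, patterned on the hashhex function named in the traceback (not necessarily the repository's exact code):

```python
import hashlib

def hashhex(s):
    """Return the SHA-1 hex digest of the string s, encoding it to bytes for Python 3."""
    h = hashlib.sha1()
    h.update(s.encode("utf-8"))  # encode first; h.update(s) raises TypeError on a str
    return h.hexdigest()

def get_url_hashes(url_list):
    return [hashhex(url) for url in url_list]
```

Alternatively, running the script under Python 2 avoids the change, since the original code predates the bytes/str split.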

Try this: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
I guess it will solve your tokenization issue and the remaining ones as well.

JafferWilson avatar Apr 09 '19 10:04 JafferWilson

What if the content of my articles does not follow the same structure as the CNN articles?

quanghuynguyen1902 avatar May 09 '19 08:05 quanghuynguyen1902

@quanghuynguyen1902 I guess you have already opened a new issue for this: https://github.com/abisee/cnn-dailymail/issues/29
Let's continue there. Could someone please close this issue?

JafferWilson avatar May 09 '19 10:05 JafferWilson

I am facing the same issue here.

mooncrater31 avatar Dec 28 '19 07:12 mooncrater31

source ./.bash_profile

SpaceTime1999 avatar Sep 03 '21 13:09 SpaceTime1999