pointer-generator
Tried running it on random internet news articles. Results look more extractive than abstractive?
Hi Abigail, I was trying to run the code using the pretrained model that was uploaded, as I do not have a powerful enough machine to train. I believe the vocab size is set to 50000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive. I didn't really encounter any situation where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?
Hi @anubhavmax, the same question has been asked here.
Yes - the pointer-generator model produces mostly extractive summaries. This is discussed in section 7.2 of the paper. It is the main area for future work!
@anubhavmax Hi, how did you manage to run it on your own data? Could you please shed some light?
Thanks, Sharath
@Sharathnasa, you need to run the text through the Stanford tokenizer Java program first in order to create a token list file to feed to the network.
Basically, in Linux, you run
cat normal_text.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines
And it will print a tokenized version of the text, which you need to save to a new file. That file is then fed into the pointer generator network with the "--data_path=" argument and "--mode=decode".
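In case scripting this is easier than piping by hand, here is a minimal Python sketch that shells out to the same Stanford tokenizer; the file names are placeholders, and it assumes java and the Stanford CoreNLP jar are already on your CLASSPATH, just like the command above:

import subprocess

def tokenize_article(in_path, out_path):
    # Run the Stanford PTBTokenizer on a plain-text article and save the output.
    # Equivalent to: cat in_path | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > out_path
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(
            ["java", "edu.stanford.nlp.process.PTBTokenizer", "-preserveLines"],
            stdin=fin, stdout=fout, check=True)

# Example with placeholder names:
# tokenize_article("normal_text.txt", "normal_text.tokenized.txt")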
@alkanen Thanks a lot, man! I will give it a try. By "text", do you mean I can pass just the entire article without an abstract and it will work fine?
Or should I process it into .bin and vocab files as explained in the cnn-dailymail repo? And one more thing: how is the one-to-one mapping between URLs and stories done? If I need to do that, how should I proceed?
@Sharathnasa Text as in the entire article without an abstract, yes. That will create a bin file with a single article in it. Use the vocab file you already have from the CNN training set; it doesn't make much sense to create a new one based on a single article, and unless I misremember it would also break everything, because the network was trained on a particular vocab and that one needs to be used.
I'm afraid I never looked into the URL/stories mapping since that wasn't relevant for the work I did, so I can't help you there.
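For reference, the vocab file produced by the cnn-dailymail preprocessing is just a plain-text file with one "word count" pair per line, most frequent first, and the model reads the first vocab_size entries. A simplified sketch of how such a file is read (the real data.Vocab class in the repo also reserves ids for special tokens like [PAD] and [UNK], so treat this as an approximation):

def load_vocab(vocab_path, max_size=50000):
    # Read a 'word count' per-line vocab file into a word-to-id mapping.
    word2id = {}
    with open(vocab_path, "r") as f:
        for line in f:
            pieces = line.split()
            if len(pieces) != 2:
                continue  # skip malformed lines
            word = pieces[0]
            if word not in word2id:
                word2id[word] = len(word2id)
            if len(word2id) >= max_size:
                break
    return word2id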
@alkanen Thanks once again, man. When I try to run it as you mentioned, I'm getting the error below:
vi womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines
Vim: Warning: Output is not to a terminal
Untokenizable: (U+1B, decimal: 27)
Could you please pass along the script if you have one?
@Sharathnasa You can't pipe vi into java; use cat to pipe the contents of the text file into java.
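For example, reusing the file name from the error above (the output file name here is just an example):
cat womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > womendriver.tokenized.txt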
@alkanen OK, my bad. Thanks once again. After performing tokenization (and saving the output), do I need to run the make_datafiles.py code to generate the .bin files?
Nope, just use the old vocab file used for training, and the file created by tokenization as input to the model:
python pointer-generator/run_summarization.py --log_root=<some path with trained models in it> --exp_name=<the name of your trained model> --vocab_path=<your old vocab file> --mode=decode --data_path=<the file generated by tokenizer>
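One detail worth spelling out about --log_root and --exp_name: if I read run_summarization.py correctly, the two are joined, and decode mode then looks for checkpoints under <log_root>/<exp_name>/train/. A rough sketch of that lookup, with placeholder paths:

import os
import tensorflow as tf

log_root = "/path/to/log_root"   # placeholder
exp_name = "myexperiment"        # placeholder
train_dir = os.path.join(log_root, exp_name, "train")

# The decoder keeps retrying until this returns a valid checkpoint state,
# i.e. the directory contains a 'checkpoint' file plus the model.ckpt-* files.
ckpt_state = tf.train.get_checkpoint_state(train_dir)
print(ckpt_state.model_checkpoint_path if ckpt_state else "no checkpoint found")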
@alkanen Did you take a look at this? https://github.com/abisee/pointer-generator/issues/51
No, anything in particular there you mean I should be aware of?
I never had the need to summarize multiple texts at once, so I haven't looked into that use case at all.
@alkanen Nothing in particular, I just wanted to let you know the command he suggested running.
One more query I have:
- The repo says the input should be in the form of .bin files, but the tokenized output we produced is not a .bin file; will the network still run?
- Is what you suggested only for running a single article?
Hi @alkanen, when I run the command below
python3 pointer-generator/run_summarization.py --mode=decode --data_path=/Users/setup/text_abstraction/cnn-dailymail/finished_files/chunked/train_* --vocab_path=/Users/setup/text_abstraction/finished_files/vocab --log_root=/Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train --exp_name="model-238410.data-00000-of-00001" --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100
I'm getting the logs below:
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
(and the same message keeps repeating)
Where has it gone wrong?
Hi @Sharathnasa, you can clone the repository below: https://github.com/dondon2475848/make_datafiles_for_pgn and run
python make_datafiles.py ./stories ./output
It processes your test data into the binary format.
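For context, that binary format is (as far as I can tell from the preprocessing code) a file of length-prefixed serialized tf.Example protos with 'article' and 'abstract' features. A minimal sketch of writing a single article that way, with placeholder file names; the real make_datafiles.py also wraps each abstract sentence in <s> ... </s> tags:

import struct
from tensorflow.core.example import example_pb2

def write_single_example(article_text, abstract_text, out_path):
    # Serialize one (article, abstract) pair in the length-prefixed tf.Example
    # format the pointer-generator's batcher reads.
    tf_example = example_pb2.Example()
    tf_example.features.feature['article'].bytes_list.value.extend(
        [article_text.encode('utf-8')])
    tf_example.features.feature['abstract'].bytes_list.value.extend(
        [abstract_text.encode('utf-8')])
    serialized = tf_example.SerializeToString()
    with open(out_path, 'wb') as writer:
        writer.write(struct.pack('q', len(serialized)))
        writer.write(struct.pack('%ds' % len(serialized), serialized))

# write_single_example(tokenized_article, "", "output/test.bin")  # example usage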
@dondon2475848 I tried your repo with a sample .txt file under the stories folder, but the .bin files didn't get created; only the tokenized file did. I am not sure why.
Did you put xxx.txt under the stories folder? Maybe you can try the xxx.story format, like below:
test1.story
MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth.
A South Korean bioengineer was one of three people on board the Soyuz capsule.
The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said.
Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry.
Search helicopters took 25 minutes to find the capsule and determine that the crew was unharmed.
Officials said the craft followed a very steep trajectory that subjects the crew to gravitational forces of up to 10 times those on Earth.
Interfax reported that the spacecraft's landing was rough.
This is not the first time a spacecraft veered from its planned trajectory during landing.
In October, the Soyuz capsule landed 70 kilometers from the planned area because of a damaged control cable. The capsule was carrying two Russian cosmonauts and the first Malaysian astronaut. E-mail to a friend
@highlight
Soyuz capsule lands hundreds of kilometers off-target
@highlight
Capsule was carrying South Korea's first astronaut
@highlight
Landing is second time Soyuz capsule has gone awry
@Sharathnasa I don't know if you still have this issue, but I think I figured it out. I had the same issue with the traceback. Do you run on TensorFlow 1.5? You can check my repo: I forked the code of @becxer in Python 3 and modified it for TensorFlow 1.5 (still loading the TF 1.2.1 model presented in @abisee's repo). It wasn't much work; TF 1.5 has really bad support for tf.tags, so I modified the code to make it work.
If you look at your error, go to utils.py and print the exception in the load_checkpoint() function. For me it came from the fact that 4 words in vocab_meta.tsv were not added to the vocab, so I had a shape issue. I made a small correction in the code to format the considered words and add them to the vocab, and it worked like a charm. You can check my code and tell me if there is a bug or anything; I will work it out!
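For anyone following along, printing the exception as suggested is just a matter of wrapping the checkpoint restore. A rough sketch (the function and variable names here are illustrative; adapt them to the actual load_checkpoint()/load_ckpt() in your copy of util.py):

import time
import tensorflow as tf

def load_checkpoint_verbose(saver, sess, ckpt_dir):
    # Keep retrying like the original code, but print the underlying exception
    # instead of only 'Failed to load checkpoint ... Sleeping for 10 secs'.
    while True:
        try:
            ckpt_state = tf.train.get_checkpoint_state(ckpt_dir)
            ckpt_path = ckpt_state.model_checkpoint_path
            saver.restore(sess, ckpt_path)
            return ckpt_path
        except Exception as e:  # bad path, shape mismatch, missing vocab entries, ...
            print("Failed to load checkpoint from %s: %r" % (ckpt_dir, e))
            time.sleep(10)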