
How to run the trained factuality model (ENT-C_sent-factuality) on non-preprocessed (input, summary) pairs

Open gaozhiguang opened this issue 3 years ago • 7 comments

Hi, thanks for the nice work. How can I use the ENT-C_sent-factuality model, which was trained on data synthesized from CNN/DM, on non-preprocessed (input, summary) pairs? Thanks again.

gaozhiguang avatar Jun 15 '21 09:06 gaozhiguang

The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
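A minimal sketch of that setup (the paths below are placeholders, not the repo's actual layout):

```shell
# Placeholder paths -- point these at your own checkout / model download.
export MODEL_TYPE=electra_sentence
export INPUT_DIR=/path/to/ENT-C_sent-factuality
echo "$MODEL_TYPE $INPUT_DIR"
```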

For CNN specifically, lowercase the input article and the summary, and run them through a PTB tokenizer.
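For example, PTB-style tokenization can be approximated with NLTK's TreebankWordTokenizer. This is a sketch, assuming NLTK is installed; it is not necessarily the repo's exact preprocessing:

```python
from nltk.tokenize import TreebankWordTokenizer  # Penn Treebank-style tokenizer

tokenizer = TreebankWordTokenizer()
summary = "Teams haven't found any trace of the airliner."
# Lowercase, then split with PTB conventions
# (contractions and punctuation become separate tokens).
tokens = tokenizer.tokenize(summary.lower())
print(" ".join(tokens))
```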

tagoyal avatar Jun 16 '21 22:06 tagoyal

Hi, thanks. But I am still a little confused: what is a PTB tokenizer? The code in train_utils.py uses Stanford CoreNLP for tokenization. In addition, in some cases the function get_tokenized_text(input_text, nlp) raises an error. I believe the reason is that the following call cannot run properly:

    tokenized_json = nlp.annotate(input_text, properties={'annotators': 'tokenize',
                                                          'outputFormat': 'json',
                                                          'ssplit.isOneSentence': True})

Do you know why? Thanks again.

gaozhiguang avatar Jun 17 '21 13:06 gaozhiguang

Ah yes, you are right! There's no need to tokenize if you use that script.

Can you send specific examples on which you see that error?

tagoyal avatar Jun 17 '21 16:06 tagoyal

Hi, this will cause an error:

    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP('http://localhost:9000')

    line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 % of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"

    parse = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,depparse',
                                           'outputFormat': 'json',
                                           'ssplit.isOneSentence': True})
    print(parse)

It fails with:

    Could not handle incoming annotation

gaozhiguang avatar Jun 18 '21 03:06 gaozhiguang

Hi, I tried again. When I set line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones" it works well. The only difference is "60 %" versus "60": once I remove the "%" symbol, the call succeeds.
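A plausible explanation (an assumption, not confirmed against pycorenlp internals) is that the text reaches the CoreNLP server unescaped, so a bare "%" is misread as the start of a URL percent-escape. One workaround, sketched here with the standard library and not tested against a live server, is to percent-encode the line before sending it:

```python
from urllib.parse import quote

line = "they 've covered about 60 % of the priority search zone"
# Percent-encode the text so a literal "%" survives the HTTP request:
# a bare "%" can otherwise be misread as a URL escape sequence.
safe_line = quote(line)
print(safe_line)
```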

gaozhiguang avatar Jun 18 '21 03:06 gaozhiguang

> The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
>
> For CNN specifically, lower case the input article and the summary, and run it through a PTB tokenizer.

Hi, the instructions in 'Running pre-trained factuality models' are for preprocessed dev files; I don't know how to convert my data into that format. Thanks.

gaozhiguang avatar Jul 08 '21 12:07 gaozhiguang

Hi, there are detailed instructions further down in the readme for how to run on non-preprocessed data.

But, very briefly, you can run the following to evaluate non-preprocessed summaries:

python3 evaluate_generated_outputs.py \
        --model_type electra_dae \
        --model_dir $MODEL_DIR  \
        --input_file sample_test.txt

The format of sample_test is included in the README, as is some additional information, such as lowercasing and tokenization requirements.

tagoyal avatar Jul 09 '21 20:07 tagoyal