
How to run the trained factuality model (ENT-C_sent-factuality) on non-preprocessed (input, summary) pairs

Open gaozhiguang opened this issue 3 years ago • 7 comments

Hi, thanks for the nice work. How can I use the ENT-C_sent-factuality model, which was trained on data synthesized from CNN/DM, on non-preprocessed (input, summary) pairs? Thanks again.

gaozhiguang avatar Jun 15 '21 09:06 gaozhiguang

The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
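A minimal sketch of that setup (the paths below are placeholders, not the repo's actual layout):

```shell
# Placeholder paths -- point these at your own checkout / model download.
export MODEL_TYPE=electra_sentence
export INPUT_DIR=/path/to/ENT-C_sent-factuality
echo "$MODEL_TYPE $INPUT_DIR"
```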

For CNN specifically, lowercase the input article and the summary, and run them through a PTB tokenizer.
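For example, PTB-style tokenization can be approximated with NLTK's TreebankWordTokenizer. This is a sketch, assuming NLTK is installed; it is not necessarily the repo's exact preprocessing:

```python
from nltk.tokenize import TreebankWordTokenizer  # Penn Treebank-style tokenizer

tokenizer = TreebankWordTokenizer()
summary = "Teams haven't found any trace of the airliner."
# Lowercase, then split with PTB conventions
# (contractions and punctuation become separate tokens).
tokens = tokenizer.tokenize(summary.lower())
print(" ".join(tokens))
```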

tagoyal avatar Jun 16 '21 22:06 tagoyal

Hi, thanks. But I am still a little confused: what is a PTB tokenizer? The code in train_utils.py uses Stanford CoreNLP for tokenization. In addition, in some cases the function get_tokenized_text(input_text, nlp) raises an error. I believe the reason is that the following call cannot run properly:

    tokenized_json = nlp.annotate(input_text, properties={'annotators': 'tokenize',
                                                          'outputFormat': 'json',
                                                          'ssplit.isOneSentence': True})

Do you know why? Thanks again.

gaozhiguang avatar Jun 17 '21 13:06 gaozhiguang

Ah yes, you are right! There's no need to tokenize if you use that script.

Can you send specific examples on which you see that error?

tagoyal avatar Jun 17 '21 16:06 tagoyal

Hi, this will cause an error:

    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP('http://localhost:9000')

    line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 % of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"

    parse = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,depparse',
                                           'outputFormat': 'json',
                                           'ssplit.isOneSentence': True})
    print(parse)

It fails with:

    Could not handle incoming annotation

gaozhiguang avatar Jun 18 '21 03:06 gaozhiguang

Hi, I tried again. When I set line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones" it works well. The only difference is "60 %" versus "60": once I remove the "%" symbol, the call succeeds.
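A plausible explanation (an assumption, not confirmed against pycorenlp internals) is that the text reaches the CoreNLP server unescaped, so a bare "%" is misread as the start of a URL percent-escape. One workaround, sketched here with the standard library and not tested against a live server, is to percent-encode the line before sending it:

```python
from urllib.parse import quote

line = "they 've covered about 60 % of the priority search zone"
# Percent-encode the text so a literal "%" survives the HTTP request:
# a bare "%" can otherwise be misread as a URL escape sequence.
safe_line = quote(line)
print(safe_line)
```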

gaozhiguang avatar Jun 18 '21 03:06 gaozhiguang

> The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
>
> For CNN specifically, lower case the input article and the summary, and run it through a PTB tokenizer.

Hi, the instructions in 'Running pre-trained factuality models' are for preprocessed dev files; I don't know how to convert my data into that format. Thanks.

gaozhiguang avatar Jul 08 '21 12:07 gaozhiguang

Hi, there are detailed instructions further down in the readme for how to run on non-preprocessed data.

But, very briefly, you can run the following to evaluate non-preprocessed summaries:

python3 evaluate_generated_outputs.py \
        --model_type electra_dae \
        --model_dir $MODEL_DIR  \
        --input_file sample_test.txt

The format of sample_test is included in the README, as is some additional information, such as lowercasing and tokenization requirements.

tagoyal avatar Jul 09 '21 20:07 tagoyal