SubjQA: Fix human_ans_spans entries to match snippets from human_ans_indices

Open lewtun opened this issue 3 years ago • 0 comments

This PR fixes a mismatch between some entries of the human_ans_spans column and the corresponding span of text in the review column. For example, on line 17 of electronics/splits/train.csv the human_ans_spans column contains the text

The anti - glare function does work as described

but the actual text in the review column has no spaces around the hyphen in "anti-glare":

The anti-glare function does work as described

Looking at other examples, it seems that some post-processing step has inserted whitespace around punctuation characters. Taking the start and end indices in the human_ans_indices column as the ground truth, I find the following mismatch percentages per domain/split:

file	restaurants	electronics	books	grocery	movies	tripadvisor
test.csv	7.13	7.52	7.37	10.44	7.39	9.90
dev.csv	10.88	7.28	5.28	10.09	4.68	7.50
train.csv	8.59	7.76	8.06	9.82	8.48	8.82
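
The percentages above could be reproduced along the following lines. This is only a minimal sketch, not the script actually used: the helper name `mismatch_percentage` is hypothetical, and it assumes each row exposes `review`, `human_ans_spans` and `human_ans_indices` as strings, with the indices stored as a literal pair such as "(0, 12)" and an exclusive end index.

```python
import ast
from typing import Iterable, Mapping


def mismatch_percentage(rows: Iterable[Mapping[str, str]]) -> float:
    """Percentage of rows whose stored span differs from review[start:end]."""
    total = 0
    mismatches = 0
    for row in rows:
        # Assumption: the indices are stored as a string such as "(4, 12)".
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        expected = row["review"][start_idx:end_idx]
        total += 1
        if row["human_ans_spans"] != expected:
            mismatches += 1
    return round(100 * mismatches / total, 2) if total else 0.0


# Toy rows built around the example above; the indices are made up
# for illustration and are not taken from the dataset.
example_rows = [
    {
        "review": "The anti-glare function does work as described",
        "human_ans_spans": "The anti - glare function does work as described",
        "human_ans_indices": "(0, 46)",
    },
    {
        "review": "Great screen and battery",
        "human_ans_spans": "Great screen",
        "human_ans_indices": "(0, 12)",
    },
]
print(mismatch_percentage(example_rows))
```

For a real split file, the rows could be supplied by `csv.DictReader` opened on that file.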

To fix this, I used the following function, which overwrites each human_ans_spans entry with the text sliced from the review column:

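import ast
from pathlib import Path

import pandas as pd


def fix_answer_spans(path_to_file: Path):
    """Overwrite human_ans_spans with the text sliced from the review column."""
    def extract_answer_spans(row: pd.Series):
        # human_ans_indices holds a literal pair such as "(start, end)"; parse it safely
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["review"][start_idx:end_idx]

    # Read the file, recompute the spans, and write the result back in place
    df = pd.read_csv(path_to_file)
    df["human_ans_spans"] = df.apply(extract_answer_spans, axis=1)
    df.to_csv(path_to_file, index=False)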
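
Applying the function to every domain and split might look like the sketch below. The helper `iter_split_files` is hypothetical. The `<domain>/splits/<split>.csv` layout is an assumption extrapolated from the single path electronics/splits/train.csv mentioned above, and the root directory is a placeholder.

```python
from pathlib import Path
from typing import Iterator

# Domain and split names as they appear in the mismatch table.
DOMAINS = ["restaurants", "electronics", "books", "grocery", "movies", "tripadvisor"]
SPLITS = ["train", "dev", "test"]


def iter_split_files(root: Path) -> Iterator[Path]:
    """Yield one CSV path per domain/split (assumed directory layout)."""
    for domain in DOMAINS:
        for split in SPLITS:
            yield root / domain / "splits" / f"{split}.csv"


for csv_path in iter_split_files(Path(".")):
    if csv_path.exists():
        # Call fix_answer_spans(csv_path) here; it rewrites the file in place.
        print(f"would fix {csv_path}")
```

Because the fix overwrites each file, it is worth running it on a copy or a clean checkout first.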

Mar 12 '21 09:03 lewtun
