Fix human_ans_spans entries to match snippets from human_ans_indices

Open lewtun opened this issue 3 years ago • 0 comments

This PR fixes a mismatch between some entries of the human_ans_spans column and the corresponding span of text in the review column. For example, line 17 of electronics/splits/train.csv has in the human_ans_spans column the text

The anti - glare function does work as described

but the actual text in the review column has no spaces around the hyphen in "anti-glare":

The anti-glare function does work as described

Looking at other examples, it seems that some sort of post-processing has been applied to create whitespace around punctuation characters. By using the start and end indices in the human_ans_indices column as the ground truth, I find the following mismatch percentages per domain/split:

split      restaurants  electronics  books  grocery  movies  tripadvisor
test.csv          7.13         7.52   7.37    10.44    7.39          9.9
dev.csv          10.88         7.28   5.28    10.09    4.68          7.5
train.csv         8.59         7.76   8.06     9.82    8.48         8.82
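
For reference, a check of this kind can be sketched roughly as follows. This is not the exact script used to produce the numbers above, only a minimal illustration: it assumes each row exposes the review, human_ans_indices and human_ans_spans columns as strings, and the helper names mismatch_percentage and mismatch_percentage_for_file are hypothetical. It uses only the standard library so that it runs without pandas.

```python
import ast
import csv
from pathlib import Path


def mismatch_percentage(rows) -> float:
    """Percentage of rows whose stored span differs from the indexed slice of the review."""
    total = mismatched = 0
    for row in rows:
        # Assumption: human_ans_indices holds a (start, end) pair stored as a string
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        expected = row["review"][start_idx:end_idx]
        total += 1
        if row["human_ans_spans"] != expected:
            mismatched += 1
    # Avoid dividing by zero when there are no rows
    return round(100 * mismatched / total, 2) if total else 0.0


def mismatch_percentage_for_file(path_to_file: Path) -> float:
    """Apply the same check to every row of one CSV split."""
    with open(path_to_file, newline="", encoding="utf-8") as f:
        return mismatch_percentage(csv.DictReader(f))
```

Calling mismatch_percentage_for_file on a split such as electronics/splits/train.csv would then give one cell of the table, provided the indices are treated as the ground truth.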

To fix this, I used the following function:

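import ast
from pathlib import Path

import pandas as pd


def fix_answer_spans(path_to_file: Path):
    """Overwrite human_ans_spans with the review text sliced by human_ans_indices."""

    def extract_answer_spans(row: pd.Series):
        # human_ans_indices is a string holding a (start, end) pair; parse it without eval
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["review"][start_idx:end_idx]

    df = pd.read_csv(path_to_file)
    df["human_ans_spans"] = df.apply(extract_answer_spans, axis=1)
    df.to_csv(path_to_file, index=False)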

lewtun avatar Mar 12 '21 09:03 lewtun