SubjQA
Fix human_ans_spans entries to match snippets from human_ans_indices
This PR fixes a mismatch between some entries of the human_ans_spans column and the corresponding span of text in the review column. For example, line 17 of electronics/splits/train.csv has the following text in the human_ans_spans column:
The anti - glare function does work as described
but the actual text in the review column has no spaces around "anti-glare":
The anti-glare function does work as described
Looking at other examples, it seems that some post-processing step inserted whitespace around punctuation characters. Using the start and end indices in the human_ans_indices column as the ground truth, I find the following mismatch percentages per domain/split:
| | restaurants | electronics | books | grocery | movies | tripadvisor |
|---|---|---|---|---|---|---|
| test.csv | 7.13 | 7.52 | 7.37 | 10.44 | 7.39 | 9.9 |
| dev.csv | 10.88 | 7.28 | 5.28 | 10.09 | 4.68 | 7.5 |
| train.csv | 8.59 | 7.76 | 8.06 | 9.82 | 8.48 | 8.82 |
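
A mismatch percentage like the ones above could be computed along the following lines. This is a minimal sketch, not necessarily the exact script used for the table: the helper name mismatch_percentage and the toy rows are hypothetical, and it assumes human_ans_indices stores a string such as "(0, 46)".

```python
import ast

import pandas as pd


def mismatch_percentage(df: pd.DataFrame) -> float:
    """Percentage of rows whose human_ans_spans differs from the review slice."""

    def is_mismatch(row: pd.Series) -> bool:
        # Assumption: human_ans_indices holds a "(start, end)" string
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["human_ans_spans"] != row["review"][start_idx:end_idx]

    return 100 * df.apply(is_mismatch, axis=1).mean()


# Hypothetical toy data: the first row reproduces the "anti - glare" mismatch
toy = pd.DataFrame(
    {
        "review": [
            "The anti-glare function does work as described",
            "Great sound",
        ],
        "human_ans_indices": ["(0, 46)", "(0, 11)"],
        "human_ans_spans": [
            "The anti - glare function does work as described",
            "Great sound",
        ],
    }
)
print(mismatch_percentage(toy))
```

Running the same function on each domain/split CSV would give one number per cell of the table.
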
To fix this, I used the following function:

    import ast
    from pathlib import Path

    import pandas as pd


    def fix_answer_spans(path_to_file: Path):
        def extract_answer_spans(row: pd.Series):
            # Parse the stored index pair with literal_eval rather than eval
            start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
            return row["review"][start_idx:end_idx]

        df = pd.read_csv(path_to_file)
        # Overwrite the spans with the text sliced directly from the review
        df["human_ans_spans"] = df.apply(extract_answer_spans, axis=1)
        df.to_csv(path_to_file, index=False)
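
To apply the fix to every domain and split, the files could be enumerated roughly as follows. This is only a sketch: the helper iter_split_files is hypothetical, and the directory layout is assumed to follow the electronics/splits/train.csv pattern mentioned above for all six domains.

```python
from pathlib import Path
from typing import Iterator

# Domain and split names taken from the mismatch table
DOMAINS = ["restaurants", "electronics", "books", "grocery", "movies", "tripadvisor"]
SPLITS = ["train.csv", "dev.csv", "test.csv"]


def iter_split_files(root: Path) -> Iterator[Path]:
    """Yield every existing <domain>/splits/<split> file under root."""
    for domain in DOMAINS:
        for split in SPLITS:
            # Assumed layout: <root>/<domain>/splits/<split>
            path = root / domain / "splits" / split
            if path.exists():
                yield path
```

Each yielded path could then be passed to fix_answer_spans in a simple for loop.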