gensim-data
WikiQA Corpus for Question Answering
Link : https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
Paper: https://aclweb.org/anthology/D15-1237
Description: WikiQA is a well-studied dataset for question answering (QA) systems. It has a predefined train/dev/test split and comes in .tsv and .txt formats. Each question (q) has several candidate documents (d1, d2, ...), and every question-document pair has a relevance value: 1 for relevant, 0 for not relevant.
q1 - d1 - 0
q1 - d2 - 1
q1 - d3 - 0
q2 - d4 - 1
q2 - d5 - 0
.
.
.
Here is an example from the dataset:
QuestionID | Question | DocumentID | DocumentTitle | SentenceID | Sentence | Label |
---|---|---|---|---|---|---|
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-0 | Cross section of sclerenchyma fibers in plant ground tissue | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-1 | Microscopic view of a histologic specimen of human lung tissue stained with hematoxylin and eosin . | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-2 | In Biology , Tissue is a cellular organizational level intermediate between cells and a complete organism . | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-3 | A tissue is an ensemble of similar cells from the same origin that together carry out a specific function. | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-4 | Organs are then formed by the functional grouping together of multiple tissues. | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-5 | The study of tissue is known as histology or, in connection with disease, histopathology . | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-6 | The classical tools for studying tissues are the paraffin block in which tissue is embedded and then sectioned, the histological stain , and the optical microscope . | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-7 | In the last couple of decades, developments in electron microscopy , immunofluorescence , and the use of frozen tissue sections have enhanced the detail that can be observed in tissues. | 0 |
Q8 | How are epithelial tissues joined together? | D8 | Tissue (biology) | D8-8 | With these tools, the classical appearances of tissues can be examined in health and disease, enabling considerable refinement of clinical diagnosis and prognosis . | 0 |
Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-0 | BMC Software, Inc. is an American company specializing in Business Service Management (BSM) software. | 0 |
Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-1 | Headquartered in Houston , Texas , BMC develops, markets and sells software used for multiple functions, including IT service management, data center automation, performance management, virtualization lifecycle management and cloud computing management. | 0 |
Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-2 | The name "BMC" is taken from the surnames of its three founders—Scott Boulette, John Moores, and Dan Cloer. | 0 |
Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-3 | Employing over 6,000, BMC is often credited with pioneering the BSM concept as a way to help better align IT operations with business needs. | 1 |
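The grouping of rows under a shared QuestionID can be sketched as follows. This is a minimal illustration, not part of the reader in this issue: `group_by_question` is a hypothetical helper, and the in-memory sample simply reuses three rows from the table above in the tab-separated layout of the .tsv file.

```python
import csv
import io

HEADER = ["QuestionID", "Question", "DocumentID", "DocumentTitle",
          "SentenceID", "Sentence", "Label"]
# Three rows copied from the example table, used here as stand-in file content.
ROWS = [
    ["Q8", "How are epithelial tissues joined together?", "D8", "Tissue (biology)",
     "D8-0", "Cross section of sclerenchyma fibers in plant ground tissue", "0"],
    ["Q11", "how big is bmc software in houston, tx", "D11", "BMC Software", "D11-0",
     "BMC Software, Inc. is an American company specializing in Business Service "
     "Management (BSM) software.", "0"],
    ["Q11", "how big is bmc software in houston, tx", "D11", "BMC Software", "D11-3",
     "Employing over 6,000, BMC is often credited with pioneering the BSM concept "
     "as a way to help better align IT operations with business needs.", "1"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def group_by_question(tsv_file):
    """Hypothetical helper: map QuestionID -> question text and (sentence, label) pairs."""
    groups = {}
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = groups.setdefault(
            row["QuestionID"], {"question": row["Question"], "candidates": []}
        )
        entry["candidates"].append((row["Sentence"], int(row["Label"])))
    return groups


groups = group_by_question(io.StringIO(SAMPLE_TSV))
for question_id, entry in groups.items():
    labels = [label for _, label in entry["candidates"]]
    print(question_id, entry["question"], labels)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf8")`, where `path` points at one of the .tsv splits.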
I also provide a data reader which will make the dataset easily available for use.
"""This file contains WikiReaderIterable and WikiReaderStatic for handling the WikiQA dataset

Use WikiReaderIterable when you want data in the format of query, docs, labels separately
Example:
    query_iterable = WikiReaderIterable('query', path_to_file)

Use WikiReaderStatic when you want a dump of the test data with the doc_ids and query_ids
It is useful for saving predictions in the TREC format

A datapoint in this dataset has a query, a document and their relevance (0: irrelevant, 1: relevant)

Example data point:
QuestionID Question DocumentID DocumentTitle SentenceID Sentence Label
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-0 Cross section of sclerenchyma fibers in plant ground tissue 0
"""
import numpy as np
import re
import csv
class WikiReaderIterable:
    """Returns an iterable for the given `iter_type` after extracting from the WikiQA tsv

    Parameters
    ----------
    iter_type : {'query', 'doc', 'label'}
        The type of data point
    fpath : str
        Path to the .tsv file
    """

    def __init__(self, iter_type, fpath):
        self.type_translator = {'query': 0, 'doc': 1, 'label': 2}
        self.iter_type = iter_type
        with open(fpath, encoding='utf8') as tsv_file:
            tsv_reader = csv.reader(tsv_file, delimiter='\t', quotechar='"', quoting=csv.QUOTE_NONE)
            self.data_rows = []
            for row in tsv_reader:
                self.data_rows.append(row)
    def preprocess_sent(self, sent):
        """Utility function to lower, strip and tokenize each sentence

        Replace this function if you want to handle preprocessing differently

        Parameters
        ----------
        sent : str
        """
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()
    def __iter__(self):
        # Defining some constants for .tsv reading
        # These refer to the column indexes of certain data
        QUESTION_ID_INDEX = 0
        QUESTION_INDEX = 1
        ANSWER_INDEX = 5
        LABEL_INDEX = 6

        # We will be grouping all documents and labels which belong to one question into
        # one group. This helps in getting MAP scores.
        document_group = []
        label_group = []

        # We keep count of the number of relevant documents so we can remove those
        # questions which do not have even one relevant document
        n_relevant_docs = 0
        n_filtered_docs = 0

        queries = []
        docs = []
        labels = []

        # Row 0 is the header, so iteration starts at row 1
        for i, line in enumerate(self.data_rows[1:], start=1):
            if i < len(self.data_rows) - 1:  # check if out of bounds might occur
                # If the question id index doesn't change
                if self.data_rows[i][QUESTION_ID_INDEX] == self.data_rows[i + 1][QUESTION_ID_INDEX]:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])
                else:
                    # Last row of the current question: close the group
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                    if n_relevant_docs > 0:
                        docs.append(document_group)
                        labels.append(label_group)
                        queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))
                        yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                    else:
                        # Filter out a question if it doesn't have a single relevant document
                        n_filtered_docs += 1

                    n_relevant_docs = 0
                    document_group = []
                    label_group = []
            else:
                # If we are on the last line
                document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                if n_relevant_docs > 0:
                    docs.append(document_group)
                    labels.append(label_group)
                    queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))
                    # Return the index of the doc requested
                    yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                else:
                    n_filtered_docs += 1
                    n_relevant_docs = 0
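The reader groups labels per question because that helps in getting MAP scores. As a rough sketch of what such a computation could look like (this is not code from the reader; `average_precision`, `mean_average_precision` and the score values are hypothetical), each label group is paired with one model score per candidate:

```python
def average_precision(labels, scores):
    """Average precision for one question from binary labels and model scores."""
    # Rank candidates by descending score; sorted() is stable, so ties keep input order.
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(label_groups, score_groups):
    """Mean of the per-question average precisions."""
    aps = [average_precision(labels, scores)
           for labels, scores in zip(label_groups, score_groups)]
    return sum(aps) / len(aps) if aps else 0.0


# Label groups shaped like the output of WikiReaderIterable('label', ...);
# the scores are made-up model outputs, one per candidate sentence.
label_groups = [[0, 1, 0], [1, 0]]
score_groups = [[0.2, 0.9, 0.4], [0.3, 0.8]]
print(mean_average_precision(label_groups, score_groups))  # 0.75
```

Questions without any relevant candidate would contribute an average precision of 0 here, which is one reason to filter them out before scoring, as the reader does.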
Thanks @aneesh-joshi, useful dataset :+1: (don't forget to raise issues for the other datasets that you used for evaluation)
How can we use this dataset in a question answering system?
@anagha1198 If you group by the first column (i.e., group all the rows with the same QuestionID), you get all the corresponding documents. The Question column is then the query, and those documents are the options from which to pick the most relevant answer.
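To make that concrete, here is a minimal sketch of answer selection over one question group. The word-overlap score is only a stand-in for a trained relevance model, and `tokenize`, `overlap_score` and `pick_answer` are hypothetical names; the tokenization mirrors `preprocess_sent` from the reader above.

```python
import re


def tokenize(text):
    # Same normalisation as preprocess_sent: lowercase, keep alphanumerics, split
    return re.sub("[^a-zA-Z0-9]", " ", text.strip().lower()).split()


def overlap_score(question, sentence):
    """Fraction of distinct question tokens that also occur in the sentence."""
    question_tokens = set(tokenize(question))
    sentence_tokens = set(tokenize(sentence))
    if not question_tokens:
        return 0.0
    return len(question_tokens & sentence_tokens) / len(question_tokens)


def pick_answer(question, candidates):
    """Return the candidate with the highest score (the first one on ties)."""
    return max(candidates, key=lambda sentence: overlap_score(question, sentence))


question = "how big is bmc software in houston, tx"
candidates = [
    "BMC Software, Inc. is an American company specializing in Business Service "
    "Management (BSM) software.",
    "Employing over 6,000, BMC is often credited with pioneering the BSM concept "
    "as a way to help better align IT operations with business needs.",
]
print(pick_answer(question, candidates))
```

On this pair the overlap baseline picks the first sentence, although the dataset labels the second one as relevant. That is the gap a learned model is meant to close: swap `overlap_score` for the model's relevance score and keep the grouping and selection logic unchanged.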