
WikiQA Corpus for Question Answering

Open · aneesh-joshi opened this issue · 4 comments

Link : https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip

Paper: https://aclweb.org/anthology/D15-1237

Description: WikiQA is a question-answering dataset that is well studied for QA systems. It has a predefined train/dev/test split and comes in .tsv and .txt formats. Each question (q) has several candidate documents (d1, d2, ...), and every question-document pair carries a relevance label: 1 = relevant, 0 = not relevant.

q1 - d1 - 0
q1 - d2 - 1
q1 - d3 - 0
q2 - d4 - 1
q2 - d5 - 0
.
.
.

Here is an example from the dataset:

QuestionID Question DocumentID DocumentTitle SentenceID Sentence Label
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-0 Cross section of sclerenchyma fibers in plant ground tissue 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-1 Microscopic view of a histologic specimen of human lung tissue stained with hematoxylin and eosin . 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-2 In Biology , Tissue is a cellular organizational level intermediate between cells and a complete organism . 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-3 A tissue is an ensemble of similar cells from the same origin that together carry out a specific function. 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-4 Organs are then formed by the functional grouping together of multiple tissues. 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-5 The study of tissue is known as histology or, in connection with disease, histopathology . 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-6 The classical tools for studying tissues are the paraffin block in which tissue is embedded and then sectioned, the histological stain , and the optical microscope . 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-7 In the last couple of decades, developments in electron microscopy , immunofluorescence , and the use of frozen tissue sections have enhanced the detail that can be observed in tissues. 0
Q8 How are epithelial tissues joined together? D8 Tissue (biology) D8-8 With these tools, the classical appearances of tissues can be examined in health and disease, enabling considerable refinement of clinical diagnosis and prognosis . 0
Q11 how big is bmc software in houston, tx D11 BMC Software D11-0 BMC Software, Inc. is an American company specializing in Business Service Management (BSM) software. 0
Q11 how big is bmc software in houston, tx D11 BMC Software D11-1 Headquartered in Houston , Texas , BMC develops, markets and sells software used for multiple functions, including IT service management, data center automation, performance management, virtualization lifecycle management and cloud computing management. 0
Q11 how big is bmc software in houston, tx D11 BMC Software D11-2 The name "BMC" is taken from the surnames of its three founders—Scott Boulette, John Moores, and Dan Cloer. 0
Q11 how big is bmc software in houston, tx D11 BMC Software D11-3 Employing over 6,000, BMC is often credited with pioneering the BSM concept as a way to help better align IT operations with business needs. 1
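
To get a feel for the grouping, the standard library is enough. A minimal sketch (the path is an assumption about where WikiQACorpus.zip was extracted):

import csv
from itertools import groupby
from operator import itemgetter

with open('WikiQACorpus/WikiQA-train.tsv', encoding='utf8') as f:  # assumed path
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    # Rows for one question are consecutive, so grouping on QuestionID
    # collects each question together with all of its candidate sentences
    for qid, rows in groupby(reader, key=itemgetter(0)):
        rows = list(rows)
        question = rows[0][1]
        candidates = [(row[5], int(row[6])) for row in rows]  # (sentence, label)
        print(qid, question, len(candidates))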

I also provide a data reader that makes the dataset easy to use.

aneesh-joshi · Aug 12 '18 21:08

"""This file contains WikiReaderIterable and WikiReaderStatic for handling the WikiQA dataset

Use WikiReaderIterable when you want the queries, docs, and labels separately
Example:
query_iterable = WikiReaderIterable('query', path_to_file)

Use WikiReaderStatic when you want a dump of the test data with the doc_ids and query_ids
It is useful for saving predictions in the TREC format
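For reference, each line of a TREC run file has the form:
    <query_id> Q0 <doc_id> <rank> <score> <run_name>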

A data point in this dataset has a query, a document, and their relevance (0: irrelevant, 1: relevant)

Example data point:
QuestionID  Question    DocumentID  DocumentTitle   SentenceID  Sentence    Label
Q8  How are epithelial tissues joined together? D8  Tissue (biology)    D8-0    Cross section of sclerenchyma fibers in plant ground tissue 0

"""
import numpy as np
import re
import csv

class WikiReaderIterable:
    """Returns an iterable for the given `iter_type` after extracting from the WikiQA tsv

    Parameters
    ----------
    iter_type : {'query', 'doc', 'label'}
        The type of data point
    fpath : str
        Path to the .tsv file
    """

    def __init__(self, iter_type, fpath):
        self.type_translator = {'query': 0, 'doc': 1, 'label': 2}
        self.iter_type = iter_type
        with open(fpath, encoding='utf8') as tsv_file:
            tsv_reader = csv.reader(tsv_file, delimiter='\t', quotechar='"', quoting=csv.QUOTE_NONE)
            self.data_rows = []
            for row in tsv_reader:
                self.data_rows.append(row)

    def preprocess_sent(self, sent):
        """Utility function to lower, strip and tokenize each sentence
        Replace this function if you want to handle preprocessing differently

        Parameters
        ----------
        sent : str
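
        Example
        -------
        'How are epithelial tissues joined together?' becomes
        ['how', 'are', 'epithelial', 'tissues', 'joined', 'together']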
        """
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

    def __iter__(self):
        # Column indexes of the fields we need in the .tsv
        QUESTION_ID_INDEX = 0
        QUESTION_INDEX = 1
        ANSWER_INDEX = 5
        LABEL_INDEX = 6

        # Group all documents and labels belonging to one question;
        # per-question groups are what MAP evaluation needs
        document_group = []
        label_group = []

        # Count relevant documents per question so we can skip questions
        # that do not have even one relevant document
        n_relevant_docs = 0
        n_filtered_docs = 0  # number of questions filtered out

        queries = []
        docs = []
        labels = []

        for i in range(1, len(self.data_rows)):  # skip the header row
            if i < len(self.data_rows) - 1:  # the next row exists, so we can peek ahead
                # If the next row still belongs to the same question
                if self.data_rows[i][QUESTION_ID_INDEX] == self.data_rows[i + 1][QUESTION_ID_INDEX]:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])
                else:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))

                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                    if n_relevant_docs > 0:
                        docs.append(document_group)
                        labels.append(label_group)
                        queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))

                        yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                    else:
                        # Filter out a question if it doesn't have a single relevant document
                        n_filtered_docs += 1

                    n_relevant_docs = 0
                    document_group = []
                    label_group = []

            else:
                # If we are on the last line
                document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                if n_relevant_docs > 0:
                    docs.append(document_group)
                    labels.append(label_group)
                    queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))
                    # Yield the field selected by iter_type
                    yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                else:
                    n_filtered_docs += 1
                    n_relevant_docs = 0
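
A minimal usage sketch (the path below is an assumption about where WikiQACorpus.zip was extracted):

if __name__ == '__main__':
    fpath = 'WikiQACorpus/WikiQA-train.tsv'  # assumed location of the train split
    queries = WikiReaderIterable('query', fpath)
    docs = WikiReaderIterable('doc', fpath)
    labels = WikiReaderIterable('label', fpath)

    # The three iterables stay aligned: the i-th query corresponds to the
    # i-th group of candidate documents and the i-th group of labels
    for query, doc_group, label_group in zip(queries, docs, labels):
        print(query)
        print(len(doc_group), 'candidates with labels', label_group)
        break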

aneesh-joshi · Aug 12 '18 21:08

Thanks @aneesh-joshi, useful dataset :+1: (don't forget to raise issues for the other datasets that you used for evaluation)

menshikh-iv · Aug 13 '18 04:08

How can we use this dataset in a question answering system?

anagha1198 · Nov 03 '19 09:11

@anagha1198 If you group by the first column (i.e., collect all rows that share the same QuestionID), you get every candidate document for that question. The question acts as the query, and the grouped documents are the options from which the system picks the most relevant answer, as in the sketch below.
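
A toy sketch of that flow, reusing the WikiReaderIterable posted above (the token-overlap scorer is just a stand-in for a real relevance model, and the path is an assumption):

def overlap_score(query_tokens, doc_tokens):
    # Placeholder relevance score: number of shared tokens
    return len(set(query_tokens) & set(doc_tokens))

fpath = 'WikiQACorpus/WikiQA-test.tsv'  # assumed location of the test split
queries = WikiReaderIterable('query', fpath)
docs = WikiReaderIterable('doc', fpath)

for query, doc_group in zip(queries, docs):
    # Rank this question's candidate sentences; the top one is the predicted answer
    best = max(doc_group, key=lambda doc: overlap_score(query, doc))
    print(' '.join(query), '->', ' '.join(best))
    break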

aneesh-joshi · Nov 03 '19 16:11