FollowUp
A public dataset for follow-up query analysis, accepted at AAAI 2019.
FollowUp Dataset
Recent work on Natural Language Interfaces to Databases (NLIDB) has attracted considerable attention. NLIDB allows users to search databases using natural language instead of SQL-like query languages. While saving users from having to learn query languages, multi-turn interaction with NLIDB usually involves multiple queries, where contextual information is vital to understand the users' query intents. In this paper, we address a typical contextual understanding problem, termed follow-up query analysis. Our work summarizes typical follow-up query scenarios and provides the new FollowUp dataset with 1,000 query triples on 120 tables.
Citation
If you use FollowUp in your research work, please consider citing our work:
Qian Liu, Bei Chen, Jian-Guang Lou, Ge Jin and Dongmei Zhang. 2019. FANDA: A Novel Approach to Perform Follow-up Query Analysis. In AAAI.
@inproceedings{liu2019fanda,
  title={\textsc{FAnDa}: A Novel Approach to Perform Follow-up Query Analysis},
  author={Liu, Qian and Chen, Bei and Lou, Jian-Guang and Jin, Ge and Zhang, Dongmei},
  booktitle={AAAI},
  year={2019}
}
Evaluation
You can easily evaluate your model output on the FollowUp dataset with our data/eval.py script. Put your model predictions (one string per case) in the file data/predict.example, then run data/eval.py as follows:
python eval.py
You will get the evaluation result on the test set of FollowUp. For example, the provided example predictions yield:
================================================================================
FollowUp Dataset Evaluation Result
================================================================================
BLEU Score: 100.00 (%)
Symbol Acc: 100.00 (%)
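
For reference, here is a minimal sketch of producing data/predict.example; it assumes one prediction per line, aligned with the order of test.tsv, so check data/eval.py for the exact format it expects:

# Sketch: write model predictions to data/predict.example.
# Assumption: eval.py expects one predicted fused query per line,
# in the same order as test.tsv -- verify against data/eval.py.
predictions = [
    "show champions for different all-star game.",  # your model outputs go here
]

with open("data/predict.example", "w", encoding="utf-8") as f:
    for pred in predictions:
        f.write(pred.strip() + "\n")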
Processed Data
To alleviate the burden of preprocessing, we provide our processed datasets in the folder data_processed; the preprocessing script will be released soon. The original dataset is placed under the data folder.
Tables
tables.jsonl: stores the table information; every line (table) is a JSON object. header gives the column names, types gives the column types inherited from WikiSQL, id lists the originating table ids in WikiSQL, and rows holds the values of the whole table. A line looks like the following:
{
  "header": [
    "Date",
    "Opponent",
    "Venue",
    "Result",
    "Attendance",
    "Competition"
  ],
  "page_title": "2007–08 Guildford Flames season",
  "types": [
    "real",
    "text",
    "text",
    "text",
    "real",
    "text"
  ],
  "page_id": 15213262,
  "id": [
    "2-15213262-12",
    "2-15213262-7"
  ],
  "section_title": "March",
  "rows": [
    [
      "6",
      "Milton Keynes Lightning",
      "Away",
      "Lost 3-5 (Lightning win 11-6 on aggregate)",
      "537",
      "Knockout Cup Semi-Final 2nd Leg"
    ],
    [
      "8",
      "Romford Raiders",
      "Home",
      "Won 7-3",
      "1,769",
      "League"
    ],
    ...
    [
      "28",
      "Chelmsford Chieftains",
      "Away",
      "Won 3-2",
      "474",
      "Premier Cup"
    ]
  ],
  "caption": "March"
}
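
For illustration, a short snippet for loading tables.jsonl (standard JSON Lines parsing; the data/ path follows the layout described above):

import json

# Load tables.jsonl: one JSON object (table) per line.
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

first = tables[0]
print(first["header"])     # column names
print(first["types"])      # column types inherited from WikiSQL
print(first["id"])         # originating WikiSQL table ids
print(len(first["rows"]))  # number of rows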
Content
train.tsv and test.tsv: the train/test split of the FollowUp dataset. Every line is a tuple of the format (Precedent Query, Follow-up Query, Fused Query, Table ID), where Table ID is the line index (starting from 1) in tables.jsonl. Fields are separated by TAB (\t). A line looks like the following:
how many champions were there, according to this table? show these champions for different all-star game. show champions for different all-star game. 74
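
A minimal sketch for parsing the split (plain TAB-separated reading, matching the four-field format above; the data/ path follows the layout described earlier):

# Parse train.tsv: each line is
# Precedent Query \t Follow-up Query \t Fused Query \t Table ID
examples = []
with open("data/train.tsv", encoding="utf-8") as f:
    for line in f:
        precedent, follow_up, fused, table_id = line.rstrip("\n").split("\t")
        # Table ID is the 1-based line index into tables.jsonl.
        examples.append((precedent, follow_up, fused, int(table_id)))

print(examples[0])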
Contact
If you have any questions or any difficulty applying your model to the FollowUp dataset, please feel free to contact me: qian.liu AT buaa dot edu dot cn. You can also create a new issue and I will address it as soon as possible.