FollowUp
A public dataset for follow-up query analysis, accepted at AAAI 2019.
FollowUp Dataset
Recent work on Natural Language Interfaces to Databases (NLIDB) has attracted considerable attention. NLIDB allows users to search databases using natural language instead of SQL-like query languages. While saving users from having to learn query languages, multi-turn interaction with NLIDB usually involves multiple queries, where contextual information is vital to understand the users' query intents. In this paper, we address a typical contextual understanding problem, termed follow-up query analysis. Our work summarizes typical follow-up query scenarios and provides the new FollowUp dataset with 1,000 query triples on 120 tables.
Citation
If you use FollowUp in your research work, please consider citing our work:
Qian Liu, Bei Chen, Jian-Guang Lou, Ge Jin and Dongmei Zhang. 2019. FANDA: A Novel Approach to Perform Follow-up Query Analysis. In AAAI.
@inproceedings{liu2019fanda,
  title={\textsc{FAnDa}: A Novel Approach to Perform Follow-up Query Analysis},
  author={Liu, Qian and Chen, Bei and Lou, Jian-Guang and Jin, Ge and Zhang, Dongmei},
  booktitle={AAAI},
  year={2019}
}
Evaluation
You can easily evaluate your model output on the FollowUp dataset with our data/eval.py script. Put your model predictions (one string per case) in the file data/predict.example, then run data/eval.py as follows:
python eval.py
You will get the evaluation result on the test set of FollowUp. For example, the provided example predictions yield:
================================================================================
FollowUp Dataset Evaluation Result
================================================================================
BLEU Score: 100.00 (%)
Symbol Acc: 100.00 (%)
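
For reference, here is a minimal sketch of producing data/predict.example; it assumes one prediction per line, aligned with the order of test.tsv, so check data/eval.py for the exact format it expects:

# Sketch: write model predictions to data/predict.example.
# Assumption: eval.py expects one predicted fused query per line,
# in the same order as test.tsv -- verify against data/eval.py.
predictions = [
    "show champions for different all-star game.",  # your model outputs go here
]

with open("data/predict.example", "w", encoding="utf-8") as f:
    for pred in predictions:
        f.write(pred.strip() + "\n")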
Processed Data
To alleviate the burden of preprocessing, we provide our processed datasets in the folder data_processed; the preprocessing script will be released soon. The original dataset is placed under the data folder.
Tables
tables.jsonl: stores the table information; every line (table) is a JSON object. header gives the column names, types gives the column types inherited from WikiSQL, id lists the originating table ids in WikiSQL, and rows holds the values of the whole table. A line looks like the following:
{
  "header": [
    "Date",
    "Opponent",
    "Venue",
    "Result",
    "Attendance",
    "Competition"
  ],
  "page_title": "2007–08 Guildford Flames season",
  "types": [
    "real",
    "text",
    "text",
    "text",
    "real",
    "text"
  ],
  "page_id": 15213262,
  "id": [
    "2-15213262-12",
    "2-15213262-7"
  ],
  "section_title": "March",
  "rows": [
    [
      "6",
      "Milton Keynes Lightning",
      "Away",
      "Lost 3-5 (Lightning win 11-6 on aggregate)",
      "537",
      "Knockout Cup Semi-Final 2nd Leg"
    ],
    [
      "8",
      "Romford Raiders",
      "Home",
      "Won 7-3",
      "1,769",
      "League"
    ],
    ...
    [
      "28",
      "Chelmsford Chieftains",
      "Away",
      "Won 3-2",
      "474",
      "Premier Cup"
    ]
  ],
  "caption": "March"
}
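
For illustration, a short snippet for loading tables.jsonl (standard JSON Lines parsing; the data/ path follows the layout described above):

import json

# Load tables.jsonl: one JSON object (table) per line.
with open("data/tables.jsonl", encoding="utf-8") as f:
    tables = [json.loads(line) for line in f]

first = tables[0]
print(first["header"])     # column names
print(first["types"])      # column types inherited from WikiSQL
print(first["id"])         # originating WikiSQL table ids
print(len(first["rows"]))  # number of rows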
Content
train.tsv and test.tsv: the train/test split of the FollowUp dataset. Every line is a tuple of the format (Precedent Query, Follow-up Query, Fused Query, Table ID), where Table ID is the line index (starting from 1) in tables.jsonl. Fields are separated by TAB (\t). A line looks like the following:
how many champions were there, according to this table? show these champions for different all-star game. show champions for different all-star game. 74
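
A minimal sketch for parsing the split (plain TAB-separated reading, matching the four-field format above; the data/ path follows the layout described earlier):

# Parse train.tsv: each line is
# Precedent Query \t Follow-up Query \t Fused Query \t Table ID
examples = []
with open("data/train.tsv", encoding="utf-8") as f:
    for line in f:
        precedent, follow_up, fused, table_id = line.rstrip("\n").split("\t")
        # Table ID is the 1-based line index into tables.jsonl.
        examples.append((precedent, follow_up, fused, int(table_id)))

print(examples[0])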
Contact
If you have any questions or any difficulty applying your model to the FollowUp dataset, please feel free to contact me: qian.liu AT buaa dot edu dot cn. You can also create a new issue and I will address it as soon as possible.