Text Content Manipulation
NBA Game Dataset for Text Content Manipulation
This is a dataset for the task of text content manipulation, as first proposed in the paper:
Toward Unsupervised Text Content Manipulation
Wentao Wang*, Zhiting Hu*, Zichao Yang, Haoran Shi, Frank Xu, Eric P. Xing; 2019
Data Format
Each example in the dataset consists of four elements, namely `(x, y_aux, x_ref, y_ref)`, where

- `x` is a content record containing a set of data tuples `x = {x_i}`. Each tuple `x_i` contains three fields `(type, value, associated)`. For example, `x_i = (TEAM-AST, 25, Boston)` means "Boston got 25 team assists". More specifically:
  - `type`: the data type of the tuple, e.g., `TEAM-AST`, `PLAYER-PTS`, etc. There are 34 data types in total; see the file `x_type.vocab.txt` for the full list.
  - `value`: the value of the data, usually a scalar number or a string (e.g., a player's name).
  - `associated`: the team or player the tuple is associated with.

  The above three fields of each `x` instance are stored in three parallel files. For example, each line in the file `train/x_type.train.txt` contains the data types of all tuples in one `x` training instance, separated by white spaces. The first line in `train/x_type.train.txt` is `TEAM_NAME TEAM-AST TEAM-AST TEAM_NAME`, meaning that the first `x` instance has 4 tuples with the respective types.

  We also provide joined files of `x`. Each line in `train/x.joined.train.txt` contains all tuples of one `x` training instance, with the three fields of each tuple joined by `|`. For example, the first line in `train/x.joined.train.txt` is `Boston|TEAM_NAME|Boston 25|TEAM-AST|Boston 11|TEAM-AST|New_York New_York|TEAM_NAME|New_York`. These files are simply joined from the separated files and are only used when evaluating the results. (A small parsing sketch is given right after this list.)

- `y_aux` is the auxiliary sentence describing the content of `x`.
- `x_ref` is the content record of the reference sentence `y_ref`, in the same format as `x`. During data construction, we guarantee that `x_ref` has a structure similar to `x`, but with a different number of tuples or different values or types.
- `y_ref` is the reference sentence that defines the desired writing style of the output sentence.
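To make the parallel-file format above concrete, here is a minimal Python sketch, written for this description rather than shipped with the dataset, that reads the three `x_*` training files listed above and zips them back into `(type, value, associated)` tuples. The helper name `load_records` and the exact output shown in the comment are illustrative.

```python
# Illustrative sketch (not part of the dataset release): load the parallel
# x_* files and zip them back into (type, value, associated) tuples.
def load_records(type_path, value_path, assoc_path):
    records = []
    with open(type_path) as ft, open(value_path) as fv, open(assoc_path) as fa:
        for types, values, assocs in zip(ft, fv, fa):
            types, values, assocs = types.split(), values.split(), assocs.split()
            assert len(types) == len(values) == len(assocs)
            records.append(list(zip(types, values, assocs)))
    return records

# Example usage with the training files described above:
records = load_records("train/x_type.train.txt",
                       "train/x_value.train.txt",
                       "train/x_associated.train.txt")
print(records[0])
# Roughly, based on the joined-file example above:
# [('TEAM_NAME', 'Boston', 'Boston'), ('TEAM-AST', '25', 'Boston'), ...]
```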
Data Files
- The dataset is split into train/val/test sets, each in its corresponding folder.
- The four elements `(x, y_aux, x_ref, y_ref)` of each example are stored in parallel files. For example, each line of `train/y_aux.train.txt` is the auxiliary sentence of the respective data example. As explained above, the three fields of `x` are stored separately in three files (taking the training data as an example): `x_type.train.txt`, `x_value.train.txt`, and `x_associated.train.txt`. The joined tuples of `x` are stored in a single file, `x.joined.train.txt`. `x_ref` is stored in the same format, in files like `x_ref_type.train.txt` and `x_ref.joined.train.txt`.
- The vocabulary file `y.vocab.txt` contains all words that occur in `y_aux` and `y_ref`. `x_type.vocab.txt`, `x_value.vocab.txt`, and `x_associated.vocab.txt` are the vocabularies of the `type`, `value`, and `associated` fields of both `x` and `x_ref`.
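As a quick sanity check on the layout above, the following Python sketch (ours, not part of the release) loads a vocabulary file and verifies that the three parallel `x_*` files of a split are aligned. It assumes one vocabulary entry per line, which may need adjusting to the actual file format.

```python
# Illustrative sketch: load a vocabulary and check that parallel files align.
def load_vocab(path):
    # Assumes one entry per line; adjust if the vocab files use another format.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def check_parallel(paths):
    files = [open(p).read().splitlines() for p in paths]
    assert len({len(lines) for lines in files}) == 1, "different number of lines"
    for rows in zip(*files):
        assert len({len(row.split()) for row in rows}) == 1, "tuple counts differ"

print(len(load_vocab("x_type.vocab.txt")))  # should reflect the 34 data types
check_parallel(["train/x_type.train.txt",
                "train/x_value.train.txt",
                "train/x_associated.train.txt"])
```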
Data Statistics
| | train | valid | test |
|---|---|---|---|
| #Instances | 31,751 | 6,833 | 6,999 |
| #Tokens | 1.64M | 0.35M | 0.36M |
| Avg Sentence Length | 25.90 | 25.87 | 25.99 |
| #Data Types | 34 | 34 | 34 |
| Avg Record Length | 4.88 | 4.88 | 4.94 |
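If one wants to recompute rough figures like those in the table, a sketch along the following lines should suffice. It relies on our own assumptions: whitespace tokenization, `#Tokens` counting both `y_aux` and `y_ref`, record length meaning the number of tuples in `x`, and the reference sentences living in a `y_ref.train.txt` file parallel to the others (that exact file name is not listed above).

```python
# Illustrative sketch for recomputing approximate split statistics.
# Assumptions (ours): whitespace tokenization; #Tokens = tokens in y_aux + y_ref;
# record length = number of tuples in x; y_ref stored as y_ref.train.txt.
def split_stats(folder="train", split="train"):
    read = lambda name: open(f"{folder}/{name}.{split}.txt").read().splitlines()
    y_aux, y_ref, x_type = read("y_aux"), read("y_ref"), read("x_type")
    n_instances = len(y_aux)
    n_tokens = sum(len(s.split()) for s in y_aux + y_ref)
    avg_sent_len = sum(len(s.split()) for s in y_aux) / n_instances
    avg_rec_len = sum(len(r.split()) for r in x_type) / len(x_type)
    return n_instances, n_tokens, avg_sent_len, avg_rec_len

print(split_stats())
```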
Dataset Creation Process
We briefly describe the process of creating the above dataset.
This dataset is derived from one of the Data-to-Text datasets (RotoWire) proposed in the paper Challenges in Data-to-Document Generation (Wiseman et al., 2017), which targets NBA game report generation. The original data can be downloaded from here.
The original dataset consists of `(table, paragraph)` pairs. We first split each data example into `(record, sentence)` pairs:

- The original dataset is preprocessed with a modified version of the script provided in the Data-to-Text dataset. In this step, we make sure each entity name (team/city/player) becomes a single token (e.g., `LeBron_James`, `Los_Angeles_Clippers`), and all numbers are replaced by their digit forms (e.g., if the original text is `fifty`, we replace it with `50`). (A toy illustration of this normalization follows this list.)
- We split the paragraph in each data example into sentences, i.e., the `y_aux`.
- We then use the above script to extract all candidate relations between entities and numbers in each sentence `y_aux`. Additional rule-based constraints are imposed to filter out as many redundant relations as possible. The extracted relations form the record `x`. At this point, we have obtained all `(x, y_aux)` pairs.
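The actual normalization is performed by the modified Data-to-Text preprocessing script mentioned above. Purely as a toy illustration of the two operations described in the first step (single-token entity names and digit forms for number words), one could write something like the sketch below; the entity list and number-word map are made-up examples, not the real resources used.

```python
import re

# Toy illustration only -- NOT the preprocessing script used for the dataset.
# Hypothetical, hard-coded examples of the two normalizations described above.
ENTITIES = ["LeBron James", "Los Angeles Clippers"]      # single-token names
NUMBER_WORDS = {"twenty-five": "25", "fifty": "50"}      # digit forms

def normalize(text):
    for name in ENTITIES:
        text = text.replace(name, name.replace(" ", "_"))
    for word, digit in NUMBER_WORDS.items():
        text = re.sub(rf"\b{word}\b", digit, text)
    return text

print(normalize("LeBron James scored fifty points for the Los Angeles Clippers."))
# -> "LeBron_James scored 50 points for the Los_Angeles_Clippers."
```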
We next use a retrieval method to retrieve from the training set an `(x_ref, y_ref)` pair for each of the above `(x, y_aux)` pairs. In particular, as mentioned above, we want to guarantee that `x_ref` has similar, but not exactly the same, content as `x`. Formally, we retrieve the `x_ref` whose type set is closest to that of `x` without being identical, i.e., the record that maximizes `J(types(x), types(x_ref))` subject to `J(types(x), types(x_ref)) < 1`, where `types(x)` is the set of all data types in record `x` and `J(A, B)` is the Jaccard index between two sets `A` and `B`. The larger `J(A, B)` is, the closer `A` and `B` are; when `J(A, B) = 1`, `A` is exactly the same as `B`, and otherwise there is some difference between them. Since we measure similarity between two records based on their types, this criterion finds the `x_ref` that is most similar to, but not exactly the same as, `x`.
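For concreteness, here is a small Python sketch of this type-based Jaccard retrieval. It is our illustration of the criterion just described, not the authors' retrieval code, and the candidate type lists in the usage example are made up.

```python
# Illustrative sketch of the Jaccard-based retrieval described above.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_ref(x_types, candidate_types):
    """Return the index of the candidate whose type set is most similar to
    x_types, excluding exact type matches (J == 1)."""
    best_i, best_j = None, -1.0
    for i, cand in enumerate(candidate_types):
        j = jaccard(x_types, cand)
        if j < 1.0 and j > best_j:
            best_i, best_j = i, j
    return best_i

# Example with the record from the first training instance shown above:
x = ["TEAM_NAME", "TEAM-AST", "TEAM-AST", "TEAM_NAME"]
candidates = [
    ["TEAM_NAME", "TEAM-AST", "TEAM-AST", "TEAM_NAME"],  # identical types: skipped
    ["TEAM_NAME", "TEAM-AST", "TEAM-PTS"],               # partial overlap
    ["PLAYER-PTS", "PLAYER-REB"],                        # no overlap
]
print(retrieve_ref(x, candidates))  # -> 1
```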