neural-editor
Release datasets
- the dataset of (prototype, revision) sentence pairs
- the sentence analogy evaluation
Any idea when this might be released?
We're working on a clean codalab worksheet with everything in it, but progress has been slow due to multiple conference deadlines over the last few weeks and the difficulty of getting our LSH to play nice with docker.
If you don't want to wait and want to use our disorganized upload files, you can use the following temporary links for the datasets (yelp, 1b-words).
To run, make a directory somewhere (say /editor) and dump the data files into a subdirectory (/editor/yelp_data). Also get the word vectors (word vectors) and put them in /editor/word_vectors. You will additionally need to create the output directory (/editor/edit_runs) and a userid mapping at /editor/user_ids.json (which looks like {"YOUR_UID": "YOUR_USERNAME"}).
If you set TEXTMORPH_DATA=/editor and invoke run_docker.py with the config file using 'yelp_data' for the path, it should work (untested outside our cluster; let us know if this fails).
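To make the layout concrete, here's a minimal sketch of the directory setup described above. A temp directory stands in for /editor so the snippet runs anywhere; in practice you'd point `root` at your actual data root:

```python
import json
import os
import tempfile

# In practice this would be /editor; a temp dir here so the sketch runs anywhere.
root = tempfile.mkdtemp()

# Create the subdirectories described above: data, word vectors, and output.
for sub in ("yelp_data", "word_vectors", "edit_runs"):
    os.makedirs(os.path.join(root, sub), exist_ok=True)

# Minimal userid mapping; replace with your own UID/username.
with open(os.path.join(root, "user_ids.json"), "w") as f:
    json.dump({"YOUR_UID": "YOUR_USERNAME"}, f)

# run_docker.py locates everything via this environment variable.
os.environ["TEXTMORPH_DATA"] = root
```

You'd then drop the downloaded data files into `yelp_data` and the word vector file into `word_vectors` before launching run_docker.py.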
Thanks for the update.
Do these links have the analogy data as well?
The links above are only the training pairs. I wrote a quick script to dump the analogies here.
Let me know if anything's broken with this one, since I just wrote the script to dump it.
Thank you! The data file in the link has 1300 analogy tuples (just confirming nothing's broken).
I would appreciate it if you could comment on the following questions regarding the evaluation:
- I'm assuming that, of the four sentences in each line, the first three were treated as the query and the last one as the answer. Is that how the evaluation was performed?
- What is the candidate set of sentences used for retrieval?
1300 should be right - 13 categories and 100 sentences each.
In each line we have S1, S2, S3, S4. The analogy is S3:S4::S1:?? (sorry, unconventional ordering on my part).
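For concreteness, a hypothetical loader for the dump. It assumes one tab-separated (S1, S2, S3, S4) tuple per line (the actual delimiter in the file may differ) and, reading the ordering above, that S2 is the held-out answer:

```python
def load_analogies(path):
    # Each line holds four sentences S1..S4. The query is S3 : S4 :: S1 : ??,
    # and (per the ordering described above) S2 is the expected answer.
    tuples = []
    with open(path) as f:
        for line in f:
            s1, s2, s3, s4 = line.rstrip("\n").split("\t")
            tuples.append({"query": (s3, s4, s1), "answer": s2})
    return tuples
```

A quick sanity check would be `len(load_analogies(path)) == 1300`, matching the 13 categories of 100 tuples each.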
There's no candidate sentence set. For our model we generate a sentence and check for perfect word overlap, and for the word vector baseline we do a lexical, rather than sentence-level, analogy (i.e., the analogy over the one word that changes between S1-S2 and S3-S4).
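A sketch of that lexical baseline as we understand it (function names and details hypothetical, not the repo's actual code): locate the one substituted word in each pair, solve the word-vector analogy, and take the nearest word by cosine similarity.

```python
import numpy as np

def changed_word(sent_a, sent_b):
    # Return the single word pair that differs between two sentences
    # (assumes equal length and exactly one substitution).
    diffs = [(a, b) for a, b in zip(sent_a.split(), sent_b.split()) if a != b]
    assert len(diffs) == 1
    return diffs[0]

def lexical_analogy(wvecs, s1, s2, s3, s4):
    # The analogy is S3:S4 :: S1:??; the answer differs from S1 by exactly
    # one word, so we solve the analogy over words rather than sentences.
    w3, w4 = changed_word(s3, s4)
    w1, _ = changed_word(s1, s2)  # S2 is used only to locate the changing word
    target = wvecs[w4] - wvecs[w3] + wvecs[w1]

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Rank the remaining vocabulary by cosine similarity to the analogy vector.
    candidates = {w: cos(v, target) for w, v in wvecs.items()
                  if w not in (w1, w3, w4)}
    return max(candidates, key=candidates.get)
```

With GloVe-style vectors, checking whether the true changed word lands in the top 1 or top 10 candidates gives a retrieval-style score for this word-level baseline.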
Thank you for the clarifications!
I assumed it was a retrieval task because I saw 'top 10' in the results table. After reading your comment and the paper again, I now understand that it is treated as a generation problem.
Thank you again for the data and the quick responses!
No problem. It's a somewhat confusing task, since we're doing sentence analogies but only one word changes. We may end up revising the section / task to be clearer as the paper goes through review.
Hi,
Could you please provide the contents of the yelp_dataset_static directory? Or is it just the plain Yelp dataset?
Thanks.
python main.py ../../configs/language_model/default.txt
TrainingRun configuration:
optim {
seed = 0
learning_rate = 0.001
batch_size = 128
max_iters = 400000
}
eval {
num_examples = 32
big_num_examples = 128
eval_steps = 500
big_eval_steps = 5000
save_steps = 5000
alive_steps = 30
}
model {
vocab_size = 10000
word_dim = 300
agenda_dim = 100
hidden_dim = 100
num_layers = 3
kl_weight_steps = 50000
kl_weight_rate = 8
kl_weight_cap = 1.0
dci_keep_rate = 0.8
wvec_path = "glove.6B.300d_yelp.txt"
type = 0
}
dataset {
path = "yelp_dataset_static"
}
Loading embeddings from /data/word_vectors/glove.6B.300d_yelp.txt: 95%|##################9 | 9483/10000 [00:01<00:00, 7882.85it/s]
No GPUs detected. Sticking with CPUs.
No checkpoint to reload. Initializing fresh.
[localhost] local: wc -l /data/yelp_dataset_static/train.txt
Fatal error: local() encountered an error (return code 1) while executing 'wc -l /data/yelp_dataset_static/train.txt'