
Release datasets

Open kelvinguu opened this issue 6 years ago • 9 comments

  • the dataset of (prototype, revision) sentence pairs
  • the sentence analogy evaluation

kelvinguu avatar Oct 08 '17 05:10 kelvinguu

Any idea when this might be released?

ghost avatar Oct 19 '17 14:10 ghost

We're working on a clean codalab worksheet with everything in it, but progress has been slow due to multiple conference deadlines over the last few weeks and the difficulty of getting our LSH to play nicely with Docker.

If you don't want to wait and are willing to use our disorganized upload files, you can use the following temporary links for the datasets (yelp, 1b-words).

To run, make a directory somewhere (say /editor) and dump the data files into a subdirectory (/editor/yelp_data). Also get the word vectors (word vectors) and put them in /editor/word_vectors. You will additionally need to make the output directory (/editor/edit_runs) and a userid mapping in /editor/user_ids.json (which looks like {"YOUR_UID": "YOUR_USERNAME"}).

If you set TEXTMORPH_DATA=/editor and invoke run_docker.py with the config file using 'yelp_data' for the path, it should work (untested outside our cluster, so let us know if this fails).
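The directory layout described above can be sketched in a few lines of Python (a minimal sketch: the /editor paths come from the comment above, while the `prepare_editor_dirs` helper and the default username are made up for illustration):

```python
import json
import os

def prepare_editor_dirs(root="/editor", username="YOUR_USERNAME"):
    """Create the data, word-vector, and output subdirectories, plus the
    user_ids.json uid -> username mapping the run expects."""
    for sub in ("yelp_data", "word_vectors", "edit_runs"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Map your numeric uid to a username, as in {"YOUR_UID": "YOUR_USERNAME"}.
    mapping = {str(os.getuid()): username}
    with open(os.path.join(root, "user_ids.json"), "w") as f:
        json.dump(mapping, f)
    return mapping
```

After running this, dropping the downloaded data files into /editor/yelp_data and the vectors into /editor/word_vectors should match the layout above.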

thashim avatar Oct 19 '17 23:10 thashim

Thanks for the update.

Do these links have the analogy data as well?

lajanugen avatar Oct 19 '17 23:10 lajanugen

The links above are only the training pairs. I wrote a quick script to dump the analogies here.

Let me know if there's anything broken with this one, since I just made the script to dump this.

thashim avatar Oct 20 '17 00:10 thashim

Thank you! The data file in the link has 1300 analogy tuples (making sure nothing's broken).

I would appreciate it if you could comment on the following questions regarding the evaluation:

  • I'm assuming that of the four sentences on each line, the first three were treated as the query and the last one as the answer. Is that how the evaluation was performed?
  • What is the candidate set of sentences used for retrieval?

lajanugen avatar Oct 20 '17 00:10 lajanugen

1300 should be right - 13 categories and 100 sentences each.

In each line we have S1, S2, S3, S4. The analogy is S3:S4::S1:?? (sorry for the unconventional ordering on my part).

There's no candidate sentence set: for our model we generate a sentence and check for perfect word overlap, and for the word vector baseline we just do the lexical, rather than sentence-level, analogy (i.e., the analogy over the one word that changes between S1-S2 and S3-S4).
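The lexical baseline described above can be sketched as follows. This is a minimal illustration with toy vectors, not the authors' code: the helper names are made up, and the real evaluation would use the GloVe vectors from the links above with a full vocabulary.

```python
import numpy as np

def changed_word(a, b):
    """Return the single differing (word_in_a, word_in_b) pair between two
    whitespace-tokenized sentences of equal length, or None otherwise."""
    ta, tb = a.split(), b.split()
    if len(ta) != len(tb):
        return None
    diffs = [(x, y) for x, y in zip(ta, tb) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def solve_analogy(w3, w4, w1, vecs):
    """For S3:S4::S1:??, return the vocabulary word whose vector is closest
    (by cosine similarity) to v(w1) + v(w4) - v(w3)."""
    target = vecs[w1] + vecs[w4] - vecs[w3]
    def cos(w):
        v = vecs[w]
        return float(np.dot(v, target) /
                     (np.linalg.norm(v) * np.linalg.norm(target) + 1e-9))
    return max(vecs, key=cos)
```

The predicted answer sentence would then be S1 with its changed word replaced by the analogy's output word, checked against S2.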

thashim avatar Oct 20 '17 00:10 thashim

Thank you for the clarifications!

I assumed it was a retrieval task since I saw 'top 10' in the results table. After reading your comment and the paper again, I now understand that it is treated as a generation problem.

Thank you again for the data and the quick responses!

lajanugen avatar Oct 20 '17 01:10 lajanugen

No problem. It's kind of confusing as a task, since we're doing sentence analogies but only one word changes. We may end up revising the section / task to be clearer as the paper goes through review.

thashim avatar Oct 20 '17 01:10 thashim

Hi,

Could you please provide the contents of the yelp_dataset_static directory? Or is it just the plain Yelp dataset?

Thanks.

python main.py ../../configs/language_model/default.txt 
TrainingRun configuration:
optim {
  seed = 0
  learning_rate = 0.001
  batch_size = 128
  max_iters = 400000
}
eval {
  num_examples = 32
  big_num_examples = 128
  eval_steps = 500
  big_eval_steps = 5000
  save_steps = 5000
  alive_steps = 30
}
model {
  vocab_size = 10000
  word_dim = 300
  agenda_dim = 100
  hidden_dim = 100
  num_layers = 3
  kl_weight_steps = 50000
  kl_weight_rate = 8
  kl_weight_cap = 1.0
  dci_keep_rate = 0.8
  wvec_path = "glove.6B.300d_yelp.txt"
  type = 0
}
dataset {
  path = "yelp_dataset_static"
}
Loading embeddings from /data/word_vectors/glove.6B.300d_yelp.txt:  95%|##################9 | 9483/10000 [00:01<00:00, 7882.85it/s]
No GPUs detected. Sticking with CPUs.
No checkpoint to reload. Initializing fresh.
[localhost] local: wc -l /data/yelp_dataset_static/train.txt

Fatal error: local() encountered an error (return code 1) while executing 'wc -l /data/yelp_dataset_static/train.txt'
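For what it's worth, the traceback above shows the run dying at the line count because /data/yelp_dataset_static/train.txt does not exist under the data root. A quick pre-flight check along these lines can catch this before launching; the file name is taken from the log above, and the helper itself is hypothetical:

```python
import os

def missing_dataset_files(data_root, dataset="yelp_dataset_static",
                          required=("train.txt",)):
    """Return the required dataset files that are absent under
    data_root/dataset; an empty list means the expected layout is in place."""
    path = os.path.join(data_root, dataset)
    return [f for f in required if not os.path.isfile(os.path.join(path, f))]
```

Running this against the TEXTMORPH_DATA root before `python main.py` would have flagged the missing train.txt immediately.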

leodesigner avatar Oct 30 '17 12:10 leodesigner