tensorflow-seq2seq-tutorials

Increasing max sequence length and vocab size

Open · kuhanw opened this issue 7 years ago • 1 comment

Dear experts,

Thank you for this excellent tutorial. This is one of the first seq2seq tutorials I have read that has really helped me to internalize some of the concepts (and that I could get running off the ground without too much trouble!).

I had a question. I am currently working on the first tutorial notebook, "1-seq2seq", and trying to understand the relationship between model performance, sequence length, and vocab size. In a real-world example it may be possible to control sequence length by limiting sentences to certain sizes, but it would surely not be possible to reduce the vocabulary significantly below a certain threshold.

Indeed, at the end of the tutorial it is suggested to play around with these parameters and observe how training speed and quality degrade.

My question is: how would I best translate the toy model, which predicts random sequences of length 2-8 over a vocab of 1-10, to a more realistic scenario where the vocab can contain thousands of terms?

Currently I am simply extending the problem to predicting random sequences of numbers between 2 and 5000 instead of 2 and 10, and playing around with the hyperparameters to figure out which ones help improve the quality of my results.
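
For concreteness, here is roughly what I mean by extending the toy data generator. This is a minimal numpy sketch; the function name and defaults are mine, not the notebook's helper:

```python
import numpy as np

def random_batch(batch_size=64, length_from=3, length_to=8,
                 vocab_lower=2, vocab_upper=5000):
    """Generate a batch of random integer sequences for the toy copy task.

    Symbols 0 and 1 are kept free for PAD/EOS, as in the tutorial,
    so real tokens start at vocab_lower=2.
    """
    lengths = np.random.randint(length_from, length_to + 1, size=batch_size)
    return [np.random.randint(vocab_lower, vocab_upper, size=l).tolist()
            for l in lengths]

batch = random_batch()
print(batch[0])  # e.g. [4821, 17, 3905, 2764]
```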

Is there any intuition for understanding how the embedding size and the number of encoder units affect model quality? I already noticed that batch size directly affects quality as the sequence length increases.

Thank you!

Kuhan

kuhanw · Aug 18 '17 14:08

My goal with the tutorials was to keep the full training runtime under a few minutes, so that users could play with them more or less interactively. To scale up to a real problem, you simply get a fast GPU with lots of memory and extend the vocabulary and sequence lengths.

Word-level language tasks usually have no problem with output vocabs of around 30-40k words (some papers go to 100k and beyond). Computationally, the bottleneck is usually the output softmax layer, where sampled softmax often helps. The input vocabulary usually isn't a limiting factor, since the embedding lookup's memory consumption is linear in the vocabulary size and is usually done on the CPU anyway.
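
As a rough sketch of what sampled softmax looks like in TF 1.x (the variable names here are illustrative, not taken from the tutorial code):

```python
import tensorflow as tf

vocab_size = 40000    # output vocabulary
hidden_units = 256    # decoder state size
num_sampled = 512     # negative classes sampled per example

# Projection variables shared between training (sampled) and inference (full softmax).
proj_w = tf.get_variable("proj_w", [vocab_size, hidden_units])
proj_b = tf.get_variable("proj_b", [vocab_size])

# decoder_outputs: [batch * time, hidden_units]; targets: [batch * time]
decoder_outputs = tf.placeholder(tf.float32, [None, hidden_units])
targets = tf.placeholder(tf.int64, [None])

# Sampled softmax only evaluates num_sampled classes per example during training,
# avoiding the full [hidden_units x vocab_size] projection at every step.
train_loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=proj_w,
        biases=proj_b,
        labels=tf.expand_dims(targets, -1),  # shape [batch * time, 1]
        inputs=decoder_outputs,
        num_sampled=num_sampled,
        num_classes=vocab_size))

# At inference time you still project onto the full vocabulary.
logits = tf.matmul(decoder_outputs, proj_w, transpose_b=True) + proj_b
```

The point is just that the expensive full-vocabulary matmul is avoided during training; decoding still uses the complete projection.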

Besides sampled softmax during training, a couple of things could help:

  • consider character- or byte-level models; they have small vocabs (see the sketch below).
  • consider transforming long-tail words into subword units.
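
For the character/byte-level option, the preprocessing is tiny. Something like this (plain Python, names are mine) turns any string into a sequence over a vocab of at most 256 symbols plus specials:

```python
PAD, EOS = 0, 1
OFFSET = 2  # reserve ids 0 and 1 for PAD/EOS; shift raw byte values up by 2

def encode(text):
    """Map a string to a list of byte ids; vocab size is 256 + OFFSET."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def decode(ids):
    """Inverse of encode; ignores PAD/EOS ids."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return raw.decode("utf-8", errors="replace")

print(encode("hi"))          # [106, 107, 1]
print(decode(encode("hi")))  # hi
```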

BTW, be wary of borrowing too much intuition from playing with hyperparameters on toy examples when scaling to real-world problems, beyond simple "how-it-works" intuitions. I'm still not sure how well such intuition transfers.

ematvey · Aug 20 '17 11:08