stanford-tensorflow-tutorials
Encoding problem in "11_char_rnn_gist.py" example
Hi, I'm reading the code of 11_char_rnn_gist.py, and I found the following problem:
In line 57, we encode the sequence seq as one-hot vectors with depth=len(vocab).
However, seq is generated with [vocab.index(x) + 1 for x in text if x in vocab], so the character codes range from 1 to len(vocab), and the sequences are then padded with 0. Since tf.one_hot with depth=len(vocab) only covers indices 0 to len(vocab) - 1, the last character in vocab (code len(vocab)) is out of range and is encoded as an all-zero vector, while the PAD symbol is encoded as [1 0 0 0 ...].
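To make the off-by-one concrete, here is a minimal sketch on a toy vocabulary. It assumes a small NumPy helper, one_hot, written here only as a stand-in for tf.one_hot (out-of-range indices give all-zero rows); the three-character vocab and the text are made up for illustration and are not from the example script.

```python
import numpy as np

def one_hot(indices, depth):
    """NumPy stand-in for tf.one_hot: out-of-range indices give all-zero rows."""
    indices = np.asarray(indices)
    return (indices[..., None] == np.arange(depth)).astype(np.float32)

vocab = "abc"   # toy vocabulary (assumption, not the real one)
text = "cab"

# Same encoding scheme as in the script: codes start at 1, 0 is reserved for PAD.
seq = [vocab.index(x) + 1 for x in text if x in vocab]   # [3, 1, 2]
seq += [0] * 2                                           # pad with two PAD symbols

encoded = one_hot(seq, len(vocab))
print(encoded)
# Row 0 ('c', code 3) is all zeros: the last character is lost.
# Rows 3 and 4 (PAD, code 0) are [1, 0, 0]: PAD occupies a real class slot.
```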
When we run the demo, it seems OK because the last character, }, hardly ever appears in our dataset.
If we move a to the end of vocab, i.e. change vocab from
vocab = (
" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"\\^_abcdefghijklmnopqrstuvwxyz{|}")
to
vocab = (
" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"\\^_bcdefghijklmnopqrstuvwxyz{|}a")
Then the outputs are complete nonsense (something like T8WrtP sVM -reca5 r, ...), while it should work as before.
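One possible fix, sketched below under the same assumptions as before (a NumPy stand-in for tf.one_hot and a made-up toy vocabulary), is to shift the codes back by one before encoding. PAD then becomes -1, which is out of range and maps to an all-zero vector, and every character gets its own slot. This is only a suggestion; the rest of the script may rely on the current encoding in ways not checked here.

```python
import numpy as np

def one_hot(indices, depth):
    """NumPy stand-in for tf.one_hot: out-of-range indices give all-zero rows."""
    indices = np.asarray(indices)
    return (indices[..., None] == np.arange(depth)).astype(np.float32)

vocab = "abc"                      # toy vocabulary (assumption)
seq = np.array([3, 1, 2, 0, 0])    # 'c', 'a', 'b', PAD, PAD with codes starting at 1

# Shift codes so characters use 0..len(vocab)-1 and PAD becomes -1.
fixed = one_hot(seq - 1, len(vocab))
print(fixed)
# 'c' is now [0, 0, 1] and PAD rows are all zeros.
```

With tf.one_hot the equivalent call would be tf.one_hot(seq - 1, len(vocab)). An alternative would be to keep the codes as they are and use depth=len(vocab) + 1, at the cost of an extra PAD class.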
@alanwang93 Thanks Alan. Let me look into it.