stanford-tensorflow-tutorials

Encoding problem in "11_char_rnn_gist.py" example

Open · alanwang93 opened this issue 8 years ago · 1 comment

Hi, I'm reading the code of 11_char_rnn_gist.py, and I found the following problem:

In line 57, the sequence seq is one-hot encoded with depth=len(vocab).

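However, seq is generated with [vocab.index(x) + 1 for x in text if x in vocab], so the character codes range from 1 to len(vocab), and the sequences are then padded with 0. Because tf.one_hot with depth=len(vocab) only represents indices 0 to len(vocab) - 1, the last character in vocab (code len(vocab)) is encoded as an all-zero vector and is effectively neglected, while the PAD symbol is encoded as [1 0 0 0 ...].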
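To make the off-by-one concrete, here is a small sketch. It uses a toy three-character vocabulary and a pure-Python stand-in for tf.one_hot (the helper `one_hot`, the toy `vocab`, and the padding length are illustrative assumptions, not code from the script). The stand-in mimics the relevant behavior of tf.one_hot: an index outside 0..depth-1 yields an all-zero row.

```python
def one_hot(indices, depth):
    """Pure-Python stand-in for tf.one_hot (hypothetical helper).

    An index outside 0..depth-1 produces an all-zero row.
    """
    return [[1 if i == idx else 0 for i in range(depth)] for idx in indices]

vocab = "abc"  # toy vocabulary standing in for the real one
PAD = 0        # padding code used by the script

text = "cab"
# Same indexing scheme as in the script: codes run from 1 to len(vocab).
seq = [vocab.index(x) + 1 for x in text if x in vocab]
seq += [PAD] * 2  # pad to a fixed length of 5 (length chosen for illustration)

encoded = one_hot(seq, depth=len(vocab))
for code, row in zip(seq, encoded):
    print(code, row)
```

In this toy case the last character "c" (code 3) comes out as all zeros, while PAD (code 0) takes the first slot.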

When we run the demo, it seems fine because the last character, }, hardly appears in the dataset. But if we move a to the end of vocab, changing it from

    vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")

to

    vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_bcdefghijklmnopqrstuvwxyz{|}a")

Then the output is complete nonsense (something like T8WrtP sVM -reca5 r, ...), whereas it should work as before.
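Two possible ways to make the encoding consistent are sketched below, again with a toy vocabulary and a pure-Python stand-in for tf.one_hot (the `one_hot` helper and the sample `seq` are assumptions for illustration). I have not checked how the rest of the script consumes the encoding, so the output layer and the sampling code would presumably need to follow whichever convention is chosen.

```python
def one_hot(indices, depth):
    """Pure-Python stand-in for tf.one_hot (hypothetical helper).

    An index outside 0..depth-1 produces an all-zero row.
    """
    return [[1 if i == idx else 0 for i in range(depth)] for idx in indices]

vocab = "abc"          # toy vocabulary
seq = [3, 1, 2, 0, 0]  # codes 1..len(vocab), padded with 0

# Option A: widen the encoding by one so PAD keeps slot 0
# and every character code 1..len(vocab) gets its own slot.
wide = one_hot(seq, depth=len(vocab) + 1)

# Option B: shift codes down by one so characters use slots
# 0..len(vocab)-1 and PAD becomes -1, which encodes as all zeros.
shifted = one_hot([c - 1 for c in seq], depth=len(vocab))

print(wide[0], wide[3])
print(shifted[0], shifted[3])
```

Option A keeps PAD as a distinct symbol, while option B keeps the depth at len(vocab) and treats padding as "no character".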

alanwang93 · Apr 11 '17 14:04

@alanwang93 Thanks Alan. Let me look into it.

chiphuyen · Jul 11 '17 17:07