char-rnn
Add utf8 character support
I tested this by feeding the model a Chinese novel, and it produced some interesting results.
Wow, this is just what I'm going to do, thank you.
@VitoVan You can also try the original code on byte input without any modification.
In my experiment, the trained LSTM model can actually learn the UTF-8 encoding of Chinese characters. I didn't see any broken code points in the generated text.
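For readers unfamiliar with the distinction, here is a minimal Lua sketch (not the code from this pull request; the sample string and the UTF-8 pattern are just for illustration) contrasting byte-level input, which the original code uses, with UTF-8 character-level input, which this PR adds:

```lua
-- Minimal sketch (not this PR's code) contrasting byte-level and UTF-8
-- character-level tokenization in Lua 5.1 / LuaJIT, which Torch uses.
-- The pattern matches one UTF-8 lead byte plus its continuation bytes.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

local text = "你好, world"

-- Byte-level: every byte is a symbol, so each Chinese character becomes
-- three symbols and the model must learn to emit valid byte sequences.
local byte_tokens = {}
for i = 1, #text do
  byte_tokens[#byte_tokens + 1] = text:byte(i)
end
print(#byte_tokens)  --> 13 (bytes)

-- Character-level: each code point is one symbol, so the model can never
-- emit a broken code point, at the cost of a much larger vocabulary.
local char_tokens = {}
for ch in text:gmatch(UTF8_CHAR) do
  char_tokens[#char_tokens + 1] = ch
end
print(#char_tokens)  --> 9 (characters)
```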
------ADDED 2015-6-12 16:43:56------ Sorry, I'm new to Lua, so I may have a stupid question:
@5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?
I assume this code is backward compatible with previous datasets?
I think so, haven't tested it.
This patch increases the vocab size a lot. I have a 16 MB dataset. The original code generates a vocab of size 230, but this code generates a vocab of size 180128, which needs 241 GB of memory to load.
I just realized that my dataset is not UTF-8. But this may break support for input streams other than text. And the vocab generated from a UTF-8 dataset is also bigger than the original size.
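As a rough way to see this blow-up for yourself, here is a hedged sketch (not code from char-rnn; the file path is just the repo's bundled example dataset) that counts the distinct symbols a file yields under byte-level versus UTF-8 character-level tokenization. On data that is not actually UTF-8, the UTF-8 grouping can produce many spurious multi-byte "characters", which is one plausible way to end up with a vocab as large as 180128:

```lua
-- Hedged sketch (not from char-rnn itself): compare the vocabulary sizes you
-- get from byte-level vs. UTF-8 character-level tokenization of one file.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

local function vocab_sizes(path)
  local f = assert(io.open(path, "rb"))
  local data = f:read("*a")
  f:close()

  local byte_vocab, char_vocab = {}, {}
  for i = 1, #data do
    byte_vocab[data:byte(i)] = true        -- can never exceed 256 entries
  end
  for ch in data:gmatch(UTF8_CHAR) do
    char_vocab[ch] = true                  -- one entry per distinct "character"
  end

  local nb, nc = 0, 0
  for _ in pairs(byte_vocab) do nb = nb + 1 end
  for _ in pairs(char_vocab) do nc = nc + 1 end
  return nb, nc
end

-- Example path from the repo's bundled dataset; substitute your own file.
print(vocab_sizes("data/tinyshakespeare/input.txt"))
```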
> @5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?
Presumably the advantage is that the model doesn't have to spend effort on learning how to construct Unicode code points, and won't ever write invalid Unicode code points.
But the increase in vocab size will vastly increase the number of parameters in the fully connected Linear layers, as far as I can see. Based on my calculations at https://www.reddit.com/r/MachineLearning/comments/3ejizl/karpathy_charrnn_doubt/ctfndk6 , the number of weights is:
4 * rnn_size * ( vocab_size + rnn_size + 2 ) + (rnn_size + 1) * vocab_size
e.g., if rnn_size is 128 and vocab_size is, say, 96, then the number of weights is about 128K, which takes 512 KB of memory (4 bytes per float);
but if vocab_size is 180,128, then the number of weights is about 115M, which takes 460 MB of memory.
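To make the comparison concrete, here is a tiny Lua check of that formula (assuming a single-layer LSTM as in the linked reddit comment; this is not code from the repo):

```lua
-- Plugging the two vocab sizes from this thread into the formula above
-- (single-layer LSTM; 4 bytes per float).
local function lstm_weights(rnn_size, vocab_size)
  return 4 * rnn_size * (vocab_size + rnn_size + 2)
       + (rnn_size + 1) * vocab_size
end

print(lstm_weights(128, 96))      --> 128096     (~128K weights, ~512 KB)
print(lstm_weights(128, 180128))  --> 115528608  (~115M weights, ~460 MB)
```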
Hmmm, but actually, I don't remember there being that many Chinese characters. I think there are only 10 to 20 thousand in normal usage?
What is the status on this?