char-rnn
Add utf8 character support
I tested this by feeding the model a Chinese novel, and it produced some interesting results.
Wow, this is just what I'm going to do, thank you.
@VitoVan You can also try the original code on byte input without any modification.
In my experiment, the trained LSTM model can actually learn the UTF-8 encoding of Chinese characters. I didn't see any broken code points in the generated text.
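For readers unfamiliar with the distinction, here is a minimal Lua sketch (not the code from this pull request; the sample string and the UTF-8 pattern are just for illustration) contrasting byte-level input, which the original code uses, with UTF-8 character-level input, which this PR adds:

```lua
-- Minimal sketch (not this PR's code) contrasting byte-level and UTF-8
-- character-level tokenization in Lua 5.1 / LuaJIT, which Torch uses.
-- The pattern matches one UTF-8 lead byte plus its continuation bytes.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

local text = "你好, world"

-- Byte-level: every byte is a symbol, so each Chinese character becomes
-- three symbols and the model must learn to emit valid byte sequences.
local byte_tokens = {}
for i = 1, #text do
  byte_tokens[#byte_tokens + 1] = text:byte(i)
end
print(#byte_tokens)  --> 13 (bytes)

-- Character-level: each code point is one symbol, so the model can never
-- emit a broken code point, at the cost of a much larger vocabulary.
local char_tokens = {}
for ch in text:gmatch(UTF8_CHAR) do
  char_tokens[#char_tokens + 1] = ch
end
print(#char_tokens)  --> 9 (characters)
```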
------ADDED 2015-6-12 16:43:56------ Sorry, I'm new to Lua, so I may have a stupid question:
@5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?
I assume this code is backward compatible with previous datasets?
I think so, haven't tested it.
This patch increases the vocab size a lot. I have a 16 MB dataset. The original code generates a vocab of size 230, but this code generates a vocab of size 180128, which needs 241 GB of memory to load.
I just realized that my dataset is not UTF-8. But this may break support for input streams other than text. And the vocab generated from a UTF-8 dataset is also bigger than the original size.
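As a rough way to see this blow-up for yourself, here is a hedged sketch (not code from char-rnn; the file path is just the repo's bundled example dataset) that counts the distinct symbols a file yields under byte-level versus UTF-8 character-level tokenization. On data that is not actually UTF-8, the UTF-8 grouping can produce many spurious multi-byte "characters", which is one plausible way to end up with a vocab as large as 180128:

```lua
-- Hedged sketch (not from char-rnn itself): compare the vocabulary sizes you
-- get from byte-level vs. UTF-8 character-level tokenization of one file.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

local function vocab_sizes(path)
  local f = assert(io.open(path, "rb"))
  local data = f:read("*a")
  f:close()

  local byte_vocab, char_vocab = {}, {}
  for i = 1, #data do
    byte_vocab[data:byte(i)] = true        -- can never exceed 256 entries
  end
  for ch in data:gmatch(UTF8_CHAR) do
    char_vocab[ch] = true                  -- one entry per distinct "character"
  end

  local nb, nc = 0, 0
  for _ in pairs(byte_vocab) do nb = nb + 1 end
  for _ in pairs(char_vocab) do nc = nc + 1 end
  return nb, nc
end

-- Example path from the repo's bundled dataset; substitute your own file.
print(vocab_sizes("data/tinyshakespeare/input.txt"))
```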
> @5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?
Presumably the advantage is that the model doesn't have to spend effort on learning how to construct Unicode code points, and won't ever write invalid Unicode code points.
But the increase in vocab size will vastly increase the number of parameters in the fully connected Linear layers, as far as I can see. Based on my calculations at https://www.reddit.com/r/MachineLearning/comments/3ejizl/karpathy_charrnn_doubt/ctfndk6 , the number of weights is:
4 * rnn_size * ( vocab_size + rnn_size + 2 ) + (rnn_size + 1) * vocab_size
e.g., if rnn_size is 128 and vocab_size is, say, 96, then the number of weights is about 128K, which takes 512 KB of memory (4 bytes per float);
but if vocab_size is 180,128, then the number of weights is about 115M, which takes 460 MB of memory.
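To make the comparison concrete, here is a tiny Lua check of that formula (assuming a single-layer LSTM as in the linked reddit comment; this is not code from the repo):

```lua
-- Plugging the two vocab sizes from this thread into the formula above
-- (single-layer LSTM; 4 bytes per float).
local function lstm_weights(rnn_size, vocab_size)
  return 4 * rnn_size * (vocab_size + rnn_size + 2)
       + (rnn_size + 1) * vocab_size
end

print(lstm_weights(128, 96))      --> 128096     (~128K weights, ~512 KB)
print(lstm_weights(128, 180128))  --> 115528608  (~115M weights, ~460 MB)
```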
Hmmm, but actually, I don't remember there being that many Chinese characters. I think there are only 10 to 20 thousand in normal usage?
What is the status on this?