generating-reviews-discovering-sentiment

Can it work on Chinese? How can I train it on my Chinese text dataset?

Open suparek opened this issue 7 years ago • 6 comments

Hoping for a reply.

suparek avatar Sep 25 '17 05:09 suparek

Check issue #30.

alaakh42 avatar Mar 25 '18 12:03 alaakh42

I hope you have made some progress on this topic; I am also interested.

As far as I know, the key point for the mLSTM is that the model is trained on UTF-8 encoded byte sequences.

If you look at the preprocess() function in utils.py:

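import html

def preprocess(text, front_pad='\n ', end_pad=' '):
    text = html.unescape(text)              # decode HTML entities such as &amp;
    text = text.replace('\n', ' ').strip()  # flatten newlines, trim outer whitespace
    text = front_pad+text+end_pad           # add the front and end padding
    text = text.encode()                    # str -> bytes (UTF-8 by default in Python 3)
    return text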

So, if you encode the Chinese characters as UTF-8, you should be able to feed the resulting byte sequence into the model for training.

ttt = u'年集中发力的领域'

ttt
Out[50]: '年集中发力的领域'

type(ttt)
Out[51]: str

encoded_ttt = ttt.encode("utf-8")
encoded_ttt
Out[53]: b'\xe5\xb9\xb4\xe9\x9b\x86\xe4\xb8\xad\xe5\x8f\x91\xe5\x8a\x9b\xe7\x9a\x84\xe9\xa2\x86\xe5\x9f\x9f'
for word in encoded_ttt.decode("utf-8"):
    print(word)
年
集
中
发
力
的
领
域
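
To make the byte-level view concrete, here is a minimal sketch that runs the same sample string through the preprocess() steps quoted above and looks at the resulting bytes. It assumes Python 3 and assumes the model consumes the integer value of each byte; the variable names `sample`, `encoded`, and `byte_ids` are only for illustration.

```python
import html

def preprocess(text, front_pad='\n ', end_pad=' '):
    # Same steps as the preprocess() quoted above from utils.py.
    text = html.unescape(text)
    text = text.replace('\n', ' ').strip()
    text = front_pad + text + end_pad
    return text.encode()  # str.encode() defaults to UTF-8 in Python 3

sample = '年集中发力的领域'  # 8 Chinese characters
encoded = preprocess(sample)

# Iterating over a bytes object in Python 3 yields integers in 0..255.
byte_ids = list(encoded)

print(len(sample))   # 8 characters
print(len(encoded))  # 27 bytes: 8 * 3 for the text plus 3 padding bytes
print(byte_ids[:5])  # [10, 32, 229, 185, 180]
```

Each of these Chinese characters takes three bytes in UTF-8, so the sequence the model sees is about three times longer than the character count, while every value still falls in the 0..255 range.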

gitathrun avatar Apr 05 '18 09:04 gitathrun

Hi @gitathrun, are you using Python 2 or 3?

jonny-d avatar Apr 05 '18 16:04 jonny-d

@jonnykira Python 3.5

gitathrun avatar Apr 05 '18 16:04 gitathrun

Cool, that should work then. For Python 2 you would also have to convert the UTF-8 encoded string to a bytearray object within preprocess().
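
A rough sketch of that conversion, in case it helps: iterating over an encoded string in Python 2 yields one-character strings rather than integers, whereas a bytearray yields integers in both Python 2 and 3. The helper `to_byte_ids` below is hypothetical and not part of the repository; it only covers the encoding step, not the rest of preprocess().

```python
def to_byte_ids(text):
    # Hypothetical helper, not part of the repository.
    # `text` is expected to be a unicode string (u'...' in Python 2, str in Python 3).
    utf8 = text.encode("utf-8")
    # bytearray yields integer byte values under both Python 2 and Python 3.
    return list(bytearray(utf8))

print(to_byte_ids(u'年'))  # [229, 185, 180]
```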

Out of curiosity, have you successfully trained a model on Chinese data?

jonny-d avatar Apr 05 '18 18:04 jonny-d

@jonnykira Firstly, thanks for your mLSTM code; it is very sleek, well-formed TensorFlow code. No, I have not trained one yet, but I plan to train a Chinese model sometime in the future.

gitathrun avatar Apr 05 '18 21:04 gitathrun