tensorflow-CWS-LSTM
What is a windowed LSTM model?
Q1: What is a windowed LSTM model? Is it the model that this paper presented?
Q2: How do you transform raw data like 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ( 附 图片 1 张 ) into the form of your training data sets?
For Q1: it's what this paper described. "Windowed" means you put the surrounding characters together as the input vector. For example, if you have the sentence '开心啊' -> [1,2,3] (indices into the vocabulary table), you transform it into [[B,1,2],[1,2,3],[2,3,E]] as input (B and E are padding indices). Finally this becomes [[embed(B)+embed(1)+embed(2)],[embed(1)+embed(2)+embed(3)],[embed(2)+embed(3)+embed(E)]].
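To make the windowing concrete, here is a minimal sketch of that transformation. The embedding table and indices are made up for illustration (the real model would look embeddings up inside TensorFlow); `+` here means concatenation of the embedding vectors.

```python
import numpy as np

# Hypothetical tiny setup: indices 1..3 are real characters,
# B and E are special padding indices used at the sentence edges.
B, E = 0, 4
embed = np.random.rand(5, 8)  # 5 vocabulary entries, embedding size 8

def window_inputs(indices, window=3):
    """Pad the sequence with B/E, then emit one window per character,
    concatenating the embeddings of each window into one input vector."""
    padded = [B] + list(indices) + [E]
    windows = [padded[i:i + window] for i in range(len(indices))]
    return np.array([np.concatenate([embed[j] for j in w]) for w in windows])

X = window_inputs([1, 2, 3])  # '开心啊' -> [[B,1,2],[1,2,3],[2,3,E]]
print(X.shape)  # (3, 24): one 3*8-dimensional vector per character
```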
For Q2: you can search online for 'CRF中文分词' (CRF Chinese word segmentation) and refer to an article on it. Here's the code you can use:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 4-tag scheme for character tagging: B(Begin), M(Middle), E(End), S(Single)
import codecs
import sys

def character_tagging(input_file, output_file):
    input_data = codecs.open(input_file, 'r', 'utf-8')
    output_data = codecs.open(output_file, 'w', 'utf-8')
    for line in input_data.readlines():
        word_list = line.strip().split()
        for word in word_list:
            if len(word) == 1:
                output_data.write(word + "\tS\n")
            else:
                output_data.write(word[0] + "\tB\n")
                for w in word[1:len(word) - 1]:
                    output_data.write(w + "\tM\n")
                output_data.write(word[len(word) - 1] + "\tE\n")
        output_data.write("\n")  # blank line separates sentences
    input_data.close()
    output_data.close()

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: python " + sys.argv[0] + " input output")
        sys.exit(-1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    character_tagging(input_file, output_file)
```
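For reference, here is what that tagging loop produces on a short segmented line (an in-memory sketch of the same logic, without the file I/O):

```python
def tag_line(line):
    """Tag each character of a space-segmented line with B/M/E/S."""
    tags = []
    for word in line.strip().split():
        if len(word) == 1:
            tags.append((word, "S"))
        else:
            tags.append((word[0], "B"))
            for w in word[1:-1]:
                tags.append((w, "M"))
            tags.append((word[-1], "E"))
    return tags

print(tag_line("迈向 充满 希望"))
# [('迈', 'B'), ('向', 'E'), ('充', 'B'), ('满', 'E'), ('希', 'B'), ('望', 'E')]
```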
Thank you for your help! I tested it and there are some bugs. Using the training data you provided, I ran `python model_LSTM.py --train=data/trainSeg.txt --model=model --iters=50`.
tensorflow version: 0.11.0rc1, numpy version: 1.11.2, gensim version: 0.13.3
Here is the log:
embedding:data/char2vec_50.model
max len 150
stddev:0.100000
hi
Vocab Size: 7008
Training Samples: 100
Valid Samples 100
Layers:1
Hidden Size: 50
Embedding size: 50
Window Size 4
Norm 7
num of batches 5
process:0.000 ErrorRate: 41.291848 cost 0.685807 132 wps
num of batches 0
Epoch: 1 Train accuracy: 0.606
Traceback (most recent call last):
File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 367, in
Try lstm_build instead. I forgot to delete the old file; it doesn't work.
Also, I've updated the command in the README; you should use
python model_lstm_build.py --train=data/trainSeg.txt --model=model --iters=50 instead.
It's because I haven't touched this repo for a long time, and many things have changed...
There are two tiny bugs in lstm_build.py that interrupt it:
1. An indentation error at line 108.
2. An error in the lazy_property decorator at line 84.
Maybe you could fix them the next time you push.
I fixed them in my commit last night; you can try pulling it. Some version control issues had kept me from uploading the bug-free code...
Thanks. I infer that you used the word2vec model from the gensim package to produce the .model file. I'd like to know the parameters you passed to this class:
class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)
If it's convenient for you to share your code, it would be very helpful.
Thank you very much!