
What is the windowed LSTM model?

Open lengyuewuyazui opened this issue 8 years ago • 6 comments

Q1: What is the windowed LSTM model? Is it the model presented in this paper?

Q2: How do I transform raw data like 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ( 附 图片 1 张 ) into the format of your training data sets?

lengyuewuyazui avatar Nov 23 '16 06:11 lengyuewuyazui

For Q1: it's what is described here. "Windowed" means that you put the surrounding characters together as the input vector. For example, if you have the sentence '开心啊' -> [1, 2, 3] (the character indices in the vocabulary table), you transform the sentence into [[B,1,2], [1,2,3], [2,3,E]] as input (B and E are padding indices). Finally this turns into [[embed(B)+embed(1)+embed(2)], [embed(1)+embed(2)+embed(3)], [embed(2)+embed(3)+embed(E)]].
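The windowing step above can be sketched like this (a minimal illustration, not the repo's actual code; `B` and `E` stand for the begin/end padding indices, chosen arbitrarily here as 0 and 4):

```python
def make_windows(ids, B, E):
    """Turn a list of character indices into size-3 windows,
    padding with B at the start and E at the end."""
    padded = [B] + ids + [E]
    return [padded[i:i + 3] for i in range(len(ids))]

# '开心啊' -> [1, 2, 3]; with B=0 and E=4 this yields the windows above:
print(make_windows([1, 2, 3], B=0, E=4))  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Each window is then mapped to the concatenation of its three character embeddings before being fed to the LSTM.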

For Q2: you can search online for 'CRF中文分词' (CRF-based Chinese word segmentation) and refer to this article. Here's the code you can use:

#!/usr/bin/env python
#-*-coding:utf-8-*-

#4-tags for character tagging: B(Begin),E(End),M(Middle),S(Single)

import codecs
import sys

def character_tagging(input_file, output_file):
	input_data = codecs.open(input_file, 'r', 'utf-8')
	output_data = codecs.open(output_file, 'w', 'utf-8')
	for line in input_data.readlines():
		word_list = line.strip().split()
		for word in word_list:
			if len(word) == 1:
				output_data.write(word + "\tS\n")
			else:
				output_data.write(word[0] + "\tB\n")
				for w in word[1:len(word)-1]:
					output_data.write(w + "\tM\n")
				output_data.write(word[len(word)-1] + "\tE\n")
		output_data.write("\n")
	input_data.close()
	output_data.close()

if __name__ == '__main__':
	if len(sys.argv) != 3:
		print("Usage: python " + sys.argv[0] + " input output")
		sys.exit(-1)
	input_file = sys.argv[1]
	output_file = sys.argv[2]
	character_tagging(input_file, output_file)
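For example, a segmented line like `迈向 充满 的` becomes one character per output line with its B/M/E/S tag. The per-word rules can be checked in memory with a small illustration (same tagging logic as `character_tagging` above, just returning pairs instead of writing a file):

```python
def tag_word(word):
    # B/M/E/S rules: single-character words get S; longer words get
    # B on the first character, M on the middle ones, E on the last.
    if len(word) == 1:
        return [(word, "S")]
    return ([(word[0], "B")]
            + [(ch, "M") for ch in word[1:-1]]
            + [(word[-1], "E")])

print(tag_word(u"迈向"))        # [('迈', 'B'), ('向', 'E')]
print(tag_word(u"的"))          # [('的', 'S')]
print(tag_word(u"一九九八年"))  # B, M, M, M, E
```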

elvinpoon avatar Nov 23 '16 09:11 elvinpoon

Thank you for your help! I tested it and there are some bugs. Using the training data you provided, I ran python model_LSTM.py --train=data/trainSeg.txt --model=model --iters=50

tensorflow version: 0.11.0rc1, numpy version: 1.11.2, gensim version: 0.13.3

Here is the log:

```
embedding: data/char2vec_50.model
max len 150
stddev: 0.100000
hi
Vocab Size: 7008
Training Samples: 100
Valid Samples 100
Layers: 1
Hidden Size: 50
Embedding size: 50
Window Size 4
Norm 7
num of batches 5
process:0.000 ErrorRate: 41.291848 cost 0.685807 132 wps
num of batches 0
Epoch: 1 Train accuracy: 0.606
Traceback (most recent call last):
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 367, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 360, in main
    valid_accuracy = run_epoch(session, m, valid_data, tf.no_op(), cmodel, verbose=False)
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/utils.py", line 146, in batch_iter
    x, y, l, size = generate_batches(data, batch_size, num_steps, char_embedding, num_class, left, right)
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/utils.py", line 213, in generate_batches
    x[n_batch][batch_cnt][pos] = new_x
IndexError: index 0 is out of bounds for axis 0 with size 0
```

lengyuewuyazui avatar Nov 23 '16 14:11 lengyuewuyazui

Try lstm_build. I forgot to delete the old one, which doesn't work. I've also updated the command in the README: you should use python model_lstm_build.py --train=data/trainSeg.txt --model=model --iters=50 instead. I haven't touched this repo for a long time, so many things have changed...

elvinpoon avatar Nov 23 '16 14:11 elvinpoon

There are two tiny bugs in lstm_build.py that interrupted it:
1. an indentation error at line 108;
2. an error in the lazy_property decorator at line 84.
Maybe you could fix them the next time you push.

lengyuewuyazui avatar Nov 23 '16 15:11 lengyuewuyazui

I've fixed it in my new commit last night; you can try pulling it. Some version control issues kept me from uploading the bug-free code...

elvinpoon avatar Nov 24 '16 02:11 elvinpoon

Thanks! I infer that you used the Word2Vec model from the gensim package to produce the .model file. I'd like to know what parameters you passed to this class: class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)
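For reference, training a character-level model with gensim might look like the sketch below. The corpus preparation is plain Python; the Word2Vec call itself (shown in a comment) and all its parameters are assumptions — only size=50 is suggested by the "Embedding size: 50" and char2vec_50.model entries in the log:

```python
# Whitespace-segmented corpus lines; a char2vec model treats every
# character (not every word) as a token, so strip the spaces and
# split each line into its individual characters.
lines = [u"迈向 充满 希望 的 新 世纪"]
sentences = [list(line.replace(u" ", u"")) for line in lines]
print(sentences[0])

# With gensim installed (0.13.x at the time), training might then be:
# from gensim.models.word2vec import Word2Vec
# model = Word2Vec(sentences, size=50, window=5, min_count=1, workers=3)
# model.save("char2vec_50.model")
```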

If it's convenient for you to share your code, it would be very helpful.

Thank you very much!

lengyuewuyazui avatar Nov 24 '16 12:11 lengyuewuyazui